[00:02:23] !jouncebot next [00:02:23] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [00:02:40] jouncebot: next [00:02:40] In 5 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600) [00:02:41] In 5 hour(s) and 57 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600) [00:03:10] If there are no complaints, I'm going to undeploy a mitigation for search-traffic in mediawiki-config [00:03:29] (there is now a requestctl rule addressing the issue, and followup heuristics in cirrus to be deployed next week) [00:04:04] (03PS2) 10Ebernhardson: Revert "cirrus: Send more_like traffic to eqiad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191168 (https://phabricator.wikimedia.org/T405394) [00:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:17] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212532 (10phaultfinder) [00:05:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191168 (https://phabricator.wikimedia.org/T405394) (owner: 10Ebernhardson) [00:07:03] (03Merged) 10jenkins-bot: Revert "cirrus: Send more_like traffic to eqiad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191168 (https://phabricator.wikimedia.org/T405394) (owner: 10Ebernhardson) [00:07:41] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1191168|Revert "cirrus: Send more_like traffic to eqiad" (T405394)]] [00:07:48] T405394: Point 
cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394 [00:08:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191197 [00:08:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191197 (owner: 10TrainBranchBot) [00:11:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1220.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:12:05] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1191168|Revert "cirrus: Send more_like traffic to eqiad" (T405394)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:12:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1217.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:12:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1215.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:12:45] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [00:13:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1219.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:13:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1224.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:13:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1218.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:14:25] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host 
an-worker1223.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:15:37] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1225.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:15:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1226.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:17:30] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191168|Revert "cirrus: Send more_like traffic to eqiad" (T405394)]] (duration: 09m 48s) [00:17:36] T405394: Point cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394 [00:19:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1221.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:19:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1227.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:21:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1222.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:21:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1228.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:24:43] (03PS1) 10RLazarus: wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) [00:25:54] (03PS2) 10RLazarus: wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) [00:33:19] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] 
(wmf/next) - 10https://gerrit.wikimedia.org/r/1191197 (owner: 10TrainBranchBot) [00:34:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212563 (10phaultfinder) [00:39:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212569 (10phaultfinder) [00:39:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11212570 (10phaultfinder) [00:40:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1224.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:41:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1223.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:41:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1225.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:42:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1226.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:42:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1229.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1230.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1232.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED [00:44:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212571 (10phaultfinder) [00:45:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1231.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:46:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1227.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:48:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1228.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:48:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1216.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:54:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212573 (10phaultfinder) [01:17:12] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212619 (10Jclark-ctr) [01:20:20] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye [01:20:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1210.eqiad.wmnet with OS bullseye [01:20:30] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1210.eqiad.wmnet with OS bullseye [01:20:33] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - 
https://phabricator.wikimedia.org/T399964#11212623 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1210.eqiad.wmnet with OS bullseye executed with errors: - an-worker... [01:29:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye [01:29:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1210.eqiad.wmnet with OS bullseye [01:29:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212627 (10phaultfinder) [01:30:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212628 (10phaultfinder) [01:31:12] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [01:33:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:36:57] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212635 (10Jclark-ctr) @BTullis I’ve only set up RAID1 for an-worker1210. I wanted to get one running before the end of the night, but I’m not having any luck. Could you help me wit... 
[01:41:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:50] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [01:45:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11212636 (10phaultfinder) [01:52:44] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11212641 (10Krinkle) >>! In T122097#2657531, @BBlack wrote: > This has been idle a while, but it's still probably a good... 
[01:54:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212646 (10phaultfinder) [01:54:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212647 (10phaultfinder) [01:59:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11212651 (10phaultfinder) [02:09:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11212654 (10phaultfinder) [02:12:17] FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:16:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:17:17] FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:19:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 
- https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [02:19:54] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:24:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212655 (10phaultfinder) [02:24:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212656 (10phaultfinder) [02:27:17] FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:37:17] FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:49:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11212660 (10phaultfinder) [02:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:30] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:59:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212661 (10phaultfinder) [03:00:20] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:03:30] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:03:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:06:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:20:20] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:21:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:21:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:29:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212667 (10phaultfinder) [03:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212674 (10phaultfinder) [04:26:28] (03CR) 10Ladsgroup: "Yup. I can take care of it if you focus on getting mw code adopted." 
[puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe) [04:44:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:44:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212699 (10phaultfinder) [04:45:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:46:19] Sorta here [04:46:25] In airport... [04:49:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:50:34] okay, got the laptop up. Looking... 
[04:50:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:53:15] mw errors has jumped but I can't see any jump in logstash :/ https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&from=now-6h&to=now&timezone=utc&var-site=$__all&var-deployment=mw-web&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki&refresh=1m&viewPanel=panel-63 [04:54:10] <_joe_> Amir1: have you seen the numbers? [04:54:31] it's low but it's paging because wdqs updater is falling behind [04:54:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:55:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:56:14] !incidents [04:56:14] 6795 (UNACKED) ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet codfw) [04:56:14] 6787 (RESOLVED) [2x] ProbeDown sre (dse-k8s-ctrl2002:6443 probes/custom codfw) [04:56:24] calling search platform [04:58:22] !ack 6795 [04:58:22] 6795 (ACKED) ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet codfw) [04:58:28] Acking so it stops paging me [04:58:33] Calling Ryan [04:59:18] <_joe_> Amir1: you shouldn't be the only person being paged. 
it's 10 pm on the US west coast, I get paged until 11 pm [04:59:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [05:00:49] <_joe_> Amir1: it's recovering btw [05:01:07] <_joe_> https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-3c-3e-origin-servers-overview?orgId=1&from=now-24h&to=now&timezone=utc&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet&var-origin=wdqs-scholarly.discovery.wmnet&var-origin=wdqs.discovery.wmnet&viewPanel=panel-12 [05:01:45] <_joe_> ehhh not really actually [05:03:16] guillaume is waking up and will call someone [05:03:57] <_joe_> All wdqs-main servers in codfw are marked partially up [05:04:20] <_joe_> Fetch failed (https://localhost/readiness-probe) [05:04:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [05:05:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Requesting Kerberos access for sd - https://phabricator.wikimedia.org/T405219#11212705 (10SD0001) Got the email, and have reset the temporary password. Thanks! [05:05:10] <_joe_> load on the servers is in the 100s [05:05:24] ok, I'm here... [05:05:43] thanks [05:05:54] I think all codfw updaters are broken it seems?
[05:05:58] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:07:13] so, the updater itself seems ok, but the WDQS servers are overloaded and can't apply updates? [05:09:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212706 (10phaultfinder) [05:10:29] we're sending all traffic to codfw, so more load per server than usual. We should be provisioned to handle that, but maybe we're not. [05:10:58] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:12:57] I'm trying to get hold of David. [05:14:49] WDQS SLOs are low enough, it's not the end of the world if it is down for a few hours.
[05:14:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212707 (10phaultfinder) [05:15:58] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:17:33] I have to board the plane. I don't have access for a couple of hours [05:17:48] the page is acked so it shouldn't wake up anyone else [05:18:03] (unless it doesn't resolve in 24 hours, which I hope not) [05:18:54] Amir1: thanks a lot ! [05:19:06] <_joe_> I'll take a look at traffic patterns in the meantime [05:21:22] (03PS1) 10KartikMistry: cxserver: staging: Update to 2025-09-25-051716-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191231 (https://phabricator.wikimedia.org/T394982) [05:22:08] Updating cxserver/staging ^ [05:23:19] <_joe_> gehel: is there a cookbook/runbook to roll restart blazegraph? this looks like a query-of-death situation tbh [05:23:30] (03CR) 10KartikMistry: [C:03+2] cxserver: staging: Update to 2025-09-25-051716-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191231 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [05:23:32] <_joe_> traffic wasn't particularly elevated when this happened [05:24:24] There should be one [05:24:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212712 (10phaultfinder) [05:25:01] I have to feed the kids and send them to school. 
I'll be back in 40' [05:25:11] (03Merged) 10jenkins-bot: cxserver: staging: Update to 2025-09-25-051716-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191231 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [05:25:54] <_joe_> Oh I have stuff to do too.... I guess I'll get going. [05:26:40] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [05:27:02] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:29:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [05:30:54] !log staging: Updated cxserver to 2025-09-25-051716-production (T394982) [05:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:02] T394982: Migrate cxserver in production to node22 - https://phabricator.wikimedia.org/T394982 [05:38:18] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543 (10Superpes15) 03NEW [05:39:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:50] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [05:45:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11212737 (10phaultfinder) [05:55:58] 
FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:59:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600) [06:00:05] marostegui, Amir1, and federico3: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0600). [06:04:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:04:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11212755 (10phaultfinder) [06:14:09] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [06:15:32] (03PS1) 10Kosta Harlan: CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191233 (https://phabricator.wikimedia.org/T405342) [06:16:03] Amir1: are you deploying now, or can I sync something? 
[06:16:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:19:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [06:19:54] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:20:03] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:20:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191233 (https://phabricator.wikimedia.org/T405342) (owner: 10Kosta Harlan) [06:21:23] (03Merged) 10jenkins-bot: CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191233 (https://phabricator.wikimedia.org/T405342) (owner: 10Kosta Harlan) [06:21:54] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1191233|CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis (T405342)]] [06:22:00] 
T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342 [06:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:24:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:26:23] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1191233|CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis (T405342)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [06:28:19] !log kharlan@deploy1003 kharlan: Continuing with sync [06:29:58] (03PS2) 10Giuseppe Lavagetto: tls: ban default UAs with forge URLs [puppet] - 10https://gerrit.wikimedia.org/r/1190004 (https://phabricator.wikimedia.org/T400119) [06:31:31] !log restarting blazegraph on wdqs-main@codfw [06:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:30] (03CR) 10Muehlenhoff: "maps2011-maps2014 are now fully replicated, good to merge" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190578 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [06:33:16] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191233|CheckUser/UserInfoCard: Phase 2 enable by default on pilot wikis (T405342)]] (duration: 11m 22s) [06:33:22] T405342: Enable UserInfoCard by default on a set of wikis - https://phabricator.wikimedia.org/T405342 [06:33:34] (03PS1) 10TChin: [eventgate-*] Bump to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169) [06:33:59] (03CR) 10Giuseppe Lavagetto: 
"Because the logic is more complex than what that would imply." [puppet] - 10https://gerrit.wikimedia.org/r/1190004 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [06:34:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:34:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:35:11] (03CR) 10Giuseppe Lavagetto: [C:03+2] tls: ban default UAs with forge URLs [puppet] - 10https://gerrit.wikimedia.org/r/1190004 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [06:35:58] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:37:17] RESOLVED: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:40:58] RESOLVED: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF 
Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:43:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [06:44:14] aux-k8s-etcd1003, dse-k8s-etcd1001 and kubestagemaster1005 will go down for a Ganeti reboot [06:44:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [06:45:52] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:46:14] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100% [06:46:32] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [06:50:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [06:50:28] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [06:50:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [06:50:54] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [06:50:57] FIRING: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:51:16] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [06:52:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [06:54:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T405378#11212804 (10phaultfinder) [06:55:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:55:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1005.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1005.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:57:30] (03PS21) 10Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [06:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:58:45] jmm@cumin2002 drain-node (PID 172439) is awaiting input [07:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T0700). [07:00:05] James_F, sergi0, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:10] Heya. [07:00:15] o/ [07:00:32] sergi0: Did you want to deploy? You should go first either way. Happy to do it. 
[07:00:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [07:01:31] (03CR) 10Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [07:03:24] James_F go for it, I will test [07:03:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:03:49] Hmm. [07:04:10] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1190698 depends on a MetricsPlatform patch, so you'll need that cherry-picked too? [07:05:27] … which doesn't cherry-pick cleanly, eurgh. [07:05:35] hmm, I should have removed the dependency; the MP patch is already in wmf.20 as far as I can see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MetricsPlatform/+/1189522 [07:05:40] sergi0: Can you create the cherry-picks? [07:05:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:06:03] Oh, right, but SpiderPig thinks it isn't. [07:06:12] Yeah, I'll just drop the dependency. [07:06:24] (03PS2) 10Jforrester: ExperimentXLabManager: allow to re-enroll a user in experiments [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: 10Sergio Gimeno) [07:06:30] Sorry about that, ty!
[07:06:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: 10Sergio Gimeno) [07:06:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [07:06:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:06:43] No worries at all! [07:06:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [07:07:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [07:09:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212819 (10phaultfinder) [07:10:56] jmm@cumin2002 drain-node (PID 180950) is awaiting input [07:13:25] RESOLVED: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:17:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [07:18:00] (03CR) 10Slyngshede: [C:03+2] P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [07:18:59] (03CR) 10CI reject: [V:04-1] ExperimentXLabManager: allow to re-enroll a user in experiments [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: 10Sergio Gimeno) [07:19:10] Eurgh. [07:19:48] sergi0: Do the API tests sometimes fail like this for GrowthExperiments, or is this CI failure likely real? [07:20:38] I'll do the simple config patches whilst we work that out. [07:20:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190702 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno) [07:20:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189617 (https://phabricator.wikimedia.org/T404700) (owner: 10Anzx) [07:20:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189616 (https://phabricator.wikimedia.org/T404700) (owner: 10Anzx) [07:21:09] Looking into it, I had not seen that error on GE before [07:21:21] I can C+2 it again and see if it passes. [07:21:49] (03Merged) 10jenkins-bot: Growth [testwiki]: enable new notifications and reduce scheduling time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190702 (https://phabricator.wikimedia.org/T404085) (owner: 10Sergio Gimeno) [07:21:50] (03CR) 10Jforrester: [C:03+2] "Let's try this again." 
[extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: 10Sergio Gimeno) [07:21:51] (03Merged) 10jenkins-bot: mswikiquote: set timezone, sitename and project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189617 (https://phabricator.wikimedia.org/T404700) (owner: 10Anzx) [07:21:56] (03Merged) 10jenkins-bot: mswikiquote: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189616 (https://phabricator.wikimedia.org/T404700) (owner: 10Anzx) [07:22:02] OK, first batch going ahead now. [07:22:16] (03PS1) 10Slyngshede: Revert "P:puppetserver::volatile Include XCheeseScore private repo" [puppet] - 10https://gerrit.wikimedia.org/r/1191239 [07:22:25] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1190702|Growth [testwiki]: enable new notifications and reduce scheduling time (T404085)]], [[gerrit:1189617|mswikiquote: set timezone, sitename and project namespace (T404700)]], [[gerrit:1189616|mswikiquote: add logo (T404700)]] [07:22:34] T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085 [07:22:34] T404700: Post-creation work for mswikiquote - https://phabricator.wikimedia.org/T404700 [07:22:37] sergi0: Please be ready to test the notifications on testwiki in a minute or two. 
[07:23:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [07:23:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [07:24:03] (03Merged) 10jenkins-bot: ExperimentXLabManager: allow to re-enroll a user in experiments [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190698 (https://phabricator.wikimedia.org/T401308) (owner: 10Sergio Gimeno) [07:24:03] James_F: ack [07:24:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [07:24:24] Aha, cool, next time I sync it'll pick that up. [07:26:01] James_F: mswikiquote logo and other changes look good to sync [07:26:15] anzx: Cool, thank you. [07:28:00] (03PS1) 10Slyngshede: P::puppetserver::volatile fix xcheesescore repo path [puppet] - 10https://gerrit.wikimedia.org/r/1191240 (https://phabricator.wikimedia.org/T404688) [07:28:08] jmm@cumin2002 drain-node (PID 188053) is awaiting input [07:28:41] (03PS4) 10D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) [07:28:47] !log jforrester@deploy1003 anzx, jforrester, sgimeno: Backport for [[gerrit:1190702|Growth [testwiki]: enable new notifications and reduce scheduling time (T404085)]], [[gerrit:1189617|mswikiquote: set timezone, sitename and project namespace (T404700)]], [[gerrit:1189616|mswikiquote: add logo (T404700)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:28:56] T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085 [07:28:57] T404700: Post-creation work for mswikiquote - https://phabricator.wikimedia.org/T404700 [07:29:20] sergi0: Please check.
[07:29:25] on it [07:29:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [07:30:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212862 (10phaultfinder) [07:30:15] (03CR) 10Slyngshede: [C:03+2] P::puppetserver::volatile fix xcheesescore repo path [puppet] - 10https://gerrit.wikimedia.org/r/1191240 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [07:30:56] (03PS1) 10Muehlenhoff: Add maps1012 to maps1014 as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565) [07:31:12] James_F: lgtm [07:31:15] Ack. [07:31:17] !log jforrester@deploy1003 anzx, jforrester, sgimeno: Continuing with sync [07:33:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:34:32] James_F: please run namespacedupes for mswikiquote after completion of sync [07:34:41] anzx: Ack. 
[07:35:49] (03PS1) 10Slyngshede: P:puppetserver::volatile repo -> repos [puppet] - 10https://gerrit.wikimedia.org/r/1191243 [07:35:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [07:36:01] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190702|Growth [testwiki]: enable new notifications and reduce scheduling time (T404085)]], [[gerrit:1189617|mswikiquote: set timezone, sitename and project namespace (T404700)]], [[gerrit:1189616|mswikiquote: add logo (T404700)]] (duration: 13m 36s) [07:36:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [07:36:10] T404085: Release Plan for Growth's notification A/B test - https://phabricator.wikimedia.org/T404085 [07:36:11] T404700: Post-creation work for mswikiquote - https://phabricator.wikimedia.org/T404700 [07:36:49] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1190698|ExperimentXLabManager: allow to re-enroll a user in experiments (T401308)]] [07:36:55] !log jforrester@deploy1003 mwscript-k8s job started: namespaceDupes mswikiquote --fix # T404700 [07:36:56] T401308: Create A/B test experiment for leveling up notifications - https://phabricator.wikimedia.org/T401308 [07:37:30] James_F: thanks for deploying [07:37:35] anzx: Would the list of moved pages be useful? None look surprising. [07:38:04] not required, all pages look correctly moved [07:38:12] Excellent [07:38:21] Thank you James! [07:38:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [07:38:47] Deploying to debug servers now.
[07:40:07] (03CR) 10Slyngshede: [C:03+2] P:puppetserver::volatile repo -> repos [puppet] - 10https://gerrit.wikimedia.org/r/1191243 (owner: 10Slyngshede) [07:42:42] !log jforrester@deploy1003 jforrester, sgimeno: Backport for [[gerrit:1190698|ExperimentXLabManager: allow to re-enroll a user in experiments (T401308)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:42:50] T401308: Create A/B test experiment for leveling up notifications - https://phabricator.wikimedia.org/T401308 [07:43:35] sergi0: Please check. [07:43:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [07:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:44:51] (03PS1) 10Slyngshede: P:puppetserver::volatile xcheesescore main branch [puppet] - 10https://gerrit.wikimedia.org/r/1191246 [07:45:01] on it [07:46:01] Thanks. 
[07:46:56] (03PS1) 10Ryan Kemper: Simplify make_api_call function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191247 [07:46:56] (03PS1) 10Ryan Kemper: Flush markers propagates APIClientError [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191248 [07:47:15] (03CR) 10Slyngshede: [C:03+2] P:puppetserver::volatile xcheesescore main branch [puppet] - 10https://gerrit.wikimedia.org/r/1191246 (owner: 10Slyngshede) [07:49:45] !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [07:49:48] !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [07:49:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [07:49:58] James_F: good from my side [07:49:59] (03PS1) 10KartikMistry: cxserver: staging: Update to 2025-09-25-074241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191249 (https://phabricator.wikimedia.org/T394982) [07:50:03] !log jforrester@deploy1003 jforrester, sgimeno: Continuing with sync [07:50:05] Cool. [07:50:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [07:50:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [07:50:44] Let's see how swiftly we can get the Graph removal landed. [07:51:10] (03PS3) 10Jforrester: Stop loading the Graph extension anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) [07:51:41] Minor cxserver deployment.. 
[07:52:00] (03CR) 10KartikMistry: [C:03+2] cxserver: staging: Update to 2025-09-25-074241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191249 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [07:52:14] !log brouberol@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [07:52:19] (03CR) 10Jforrester: [C:03+2] Stop loading the Graph extension anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [07:52:26] !log brouberol@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [07:52:37] yay [07:53:20] (03Merged) 10jenkins-bot: Stop loading the Graph extension anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [07:53:44] (03Merged) 10jenkins-bot: cxserver: staging: Update to 2025-09-25-074241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191249 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [07:54:52] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190698|ExperimentXLabManager: allow to re-enroll a user in experiments (T401308)]] (duration: 18m 03s) [07:54:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212915 (10phaultfinder) [07:55:00] T401308: Create A/B test experiment for leveling up notifications - https://phabricator.wikimedia.org/T401308 [07:55:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [07:55:27] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1184797|Stop loading the Graph extension anywhere (T362317)]] [07:55:32] T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317 [07:55:45] 06SRE-OnFire, 
06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11212920 (10fgiunchedi) a:05dcaro→03fgiunchedi [07:55:48] !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [07:56:07] (03CR) 10CI reject: [V:04-1] Flush markers propagates APIClientError [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191248 (owner: 10Ryan Kemper) [07:56:16] (03CR) 10CI reject: [V:04-1] Simplify make_api_call function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191247 (owner: 10Ryan Kemper) [07:56:28] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [07:56:52] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [07:56:57] !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [07:58:02] !log staging: Updated cxserver to 2025-09-25-074241-production (T394982) [07:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:07] T394982: Migrate cxserver in production to node22 - https://phabricator.wikimedia.org/T394982 [07:58:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:59:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11212940 (10phaultfinder) [08:01:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [08:01:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node 
(exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [08:01:38] (03PS1) 10Slyngshede: D:git::clone add environment to pull command [puppet] - 10https://gerrit.wikimedia.org/r/1191291 (https://phabricator.wikimedia.org/T404688) [08:01:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [08:04:35] FIRING: DiskSpace: Disk space deploy1003:9100:/srv 3.632% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:42] (03CR) 10Brouberol: [C:03+1] idp: Add dummy data for airflow-wikidata [labs/private] - 10https://gerrit.wikimedia.org/r/1191190 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:07:00] jmm@cumin2002 drain-node (PID 217655) is awaiting input [08:07:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:11:55] (03CR) 10Fabfur: [C:03+1] D:git::clone add environment to pull command [puppet] - 10https://gerrit.wikimedia.org/r/1191291 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [08:12:46] (03CR) 10Slyngshede: [C:03+2] D:git::clone add environment to pull command [puppet] - 10https://gerrit.wikimedia.org/r/1191291 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [08:14:05] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [08:14:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Recommendation-API: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11212984 (10Nikerabbit) [08:14:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212985 (10phaultfinder) [08:17:59] (03CR) 10Filippo Giunchedi: [C:03+1] "Patch LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [08:18:10] (03CR) 10Reedy: [C:03+1] OATHAuth: Increase 2FA opt-in to 20% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191100 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [08:20:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [08:20:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [08:21:26] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1184797|Stop loading the Graph extension anywhere (T362317)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:21:31] T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317 [08:22:02] !log jforrester@deploy1003 jforrester: Continuing with sync [08:22:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [08:23:08] (03PS1) 10Tiziano Fogli: loki: increase ulimit nofile [puppet] - 10https://gerrit.wikimedia.org/r/1191300 (https://phabricator.wikimedia.org/T405552) [08:24:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1030.eqiad.wmnet [08:24:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [08:25:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1030.eqiad.wmnet [08:26:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [08:26:33] (03CR) 10Stevemunene: [V:03+2 C:03+2] idp: Add dummy data for airflow-wikidata [labs/private] - 10https://gerrit.wikimedia.org/r/1191190 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:30:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [08:34:41] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184797|Stop loading the Graph extension anywhere (T362317)]] (duration: 39m 14s) [08:34:48] T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317 [08:34:52] Finally. 
[08:36:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [08:36:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [08:38:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:39:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [08:42:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [08:43:58] (03PS1) 10Ryan Kemper: Remove test_flush_markers_on_clusters_fail_synced [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191303 [08:43:58] (03PS1) 10Ryan Kemper: Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191304 [08:44:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [08:49:04] (03PS2) 10Ryan Kemper: Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191304 [08:49:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [08:49:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes 
(http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [08:49:56] (03CR) 10Elukey: [C:03+1] wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) (owner: 10RLazarus) [08:50:02] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213132 (10phaultfinder) [08:51:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [08:52:24] (03CR) 10CI reject: [V:04-1] Remove test_flush_markers_on_clusters_fail_synced [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191303 (owner: 10Ryan Kemper) [08:53:34] (03CR) 10CI reject: [V:04-1] Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191304 (owner: 10Ryan Kemper) [08:57:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [08:57:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [08:58:02] (03CR) 10Elukey: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [08:58:34] (03CR) 10CI reject: [V:04-1] Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191304 (owner: 10Ryan Kemper) [08:58:50] (03CR) 10Tiziano Fogli: [C:03+2] "self-merge to fix the service" [puppet] - 
10https://gerrit.wikimedia.org/r/1191300 (https://phabricator.wikimedia.org/T405552) (owner: 10Tiziano Fogli) [08:59:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [08:59:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:01:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:19] jmm@cumin2002 drain-node (PID 246015) is awaiting input [09:04:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [09:06:46] (03PS1) 10Btullis: Remove the existing spark-operator release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191136 (https://phabricator.wikimedia.org/T405490) [09:06:50] (03PS1) 10Btullis: Remove our custom spark-operator helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191137 (https://phabricator.wikimedia.org/T405490) [09:06:54] (03PS1) 10Btullis: Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) [09:06:58] (03PS2) 10Btullis: Import the upstream spark-operator 
chart version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) [09:07:05] (03PS2) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [09:07:11] (03PS2) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [09:11:07] 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11213197 (10jijiki) My conce... [09:12:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [09:12:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [09:12:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:13:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [09:13:16] (03CR) 10Elukey: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [09:14:02] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543#11213203 
(10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Done, and they look correct to me: ` root@ms-fe2009:~# for i in... [09:17:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [09:17:17] (03CR) 10Cathal Mooney: "Ha thanks yep that's all I need. Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:18:07] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [09:18:33] (03CR) 10Cathal Mooney: [C:03+2] Nokia: ESI-LAG configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:18:38] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. [09:19:55] (03Merged) 10jenkins-bot: Nokia: ESI-LAG configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:19:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213226 (10phaultfinder) [09:20:04] (03PS15) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) [09:20:04] (03CR) 10Arnaudb: "the goal of this CR is:" [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [09:20:50] (03PS16) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) [09:22:36] (03CR) 10Elukey: [C:03+2] services: move kartotherian and tegola to the new codfw stack [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190578 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:23:03] (03CR) 10Arnaudb: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [09:23:13] jouncebot: next [09:23:13] In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1000) [09:23:29] (03PS2) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [09:24:30] (03PS1) 10Ryan Kemper: WIP: rewriting test_force_allocation_of_all_unassigned_shards [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191310 [09:24:44] (03PS1) 10Mvolz: Update zotero to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191311 [09:25:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet [09:25:09] (03CR) 10Btullis: "The CI is failing with the following error:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:25:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet [09:27:16] (03PS1) 10Dreamy Jazz: CheckUser: Enable SI special page on enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) [09:27:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: 10Dreamy Jazz) [09:29:19] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [09:31:21] (03PS1) 10Elukey: services: move tegola's codfw postgres config to the new stack [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191314 
(https://phabricator.wikimedia.org/T381565) [09:31:49] (03PS2) 10Btullis: Remove the existing spark-operator release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191136 (https://phabricator.wikimedia.org/T405490) [09:31:50] (03PS2) 10Btullis: Remove our custom spark-operator helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191137 (https://phabricator.wikimedia.org/T405490) [09:31:50] (03PS2) 10Btullis: Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) [09:31:50] (03PS3) 10Btullis: Import the upstream spark-operator chart version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) [09:31:51] (03PS3) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [09:31:55] (03PS3) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [09:31:59] (03PS3) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [09:33:05] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558 (10cmooney) 03NEW p:05Triage→03Medium [09:33:10] (03CR) 10CI reject: [V:04-1] WIP: rewriting test_force_allocation_of_all_unassigned_shards [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191310 (owner: 10Ryan Kemper) [09:33:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11213294 (10cmooney) [09:33:39] 06SRE, 06Infrastructure-Foundations, 
10netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11213293 (10cmooney) [09:34:40] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 13Patch-For-Review, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11213300 (10taavi) 05Open→03Resolved [09:38:11] (03PS2) 10Brouberol: Flush markers propagates APIClientError [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191248 (owner: 10Ryan Kemper) [09:38:13] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [09:38:51] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [09:39:02] (03CR) 10Elukey: [C:03+2] services: move tegola's codfw postgres config to the new stack [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191314 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:39:24] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [09:39:44] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [09:40:04] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. 
[09:40:50] (03CR) 10CI reject: [V:04-1] Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:41:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560 (10cmooney) 03NEW [09:41:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11213334 (10cmooney) [09:41:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11213333 (10cmooney) [09:42:14] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [09:44:37] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562 (10cmooney) 03NEW p:05Triage→03Medium [09:44:50] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11213369 (10cmooney) [09:44:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11213370 (10cmooney) [09:47:49] (03CR) 10CI reject: [V:04-1] Flush markers propagates APIClientError [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191248 (owner: 10Ryan Kemper) [09:52:18] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [09:52:23] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11213413 (10cmooney) [10:00:04] Deploy window MediaWiki 
infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1000) [10:01:20] (03PS1) 10Slyngshede: P:idp re-add NDA group for Netbox OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494) [10:01:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:45] (03PS1) 10Pmiazga: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405544) [10:02:56] (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405544) (owner: 10Pmiazga) [10:09:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213498 (10phaultfinder) [10:09:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11213499 (10phaultfinder) [10:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:16:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:19:44] FIRING: [6x] 
NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:19:58] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:27:25] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11213539 (10elukey) >>! In T394357#11193379, @Jhancock.wm wrote: > anything i can try onsite to help? @Jhancock.wm not sure, I tried to upgrade the BIOS/BMC firmware + BMC reset... [10:28:08] * elukey lunch! 
[10:28:14] wrong chan :D [10:32:26] (03PS2) 10Slyngshede: P:idp add ops group for Netbox OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494) [10:33:24] (03PS3) 10Slyngshede: P:idp add ops group for Netbox OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494) [10:34:48] (03CR) 10Stevemunene: [C:03+2] idp: Register airflow-wikidata IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1190979 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [10:40:31] (03PS1) 10Lucas Werkmeister (WMDE): statistics::wmde: Remove unused graphite_host [puppet] - 10https://gerrit.wikimedia.org/r/1191322 [10:41:00] (03CR) 10CI reject: [V:04-1] statistics::wmde: Remove unused graphite_host [puppet] - 10https://gerrit.wikimedia.org/r/1191322 (owner: 10Lucas Werkmeister (WMDE)) [10:41:23] (03CR) 10Lucas Werkmeister (WMDE): "Disclaimer: I know very little puppet and don’t know if this change is correct, please review with caution! 
All I know is that we don’t ne" [puppet] - 10https://gerrit.wikimedia.org/r/1191322 (owner: 10Lucas Werkmeister (WMDE)) [10:42:18] (03PS2) 10Lucas Werkmeister (WMDE): statistics::wmde: Remove unused graphite_host [puppet] - 10https://gerrit.wikimedia.org/r/1191322 [10:43:13] (03PS4) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [10:43:13] (03PS4) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [10:43:13] (03PS4) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [10:44:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213592 (10phaultfinder) [10:45:29] (03PS1) 10Filippo Giunchedi: interface: new define for additional IPs [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) [10:45:31] (03PS1) 10Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) [10:46:03] (03CR) 10CI reject: [V:04-1] interface: new define for additional IPs [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [10:46:09] (03CR) 10CI reject: [V:04-1] wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [10:50:11] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213609 (10phaultfinder) 
[10:53:07] (03PS2) 10Filippo Giunchedi: interface: new define for additional IPs [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) [10:53:07] (03PS2) 10Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) [10:53:46] (03CR) 10CI reject: [V:04-1] wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [10:55:05] (03PS1) 10Clément Goubert: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368) [10:56:21] (03PS3) 10Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) [10:57:53] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543#11213639 (10Superpes15) [10:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11213650 (10phaultfinder) [10:59:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11213649 (10phaultfinder) [11:00:06] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [11:01:02] (03PS1) 10Sergio Gimeno: fix: 
provide a eventType fallback for already scheduled jobs [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191334 (https://phabricator.wikimedia.org/T405514) [11:01:23] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for newly created arbcom_plwiki - https://phabricator.wikimedia.org/T405543#11213658 (10MatthewVernon) [11:07:11] (03CR) 10Slyngshede: [C:03+2] P:idp add ops group for Netbox OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1191317 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [11:09:06] (03PS1) 10WMDE-Fisch: Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) [11:09:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213686 (10phaultfinder) [11:12:16] (03CR) 10Stevemunene: [C:03+2] Add airflow-wikidata namespace in admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190974 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:14:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11213721 (10phaultfinder) [11:15:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11213722 (10phaultfinder) [11:18:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [11:20:24] (03Merged) 10jenkins-bot: Add airflow-wikidata namespace in admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190974 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:21:10] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch) [11:21:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [11:21:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye [11:21:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1210.eqiad.wmnet with OS bullseye [11:22:58] (03PS1) 10Hnowlan: (api|rest)-gateway: Add option to disable CSP, disable for rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) [11:24:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11213772 (10phaultfinder) [11:25:29] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [11:25:42] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:26:44] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:29:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet [11:29:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [11:37:05] (03CR) 10A smart kitten: "(Just FYI, this apparently shouldn't've been merged as far before https://gerrit.wikimedia.org/r/1190347 as it was; see T405313#11210087)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [11:38:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:40:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [11:41:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:14] !log ladsgroup@cumin1003 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw for all core sections [11:42:30] (03PS2) 10Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [11:42:41] (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [11:42:55] (03PS3) 10Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [11:43:06] (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [11:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:45:14] 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11213867 (10kostajh) >>! In... [11:45:49] jmm@cumin2002 drain-node (PID 323692) is awaiting input [11:47:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [11:47:35] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw for all core sections [11:48:00] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [11:51:02] (03PS19) 10Brouberol: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [11:52:01] (03Abandoned) 10Brouberol: Fix linting errors [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189764 (owner: 10Brouberol) [11:52:05] (03Abandoned) 10Brouberol: Fix test_flush_markers_on_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189765 (owner: 10Brouberol) 
[11:52:10] (03Abandoned) 10Brouberol: Pass the timeout to the underlying http client [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189766 (owner: 10Brouberol) [11:52:15] (03Abandoned) 10Brouberol: Simplify make_api_call function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191247 (owner: 10Ryan Kemper) [11:52:20] (03Abandoned) 10Brouberol: Flush markers propagates APIClientError [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191248 (owner: 10Ryan Kemper) [11:52:24] (03Abandoned) 10Brouberol: Remove test_flush_markers_on_clusters_fail_synced [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191303 (owner: 10Ryan Kemper) [11:52:28] (03Abandoned) 10Brouberol: Fix test_get_next_nodes_returns_masters_after_other_nodes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191304 (owner: 10Ryan Kemper) [11:52:32] (03Abandoned) 10Brouberol: WIP: rewriting test_force_allocation_of_all_unassigned_shards [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191310 (owner: 10Ryan Kemper) [11:54:04] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [11:55:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11213912 (10phaultfinder) [11:55:31] (03CR) 10Lucas Werkmeister (WMDE): CheckUser: Enable SI special page on enwiki and frwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: 10Dreamy Jazz) [11:57:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [11:57:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [11:57:13] (03PS2) 10Dreamy Jazz: CheckUser: Enable SI special page on 
enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) [11:58:06] (03PS1) 10Stevemunene: admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) [11:58:27] (03PS1) 10D3r1ck01: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350 [11:58:40] (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:58:54] (03PS1) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) [11:59:31] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1200) [12:04:24] (03PS2) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) [12:04:50] FIRING: DiskSpace: Disk space deploy1003:9100:/srv 3.532% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:40] (03PS3) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) [12:06:24] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191069 (owner: 10PipelineBot) [12:06:26] (03PS20) 10Brouberol: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [12:08:24] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191069 (owner: 10PipelineBot) [12:09:25] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:09:45] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:09:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579 (10cmooney) 03NEW p:05Triage→03Medium [12:09:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214001 (10phaultfinder) [12:10:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214002 (10cmooney) [12:10:15] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11214003 (10cmooney) [12:10:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [12:10:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D 
refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214006 (10cmooney) [12:13:46] (03PS2) 10JMeybohm: haproxy ipblocks-all: Filter disabled ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/1190274 (https://phabricator.wikimedia.org/T402014) [12:14:05] (03PS1) 10D3r1ck01: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359 [12:14:23] (03CR) 10JMeybohm: "I did mess up the range loop in the last patchset. Corrected that as well." [puppet] - 10https://gerrit.wikimedia.org/r/1190274 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [12:14:27] (03PS1) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) [12:14:35] 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11214010 (10Reedy) >>! In T4... 
[12:14:52] (03PS2) 10D3r1ck01: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) [12:15:05] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:15:19] (03PS1) 10D3r1ck01: hCaptcha: Fix mock for StatsFactory [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361 [12:15:27] (03CR) 10CI reject: [V:04-1] session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [12:15:32] (03PS1) 10D3r1ck01: NewcomerTasks: Use StatsFactory unit test helper [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191362 [12:15:32] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:15:42] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:16:01] (03PS1) 10Slyngshede: P:openldap::management add netbox-readonly-access to offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494) [12:16:10] (03PS1) 10KartikMistry: Update cxserver to 2025-09-25-074241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191364 (https://phabricator.wikimedia.org/T394982) [12:16:12] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:16:17] jmm@cumin2002 drain-node (PID 338228) is awaiting input [12:19:10] (03CR) 10D3r1ck01: "recheck" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [12:19:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214030 (10cmooney) 
[12:19:22] 06SRE, 06Infrastructure-Foundations: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580 (10elukey) 03NEW p:05Triage→03Unbreak! [12:23:10] jouncebot: nowandnext [12:23:10] For the next 0 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1200) [12:23:10] In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1300) [12:25:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359 (owner: 10D3r1ck01) [12:25:18] (03PS1) 10DDesouza: Pre-deploy design research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) [12:25:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [12:25:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191362 (owner: 10D3r1ck01) [12:26:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361 (owner: 10D3r1ck01) [12:26:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [12:26:41] I'm going to deploy now [12:26:47] If anyone else isn't already [12:26:58] (03PS2) 10DDesouza: Pre-deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) [12:27:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350 (owner: 10D3r1ck01) [12:27:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [12:28:28] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191374 [12:28:56] 06SRE, 06Infrastructure-Foundations: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214065 (10elukey) This may be the root cause: ` 2025-09-24T20:05:13.401802+00:00 puppetserver1001 sudo : TTY=pts/6 ; PWD=/home/denisse ; USER=root ; COMMAND=/usr/bin/puppet ss... 
[12:29:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11214071 (10phaultfinder) [12:31:33] (03PS1) 10Sbisson: Special:Contribute: configure new page target title for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063) [12:33:53] (03PS1) 10D3r1ck01: Enable multibackend session store on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) [12:34:27] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11214082 (10cmooney) @Jclark-ctr @VRiley-WMF I may have missed to check we have the cables needed for these already. We're re-using exsiting... [12:34:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch) [12:34:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11214084 (10cmooney) a:03cmooney [12:34:44] 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11214083 (10kostajh) >>! In... 
[12:34:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214086 (10phaultfinder) [12:35:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [12:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:30] (03PS2) 10D3r1ck01: Enable multibackend session store on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) [12:38:11] !log elukey@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for puppetserver1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1003 [12:38:35] (03PS5) 10D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) [12:38:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:43:38] I'm actively deploying changes to private code and then will deploy the public config patch [12:44:49] (03PS1) 10Elukey: sre.puppet.renew-cert: skip destroy when needed. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) [12:45:29] kubestagemaster1003 will go down for a Ganeti node reboot [12:45:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet [12:46:25] (03CR) 10Muehlenhoff: "Typo inline, otherwise LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [12:47:15] (03PS2) 10Elukey: sre.puppet.renew-cert: skip destroy when needed. [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) [12:47:29] (03CR) 10Elukey: sre.puppet.renew-cert: skip destroy when needed. (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [12:47:36] PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:43] !log swap read only for db1176/db2230 (test-s4) T403966 [12:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:49] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [12:49:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214114 (10phaultfinder) [12:50:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1191387 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [12:51:04] RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [12:51:57] FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:53:23] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet [12:53:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [12:54:20] !log elukey@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for puppetserver1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1003 [12:55:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [12:56:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:57:12] (03CR) 10Muehlenhoff: "Please also update modules/admin/data/nda_groups.txt (evntually I'll fix the various scripts to read the list from there, but for now we n" [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [12:58:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: 10Dreamy Jazz) [12:59:37] (03Merged) 10jenkins-bot: CheckUser: Enable SI special page on enwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191312 (https://phabricator.wikimedia.org/T405556) (owner: 10Dreamy Jazz) [12:59:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214162 (10phaultfinder) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1300). [13:00:05] Dreamy_Jazz, tgr, danisztls, and WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1191312|CheckUser: Enable SI special page on enwiki and frwiki (T405556)]] [13:00:11] T405556: Suggested investigations: Enable special page on English and French Wikipedia - https://phabricator.wikimedia.org/T405556 [13:00:14] o/ [13:00:17] o/ [13:01:02] \o [13:01:07] I am self deploying my backports [13:01:28] o/ [13:01:29] jmm@cumin2002 drain-node (PID 359050) is awaiting input [13:01:30] Anyone that needs to merge a backport that will take a while in CI (like the core changes) should be safe to start now [13:01:36] As this config change shouldn't be reverted [13:01:46] *reverted at the test stage [13:01:52] (03PS1) 10Fabfur: haproxy:cache: discard requests w/o Host header [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) [13:02:06] I will be done after the one that I am currently deploying [13:02:22] thx, will do that then [13:02:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet [13:02:47] (03CR) 10Gergő Tisza: [C:03+2] objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359 (owner: 10D3r1ck01) [13:02:49] (03CR) 10Gergő Tisza: [C:03+2] session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [13:02:50] (03CR) 10Gergő Tisza: [C:03+2] NewcomerTasks: Use StatsFactory unit test helper [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 
10https://gerrit.wikimedia.org/r/1191362 (owner: 10D3r1ck01) [13:02:53] (03CR) 10Gergő Tisza: [C:03+2] hCaptcha: Fix mock for StatsFactory [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361 (owner: 10D3r1ck01) [13:02:59] (03CR) 10Gergő Tisza: [C:03+2] objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350 (owner: 10D3r1ck01) [13:03:03] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214172 (10elukey) Tried to run the cookbook to renew the cert with a slight modification to skip the initial destroy. It failed when waiting for the new C... [13:03:09] (03CR) 10Gergő Tisza: [C:03+2] session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [13:03:13] (03PS1) 10Santiago Faci: xLab: Deploying v1.0.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191393 (https://phabricator.wikimedia.org/T385180) [13:03:53] (03CR) 10Majavah: [C:03+1] "This does fix the immediate issue and seems like a reasonable stop-gap unless/until we get around to converting everything everywhere to n" [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [13:04:06] tgr_: Feel free to +2 mine too ;-) [13:04:14] (03CR) 10Majavah: [C:04-1] wmcs: have additional IPs survive reboots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [13:04:38] (03PS4) 10Filippo Giunchedi: wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) [13:04:51] o/ I can self-deploy [13:07:41] !log 
dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1191312|CheckUser: Enable SI special page on enwiki and frwiki (T405556)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:47] T405556: Suggested investigations: Enable special page on English and French Wikipedia - https://phabricator.wikimedia.org/T405556 [13:08:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet [13:08:06] (03Merged) 10jenkins-bot: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191359 (owner: 10D3r1ck01) [13:08:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [13:08:31] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11214225 (10Tgr) >>! In T122097#11212641, @Krinkle wrote: > setting/changing a cookie is equivalent to discarding the br... 
[13:09:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.352s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:09:20] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [13:09:23] (03CR) 10Gergő Tisza: [C:03+2] Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch) [13:10:47] (03PS2) 10Fabfur: haproxy:cache: discard http1.0 requests [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) [13:14:09] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191312|CheckUser: Enable SI special page on enwiki and frwiki (T405556)]] (duration: 14m 04s) [13:14:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.352s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:14:16] T405556: Suggested investigations: Enable special page on English and French Wikipedia - https://phabricator.wikimedia.org/T405556 [13:14:19] Handing off to the next person [13:14:23] tgr_: [13:14:36] (03Merged) 10jenkins-bot: NewcomerTasks: Use StatsFactory unit test helper [extensions/GrowthExperiments] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191362 (owner: 10D3r1ck01) [13:15:25] thx [13:15:44] WMDE-Fisch: I'm deploying your backport as well then 
[13:15:54] tgr_: thanks yes [13:16:36] (03Merged) 10jenkins-bot: hCaptcha: Fix mock for StatsFactory [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191361 (owner: 10D3r1ck01) [13:16:39] (03Merged) 10jenkins-bot: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1191360 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [13:16:45] (03Merged) 10jenkins-bot: objectcache: Add a hit/miss flag to CachedBagOStuff [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191350 (owner: 10D3r1ck01) [13:16:50] (03Merged) 10jenkins-bot: session: Improve logging and monitoring in SessionStore implementations [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191351 (https://phabricator.wikimedia.org/T399195) (owner: 10D3r1ck01) [13:17:10] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur) [13:17:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet [13:19:15] (03PS3) 10Fabfur: haproxy:cache: discard http1.0 requests [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) [13:19:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11214262 (10phaultfinder) [13:19:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur) [13:19:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214261 (10phaultfinder) [13:20:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet [13:24:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for 
device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214287 (10phaultfinder) [13:25:21] (03Merged) 10jenkins-bot: Fix subref attribute order [extensions/Cite] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191341 (https://phabricator.wikimedia.org/T389363) (owner: 10WMDE-Fisch) [13:26:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet [13:26:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet [13:27:03] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1191359|objectcache: Add a hit/miss flag to CachedBagOStuff]], [[gerrit:1191360|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191361|hCaptcha: Fix mock for StatsFactory]], [[gerrit:1191362|NewcomerTasks: Use StatsFactory unit test helper]], [[gerrit:1191350|objectcache: Add a hit/miss flag to CachedB [13:27:03] agOStuff]], [[gerrit:1191351|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191341|Fix subref attribute order (T389363)]] [13:27:11] T399195: Update logging and monitoring for multiple session storage backends - https://phabricator.wikimedia.org/T399195 [13:27:12] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:27:13] T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363 [13:29:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet [13:31:49] (03CR) 10Ottomata: [C:03+1] [eventgate-*] Bump to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [13:32:43] (03PS1) 10DDesouza: Create Reader 
foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) [13:33:29] (03CR) 10CI reject: [V:04-1] Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:33:33] !log tgr@deploy1003 d3r1ck01, wmde-fisch, tgr: Backport for [[gerrit:1191359|objectcache: Add a hit/miss flag to CachedBagOStuff]], [[gerrit:1191360|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191361|hCaptcha: Fix mock for StatsFactory]], [[gerrit:1191362|NewcomerTasks: Use StatsFactory unit test helper]], [[gerrit:1191350|objectcache: Add a hit/miss flag to Cache [13:33:33] dBagOStuff]], [[gerrit:1191351|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191341|Fix subref attribute order (T389363)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:33:42] T399195: Update logging and monitoring for multiple session storage backends - https://phabricator.wikimedia.org/T399195 [13:33:43] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:33:43] T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363 [13:33:44] * WMDE-Fisch testing [13:33:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:35:01] jmm@cumin2002 drain-node (PID 377806) is awaiting input [13:38:12] tgr_: I'm fine [13:38:25] !log tgr@deploy1003 d3r1ck01, wmde-fisch, tgr: Continuing with sync [13:38:38] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214327 (10taavi) `lang=shell-session taavi@puppetserver1001 ~ $ sudo mv /var/lib/puppet/ssl/private_keys/puppetserver1001.eqiad.wmnet.pem /root/puppetserv... [13:39:50] (03PS2) 10DDesouza: Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) [13:40:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:42] (03CR) 10CI reject: [V:04-1] Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:43:12] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191359|objectcache: Add a hit/miss flag to CachedBagOStuff]], [[gerrit:1191360|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191361|hCaptcha: Fix mock for StatsFactory]], [[gerrit:1191362|NewcomerTasks: Use StatsFactory unit test helper]], [[gerrit:1191350|objectcache: Add a hit/miss flag to Cached [13:43:12] BagOStuff]], [[gerrit:1191351|session: Improve logging and monitoring in SessionStore implementations (T399195 T402808)]], [[gerrit:1191341|Fix subref attribute order (T389363)]] (duration: 16m 09s) [13:43:20] T399195: Update logging and monitoring for multiple session storage backends - 
https://phabricator.wikimedia.org/T399195 [13:43:21] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:43:22] T389363: Fix attribute order round-tripping for sub-references (dirty diff) - https://phabricator.wikimedia.org/T389363 [13:44:12] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11214358 (10elukey) p:05Unbreak!→03High The above fix worked, really nice save @taavi! Next steps (imho): * Create a simple cookbook to clean up certs... [13:44:39] (03CR) 10Ssingh: lvs1018: remove L2 sub-interface config for row E/F vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) (owner: 10Cathal Mooney) [13:44:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214360 (10phaultfinder) [13:46:02] danisztls: should I deploy 1191370 along with the other config change? 
[13:46:15] tgr_: yes, please [13:46:23] I can also self-deploy if you prefer [13:46:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:46:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [13:47:38] (03Merged) 10jenkins-bot: Enable multibackend session store on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191378 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:47:46] (03Merged) 10jenkins-bot: Pre-deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191370 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [13:47:58] (03PS3) 10DDesouza: Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) [13:48:12] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1191378|Enable multibackend session store on beta and testwiki (T402808)]], [[gerrit:1191370|Pre-deploy Design Research participant recruitment survey on jawiki (T405577)]] [13:48:20] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [13:48:50] kubestagemaster1003 and dse-k8s-etcd1002 will go down for a Ganeti node reboot [13:48:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet [13:49:01] (03PS1) 10Michael Große: fix: prevent type-error from outdated serialization [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191414 (https://phabricator.wikimedia.org/T405511) [13:50:24] PROBLEM
- Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:25] RESOLVED: SystemdUnitFailed: upload_puppet_facts.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:42] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:52:16] (03CR) 10DDesouza: [C:03+2] Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:52:28] (03CR) 10Ssingh: haproxy:cache: discard http1.0 requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur) [13:53:42] (03Merged) 10jenkins-bot: Create Reader foundational research phase III survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191405 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:54:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet [13:54:44] !log tgr@deploy1003 tgr, d3r1ck01, dani: Backport for [[gerrit:1191378|Enable multibackend session store on beta and testwiki (T402808)]], [[gerrit:1191370|Pre-deploy Design Research participant recruitment survey on jawiki (T405577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:54:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet [13:54:50] FIRING: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:51] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:54:52] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [13:55:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11214432 (10phaultfinder) [13:55:28] RECOVERY - Host kubestagemaster1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.48 ms [13:55:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye [13:55:44] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [13:55:57] FIRING: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:56:05] tgr_: looks good [13:59:09] RESOLVED: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:25] !log tgr@deploy1003 tgr, d3r1ck01, dani: Continuing with sync [13:59:42] (03CR) 10Fabfur: haproxy:cache: discard http1.0 requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur) [13:59:50] (03CR) 10Ssingh: "Looks good but let's run varnish tests on this one before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [13:59:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11214445 (10phaultfinder) [14:00:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet [14:00:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:04:24] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191378|Enable multibackend session store on beta and testwiki (T402808)]], [[gerrit:1191370|Pre-deploy Design Research participant recruitment survey on jawiki (T405577)]] (duration: 16m 11s) [14:04:32] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [14:04:33] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [14:06:13] !log UTC afternoon deploys done [14:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:20] ml-etcd1001 will go down for a Ganeti node reboot [14:06:27] !log jmm@cumin2002 START - Cookbook
sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet [14:08:51] (03PS1) 10CDanis: Search inside inline pattern values [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1191424 [14:08:52] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:09:41] (03PS1) 10DDesouza: Deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577) [14:10:07] (03CR) 10CDanis: [V:03+2 C:03+2] Search inside inline pattern values [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1191424 (owner: 10CDanis) [14:10:26] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [14:10:29] !log cdanis@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "feat: search inside inline pattern values - cdanis@cumin1003" [14:10:31] !log cdanis@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: search inside inline pattern values - cdanis@cumin1003 [14:11:20] !log cdanis@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: search inside inline pattern values - cdanis@cumin1003 [14:11:21] !log cdanis@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "feat: search inside inline pattern values - cdanis@cumin1003" [14:11:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet [14:11:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet [14:12:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet [14:13:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for 
host ganeti1043.eqiad.wmnet [14:14:04] (03CR) 10Ssingh: [C:03+1] haproxy:cache: discard http1.0 requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur) [14:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:14:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11214574 (10phaultfinder) [14:15:24] (03CR) 10Bking: opensearch-operator: move WMF-specific values to chart values.yaml (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190343 (https://phabricator.wikimedia.org/T404906) (owner: 10Bking) [14:16:02] jouncebot nowandnext [14:16:02] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [14:16:02] In 0 hour(s) and 13 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1430) [14:16:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:16:45] (03PS1) 10DDesouza: Reader foundational on enwiki (beta): Add additional config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) [14:17:46] (03PS2) 10Cathal Mooney: Nokia: support mixing of L2 and L3 subinterfaces on SR Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) [14:18:37] (03CR) 10DDesouza: [C:03+2] Reader foundational on enwiki (beta): Add additional config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [14:18:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet [14:18:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet [14:19:35] (03Merged) 10jenkins-bot: Reader foundational on enwiki (beta): Add additional config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [14:19:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:19:52] (03PS1) 10Slyngshede: P:cache::haproxy add dummy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1191428 [14:19:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214618 (10phaultfinder) [14:19:58] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:20:38] (03PS1) 10Cathal Mooney: ssw1-d8-eqiad: add bgp peerings to CR and Juniper spines [homer/public] - 10https://gerrit.wikimedia.org/r/1191429 (https://phabricator.wikimedia.org/T396063) [14:22:19] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum5001.eqsin.wmnet 
with OS trixie [14:22:34] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11214629 (10Krinkle) >>! In T122097#11214225, @Tgr wrote: >>>! In T122097#11212641, @Krinkle wrote: >> setting/changing... [14:22:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS trixie [14:24:05] (03CR) 10Fabfur: [C:03+1] "Thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/1191428 (owner: 10Slyngshede) [14:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:24:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214651 (10phaultfinder) [14:25:35] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy add dummy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1191428 (owner: 10Slyngshede) [14:26:11] (03CR) 10Fabfur: [C:03+2] haproxy:cache: discard http1.0 requests [puppet] - 10https://gerrit.wikimedia.org/r/1191392 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur) [14:27:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:27:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [14:30:05] Deploy window xLab Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1430) [14:30:15] (03PS4) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) [14:32:06] (03CR) 10MVernon: [C:03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:33:50] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602 (10cmooney) 03NEW p:05Triage→03Medium [14:37:03] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp1107.eqiad.wmnet are marked down but pooled: uploadlb_443: Servers cp1107.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:37:21] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp1107.eqiad.wmnet are marked down but pooled: uploadlb_443: Servers cp1107.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:37:27] uh? [14:37:30] fabfur: ^ [14:37:49] is anyone working on cp1107? [14:38:14] not really [14:38:28] mmm [14:38:35] fabfur: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/c936de676283b3d9e2ec4af46a82c570f7f98974 [14:38:41] can this be related to the above? [14:38:48] I don't see how but maybe the checks? [14:38:58] pybal does it issue http1.0 checks??? [14:39:02] ok let me revert it [14:39:09] yeah let's revert [14:39:17] and then look [14:39:19] before it gets worse [14:39:33] We do actually have an experimentation-related deployment to do. sukhe, fabfur: Would that be OK or do those alerts block a deployment? 
[14:39:46] (03PS1) 10Fabfur: Revert "haproxy:cache: discard http1.0 requests" [puppet] - 10https://gerrit.wikimedia.org/r/1191432 [14:39:58] phuedx: if you can wait for five minutes, that might be helpful [14:39:58] (03CR) 10Fabfur: [C:03+2] Revert "haproxy:cache: discard http1.0 requests" [puppet] - 10https://gerrit.wikimedia.org/r/1191432 (owner: 10Fabfur) [14:40:00] (03CR) 10Fabfur: [V:03+2 C:03+2] Revert "haproxy:cache: discard http1.0 requests" [puppet] - 10https://gerrit.wikimedia.org/r/1191432 (owner: 10Fabfur) [14:40:01] just to rule this out [14:40:04] sukhe: ACK [14:40:13] revert submitted [14:40:15] fabfur: let me know when puppet merge finishes [14:40:23] I will run agent on 1107 [14:40:57] it might very well be red herring but yes, let's not risk this one :] [14:40:58] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [14:41:04] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [14:41:31] revert merge finished [14:41:35] thanks [14:41:37] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:41:44] running puppet on A:cp ? 
[14:41:46] it would be shocking if we are doing HTTP1.0 there but who knows [14:41:52] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:41:57] fabfur: go for it [14:42:03] ack [14:42:15] we should check after we are done merging the revert [14:42:38] !log merging revert for HTTP1.0 discard on cp1107 [14:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:59] Sep 25 14:42:54 lvs1020 pybal[1450515]: [uploadlb_443 ProxyFetch] WARN: cp1109.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed (https://upload.wikimedia.org/varnish-fe-hc-5ebea9), 0.194 s [14:43:03] yeah [14:43:06] fabfur: roll it out to A:cp [14:43:13] {{doing}} [14:43:22] no batches [14:43:59] it was definitely that [14:44:17] depool threshold saving the day once again [14:44:21] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp1108.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1108.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:44:27] phuedx: definitely wait before we roll this out. thank you [14:44:50] fabfur: what's the progress? [14:44:56] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [14:45:52] I still can't believe it was actually HTTP1.0. 
or we put it in the wrong place but yeah [14:46:02] though varnish won [14:46:03] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:46:21] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:46:21] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:46:24] phew [14:46:29] sukhe: currently 40% [14:46:33] thanks <3 [14:46:57] fabfur: lesson for us I guess for next time: the old approach of disabling puppet on A:cp for even trivial changes [14:46:59] I'll open a ticket about this [14:47:00] (03Merged) 10jenkins-bot: rest-gateway: Tighten non mw-rest-php matches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191333 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [14:47:04] enabling on one and then going ahead [14:47:05] yep [14:47:10] FIRING: [6x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:47:17] ^ unrelated, this is durum [14:47:20] in that case we also had to wait for a probe to fail [14:48:01] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:48:10] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:48:27] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:49:22] sukhe: {{done}} [14:49:30] thanks! 
[14:49:41] (03PS3) 10Jelto: ceph: add module to sync a bucket locally [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) [14:49:48] phuedx: go ahead please :) [14:51:07] (03PS4) 10Jelto: ceph: add module to sync a bucket locally [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) [14:51:56] (03CR) 10Jelto: "thanks for the review, replies in the comments" [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:52:10] FIRING: [6x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:53:05] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [14:53:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [14:54:26] (03CR) 10Elukey: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [14:54:40] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [14:54:55] 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11214861 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [14:55:00] (03CR) 10TChin: [C:03+2] [eventgate-*] Bump to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [14:55:10] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7051/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:55:23] (03CR) 10MVernon: [C:03+1] ceph: add module to sync a bucket locally [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:57:23] (03Merged) 10jenkins-bot: [eventgate-*] Bump to v1.24.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191234 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [14:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.98%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [14:59:32] (03CR) 10Elukey: [C:03+1] Nokia: support mixing of L2 and L3 subinterfaces on SR Linux (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [14:59:36] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [14:59:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214908 (10phaultfinder) [15:00:04] brennen and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1500). 
[15:00:25] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11214910 (10cmooney) [15:00:27] jmm@cumin2002 drain-node (PID 416850) is awaiting input [15:00:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11214911 (10cmooney) [15:01:33] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [15:01:53] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [15:02:55] sukhe: Belated ACK. Thanks [15:03:13] sukhe@cumin1003 reimage (PID 4138186) is awaiting input [15:03:15] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214927 (10Jhancock.wm) moved one server to different breaker. holding to see if alert stops going off. 
[15:03:33] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11214928 (10Jhancock.wm) a:03Jhancock.wm [15:04:21] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum7003.magru.wmnet with OS trixie [15:04:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet [15:05:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11214933 (10phaultfinder) [15:05:47] !log brennen@deploy1003 Started deploy [phabricator/deployment@5d4a2bb]: deploy phab2002 for T404134 [15:05:53] T404134: Merge Phorge's upstream master (2025-09-08) into our wmf/stable - https://phabricator.wikimedia.org/T404134 [15:06:14] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11214942 (10Jhancock.wm) a:03Jhancock.wm moved one server to different breaker. holding to see if alert stops triggering. [15:06:28] !log brennen@deploy1003 Finished deploy [phabricator/deployment@5d4a2bb]: deploy phab2002 for T404134 (duration: 00m 41s) [15:08:01] (03PS5) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) [15:08:51] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11214950 (10elukey) @Jhancock.wm I cannot reboot the host, tried via console and BMC/Redfish API, it seems stuck in some weird limbo. If you have a moment could you please check it? 
[15:09:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11214952 (10phaultfinder) [15:10:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet [15:10:11] (03PS6) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) [15:10:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [15:10:53] (03CR) 10Jasmine: wmnet: update deployment CNAME record to deploy2002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [15:11:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609 (10cmooney) 03NEW p:05Triage→03Medium [15:11:04] !log brennen@deploy1003 Started deploy [phabricator/deployment@5d4a2bb]: deploy phab1004 for T404134 [15:11:10] T404134: Merge Phorge's upstream master (2025-09-08) into our wmf/stable - https://phabricator.wikimedia.org/T404134 [15:11:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet [15:11:23] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11214968 (10cmooney) [15:11:24] 10ops-eqiad, 06SRE, 06DC-Ops, 
06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11214967 (10cmooney) [15:11:59] (03PS3) 10Cathal Mooney: Nokia: support mixing of L2 and L3 subinterfaces on SR Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) [15:13:17] getting 'Unable to load the "Arcanist" library. Put "arcanist/" next to "phorge/" on disk.' from phabricator [15:13:33] Same here. [15:13:39] Was just about to report the same [15:13:56] we are deploying a new Phab version [15:13:57] yes Phabricator is getting a new version deploy, it should resolve soon [15:13:58] (03CR) 10Cathal Mooney: [C:03+2] Nokia: support mixing of L2 and L3 subinterfaces on SR Linux (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [15:14:53] !log brennen@deploy1003 Finished deploy [phabricator/deployment@5d4a2bb]: deploy phab1004 for T404134 (duration: 03m 49s) [15:15:22] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [15:15:36] (03Merged) 10jenkins-bot: Nokia: support mixing of L2 and L3 subinterfaces on SR Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [15:15:37] Phabricator should be back :) [15:16:01] Thanks jelto . [15:16:13] thanks jelto! [15:16:27] mostly brennen and andre :) I was just holding hands [15:16:30] Thanks! 
Just as an aside I needed to force refresh to get some styles to look normal [15:17:25] Though that may be completely unrelated to this :D [15:17:44] Dreamy_Jazz: probably related [15:18:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet [15:18:28] 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11214980 (10Jhancock.wm) @bking can you check the site.pp and preseed.yaml files for accuracy? the reimage cookbook is acting like there's a possible misconfig there. thank you! [15:18:48] sorry for downtime there all. slightly longer deploy than standard. [15:19:14] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [15:20:41] brennen: np and thanks for the new version. [15:22:08] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [15:22:32] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [15:23:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet [15:24:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet [15:28:15] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11215037 (10Jhancock.wm) @elukey found the server off. i could ping the BMC and login to it. I've powered it back up for you. 
[15:28:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215042 (10bking) a:03bking [15:30:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215048 (10bking) @Jhancock.wm I think we had a similar ticket for the same hardware in EQIAD (T399105) . I'll take a look there and see if we m... [15:30:40] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11215053 (10Jhancock.wm) [15:31:59] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [15:32:57] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [15:32:58] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device fasw1-f5a-codfw.mgmt.codfw.wmnet [15:33:00] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:34:06] !log sudo puppet node deactivate durum7003.magru.wmnet: stuck after reimage with failed puppet run [15:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:50] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:09] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11215096 (10phaultfinder) [15:35:17] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm [15:38:26] (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.0.5 release to staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1191393 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci) [15:38:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5a-codfw - pt1979@cumin2002" [15:38:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5a-codfw - pt1979@cumin2002" [15:38:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:12] (03Merged) 10jenkins-bot: xLab: Deploying v1.0.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191393 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci) [15:41:42] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5001.eqsin.wmnet with OS trixie [15:44:16] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:44:43] (03PS5) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [15:44:43] (03PS5) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [15:44:43] (03PS5) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [15:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:45:38] jouncebot: nowandnext [15:45:38] For the next 0 hour(s) and 14 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1500) [15:45:38] In 0 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600) [15:45:38] In 0 hour(s) and 14 minute(s): Deployment Server Switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600) [15:47:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:47:11] (03CR) 10Hnowlan: [C:03+1] wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [15:48:22] hi folks, just a reminder that we'll be switching the deployment server to codfw shortly [15:50:32] (03PS1) 10Bking: Add dse-k8s-worker2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1191441 (https://phabricator.wikimedia.org/T399778) 
[15:50:34] noting here that i have a couple of backports to handle before the train, will wait until after deployment server switchover. [15:51:47] (03PS2) 10Bking: Add dse-k8s-worker2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1191441 (https://phabricator.wikimedia.org/T399778) [15:54:12] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:54:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11215224 (10phaultfinder) [15:56:08] (03CR) 10Bking: [C:03+2] "trivial change and blocking DC Ops, so self-merging." [puppet] - 10https://gerrit.wikimedia.org/r/1191441 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking) [16:00:05] jhathaway and moritzm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:05] jasmine_: That opportune time for a Deployment Server Switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600). [16:01:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26), 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215247 (10bking) @Jhancock.wm it looks like the host was missing from site.pp. I've added it, and you should be good to g... 
[16:01:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26), 13Patch-For-Review: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215250 (10bking) a:05bking→03Jhancock.wm [16:02:45] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618 (10Papaul) 03NEW [16:04:50] FIRING: DiskSpace: Disk space deploy1003:9100:/srv 2.827% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:04] Please refrain from deploying or otherwise using the deploy servers until the all-clear is given [16:05:32] (03CR) 10Mstyles: [C:03+2] OATHAuth: Increase 2FA opt-in to 20% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191100 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [16:05:56] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [16:06:19] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [16:06:22] (03Merged) 10jenkins-bot: OATHAuth: Increase 2FA opt-in to 20% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191100 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [16:06:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26), 13Patch-For-Review: 
Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11215269 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2... [16:07:54] (03PS1) 10Clément Goubert: rest-gateway: Relax leading slash match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) [16:08:03] (03CR) 10CI reject: [V:04-1] rest-gateway: Relax leading slash match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) (owner: 10Clément Goubert) [16:08:09] (03PS2) 10Clément Goubert: rest-gateway: Relax leading slash match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191444 (https://phabricator.wikimedia.org/T405368) [16:08:45] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [16:09:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw1-f5a-codfw.mgmt.codfw.wmnet [16:09:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11215277 (10phaultfinder) [16:09:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11215278 (10phaultfinder) [16:11:38] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11215284 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are 3 servers on the EOL list. Two of them have already been replaced but waiting on exte... 
[16:13:16] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device fasw1-f5a-codfw [16:13:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw1-f5a-codfw [16:13:54] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device fasw1-f5b-codfw.mgmt.codfw.wmnet [16:13:56] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:15:07] !log sopped spiderpig-apiserver, spiderpig-jobrunner on deploy1003 [16:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:03] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11215304 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there are at least two servers in this rack that are on the EOL list and can be removed once th... [16:19:37] pt1979@cumin2002 provision (PID 457966) is awaiting input [16:22:45] !log jasmine@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet,releases1003.eqiad.wmnet with reason: Deployment server switchover [16:23:26] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5b-codfw - pt1979@cumin2002" [16:23:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for fasw1-f5b-codfw - pt1979@cumin2002" [16:23:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:44] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11215341 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rack had a huge spike that has setteled to threshold since the original switchover. rack has se... 
[16:24:16] (03CR) 10Jasmine: [C:03+2] wmnet: update deployment CNAME record to deploy2002 [dns] - 10https://gerrit.wikimedia.org/r/1190298 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [16:25:27] (03PS1) 10Clare Ming: xLab: instrument page visits with delayed events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447 [16:25:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447 (owner: 10Clare Ming) [16:25:55] !log jasmine@dns1004 START - running authdns-update [16:26:01] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum7003.magru.wmnet with OS bookworm [16:26:28] PROBLEM - ganeti-noded running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:27:30] !log jasmine@dns1004 END - running authdns-update [16:28:24] (03CR) 10Jasmine: [C:03+2] hieradata: update deployment_server to deploy2002 [puppet] - 10https://gerrit.wikimedia.org/r/1190300 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [16:28:28] RECOVERY - ganeti-noded running on ganeti-test2002 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:29:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405403#11215380 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm there is one server that is EOL and could be decommed. physically marking rack as full and addi... 
[16:30:55] (03PS2) 10Slyngshede: P:openldap::management add netbox-readonly-access to offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494) [16:31:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11215392 (10cmooney) [16:31:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11215393 (10cmooney) [16:31:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from deployment.eqiad.wmnet in ulsfo #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=ulsfo&var-cluster=text&var-origin=deployment.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:32:13] uhh almost certainly related to the switchover [16:32:13] <_joe_> hnowlan: tsk tsk [16:32:14] looking [16:32:16] <_joe_> ahah yes [16:32:22] ahh [16:32:23] spiderpig :) [16:32:26] Ooooh spiderpig [16:32:27] <_joe_> yep [16:33:09] apologies oncallers! [16:33:13] <_joe_> 🤌what the hell is a spider pig? 🤌 [16:33:26] It does whatever a spiderpig does [16:33:34] Alert silenced [16:34:02] _joe_: https://youtu.be/BARjPuUN36Y?si=cDSJfNGYmBWjjnEi&t=26 [16:34:39] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11215414 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm marking rack as full. adding to main tracking list. there are 4 servers in the EOL list in this... [16:35:32] !incidents [16:35:32] 6799 (ACKED) ATSBackendErrorsHigh cache_text sre (deployment.eqiad.wmnet ulsfo) [16:35:33] 6795 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet codfw) [16:36:15] ty! 
[16:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:08] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11215456 (10Jhancock.wm) a:03Jhancock.wm moved one server to different breaking. holding to see if resolved. [16:40:44] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:18] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:41:51] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:46:04] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11215467 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm moved power of one server to different breaker. marking rack as physically full for now. But th... 
[16:46:32] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:47:12] (03PS1) 10Santiago Faci: xLab: Deploying v1.0.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191449 (https://phabricator.wikimedia.org/T385180) [16:52:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623 (10RobH) 03NEW [16:52:22] 06SRE, 05MW-1.45-notes (1.45.0-wmf.21; 2025-09-30), 13Patch-For-Review, 03Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic... - https://phabricator.wikimedia.org/T404204#11215516 [16:52:55] !log jasmine@deploy2002 Started scap sync-world: Test deployment to validate deployment server switchover - T399891. [16:53:01] T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891 [16:54:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw1-f5b-codfw.mgmt.codfw.wmnet [16:56:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11215549 (10RobH) @BCornwall, Congrats, since we've worked together on so many other projects previously I made the #traffic team's host migration tracking task first! As such, we ma... [16:57:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11215552 (10RobH) [16:58:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11215553 (10EMill-WMF) Yes, I can see the dashboards I was intending to now! Thank you very much for everyone who helped resolve my issue, and apologies f... 
[16:58:44] (03PS2) 10Stevemunene: admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) [16:59:26] (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [17:00:05] jasmine_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Deployment Server Switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600). [17:00:05] bd808: How many deployers does it take to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1700). [17:00:33] deployment switchover still in progress, nearly done :) [17:00:50] ty! [17:01:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11215627 (10Dzahn) Thanks @EMill-WMF for confirming that. Great! And no need to apologize. The whole thing was about how that process is confusing even f... 
[17:02:07] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device fasw1-f5b-codfw [17:02:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw1-f5b-codfw [17:02:53] (03PS3) 10Stevemunene: admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) [17:07:18] (03PS1) 10Jcrespo: mariadb: Add new grants for dbprov1007 & dbprov2007 backups [puppet] - 10https://gerrit.wikimedia.org/r/1191451 (https://phabricator.wikimedia.org/T403166) [17:11:39] (03PS1) 10DDesouza: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) [17:12:31] (03CR) 10CI reject: [V:04-1] Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [17:12:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [17:14:31] (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [17:14:33] (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [17:14:54] Amir1: thank you for deploying the mariadb template thing the other day. I appreciated that. 
[17:15:03] (03CR) 10DDesouza: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [17:15:09] (03PS1) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [17:17:44] (03PS2) 10DDesouza: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) [17:18:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628 (10cmooney) 03NEW p:05Triage→03Medium [17:18:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11215779 (10cmooney) [17:18:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215780 (10cmooney) [17:19:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11215792 (10cmooney) [17:21:45] (03PS1) 10Clément Goubert: Revert "rest-gateway: Tighten non mw-rest-php matches" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191456 [17:21:58] (03CR) 10Clément Goubert: [C:03+2] Revert "rest-gateway: Tighten non mw-rest-php matches" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191456 (owner: 10Clément Goubert) [17:23:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - 
https://phabricator.wikimedia.org/T405366#11215807 (10Dzahn) To be pragmatic.. let's just start with the lowest level of (that type of) access and see in practice if you run into any blockers. Level... [17:23:42] (03Merged) 10jenkins-bot: Revert "rest-gateway: Tighten non mw-rest-php matches" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191456 (owner: 10Clément Goubert) [17:25:31] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:25:40] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:25:47] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630 (10cmooney) 03NEW p:05Triage→03Medium [17:26:02] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215830 (10cmooney) [17:26:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215831 (10cmooney) [17:27:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215832 (10cmooney) [17:27:01] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215833 (10cmooney) [17:32:23] !log jasmine@deploy2002 Finished scap sync-world: Test deployment to validate deployment server switchover - T399891. (duration: 39m 28s) [17:32:29] T399891: 🚀 Southward Datacenter Switchover (Sept. 
2025) - https://phabricator.wikimedia.org/T399891 [17:33:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632 (10cmooney) 03NEW p:05Triage→03Medium [17:33:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11215881 (10cmooney) [17:33:23] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215882 (10cmooney) [17:34:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.905s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:34:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:35:00] (03PS1) 10Dzahn: admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) [17:36:29] (03PS2) 10Dzahn: admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) [17:37:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - 
https://phabricator.wikimedia.org/T405632#11215886 (10cmooney) [17:37:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215890 (10Dzahn) 05Open→03In progress a:03Dzahn [17:39:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215905 (10cmooney) [17:39:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.905s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:40:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11215913 (10cmooney) [17:41:19] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215925 (10Dzahn) To be pragmatic I am going with a "one level upgrade" from the lowest to the second lowest level here: https:... 
[17:41:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling required for fr-tech expansion and row a/b switch refresh - https://phabricator.wikimedia.org/T402432#11215927 (10cmooney) [17:41:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11215928 (10cmooney) [17:42:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215944 (10cmooney) [17:42:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215945 (10cmooney) [17:42:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11215946 (10cmooney) [17:42:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215947 (10cmooney) [17:43:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215948 (10Dzahn) @TLessa-WMF Could you take a look at signing L3 while the code change I uploaded is in review? 
Cheers [17:43:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11215950 (10cmooney) [17:44:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson) [17:44:27] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:45:06] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215952 (10Dzahn) [17:45:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11215953 (10BCornwall) p:05Triage→03Medium [17:45:16] (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.0.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191449 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci) [17:46:06] update: deployment switchover is now complete :) folks should be able to deploy from deploy2002 now, let us know if you notice anything [17:46:32] congrats jasmine_ [17:46:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11215958 (10Dzahn) [17:46:52] Yay! [17:47:13] tyty, (and thanks h.nowlan and c.laime for shadowing!) 
[17:47:18] (03Merged) 10jenkins-bot: xLab: Deploying v1.0.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191449 (https://phabricator.wikimedia.org/T385180) (owner: 10Santiago Faci) [17:47:21] fwiw, /srv/patches should have been synced automatically.. for the security guys [17:49:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11215964 (10BTracy-WMF) That sounds like the right path forward. Thank you! [17:49:15] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11215963 (10cmooney) [17:49:43] jasmine_: ty! [17:50:00] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11215967 (10cmooney) [17:50:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11215968 (10cmooney) [17:50:19] jouncebot nowandnext [17:50:19] For the next 0 hour(s) and 9 minute(s): Deployment Server Switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1600) [17:50:19] For the next 0 hour(s) and 9 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1700) [17:50:19] In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1800) [17:52:05] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#11215973 (10cmooney) 05Open→03Declined Gonna close this one for now. 
Doing it in our YAML data for the occasional virtual-chassis... [17:52:51] going ahead with a couple of backports, then train. [17:54:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191334 (https://phabricator.wikimedia.org/T405514) (owner: 10Sergio Gimeno) [17:54:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191414 (https://phabricator.wikimedia.org/T405511) (owner: 10Michael Große) [17:56:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [17:56:34] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637 (10cmooney) 03NEW p:05Triage→03Medium [17:56:47] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [17:57:12] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637#11216041 (10cmooney) [17:57:16] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216042 (10cmooney) [17:57:29] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [17:57:34] (03CR) 10Muehlenhoff: [C:03+1] "Needs manager approval on task, but patch looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) (owner: 10Dzahn) [17:59:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - 
https://phabricator.wikimedia.org/T405129#11216044 (10Dzahn) @TLessa-WMF We need one more thing. Please get your manager to approve here on this ticket. Thank you [18:00:07] brennen and dduvall: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1800). [18:01:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216104 (10Dzahn) Great! I am going ahead. Meanwhile.. please check the box that you have read https://wikitech.wikimedia.org/wiki/Data_Platform/Data_acce... [18:02:29] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640 (10cmooney) 03NEW p:05Triage→03Medium [18:04:16] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640#11216198 (10cmooney) [18:04:22] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216197 (10cmooney) [18:04:38] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11216208 (10Jhancock.wm) [18:04:51] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Create script to allow multiple host migrations from old -> new switch - https://phabricator.wikimedia.org/T405640#11216214 (10cmooney) [18:04:53] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: General updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11216215 (10cmooney) [18:08:31] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [18:09:05] 10ops-codfw, 06SRE, 
06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11216318 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet wi... [18:09:45] (03Merged) 10jenkins-bot: fix: provide a eventType fallback for already scheduled jobs [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191334 (https://phabricator.wikimedia.org/T405514) (owner: 10Sergio Gimeno) [18:09:47] (03Merged) 10jenkins-bot: fix: prevent type-error from outdated serialization [extensions/GrowthExperiments] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191414 (https://phabricator.wikimedia.org/T405511) (owner: 10Michael Große) [18:11:03] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [18:11:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11216380 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [18:11:16] hmm, unexpected commits in mediawiki-staging [18:12:37] (03CR) 10Dzahn: [C:03+1] "I left some inline comments but they can all be taken as optional/nitpicks/comments. Compiler compiles and seems reasonable:" [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [18:14:17] brennen: jasmine_: that sounds like one of the rsync timers/services has not run yet or failed [18:14:32] yeah, my guess is these have already been deployed. [18:14:37] based on timing of patches. 
[18:14:41] (03PS1) 10Aklapper: Phabricator: Update recipients of weekly Tech News mail [puppet] - 10https://gerrit.wikimedia.org/r/1191468 (https://phabricator.wikimedia.org/T405638) [18:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [18:15:03] which... probably means it's fine? [18:15:49] puppet class deployment::rsync has the stuff that syncs automatically between deployment servers [18:16:01] it handles /srv/deployment and /srv/patches [18:16:14] maybe staging is not handled [18:17:03] it's possible this should be added (rsync with --delete ?) to keep them identical. [18:17:51] plausible. my mental model at the moment is that staging was a bit out of date but those changes were fetched down so it should now be in sync. [18:17:52] `/usr/local/bin/scap-master-sync` is supposed to do that. I added that as a step in https://wikitech.wikimedia.org/wiki/Switch_Datacenter/DeploymentServer#Procedure last week [18:18:16] what is rsync host and rsync dest switches with $deployment_server [18:19:11] I see. Optionally we could add a path to the puppet class so that the same code handles all. 
[18:19:50] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:20:03] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11216493 (10phaultfinder) [18:20:28] brennen: are you trying the scap-master-sync then? [18:20:40] (03CR) 10Quiddity: [C:03+1] Phabricator: Update recipients of weekly Tech News mail [puppet] - 10https://gerrit.wikimedia.org/r/1191468 (https://phabricator.wikimedia.org/T405638) (owner: 10Aklapper) [18:20:48] i'm mid-deploy job on spiderpig [18:21:12] Scap runs scap-master-sync of any `scap sync-*` operation. [18:21:19] s/of/during/ [18:21:27] could say no here, but i think it should be fine to proceed under the assumption the 2 extra config changes have already been deployed a while ago. [18:21:29] *nod* to both of you. alrighty [18:23:11] brennen: I say go [18:23:13] (03PS1) 10Dzahn: admin: add user btracy with privateadata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) [18:23:48] only 2 changes and its been verified that they are already deployed.. sounds ok [18:23:56] hmm.. [18:24:14] * dancy checks the sync script [18:24:20] `tree /srv/patches` looks the same on both boxen [18:24:22] well, I guess we jumped from assumption to verified there :P [18:24:25] at least for # of files [18:24:34] mutante: both were merged earlier. 
[18:24:34] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1191334|fix: provide a eventType fallback for already scheduled jobs (T405514)]], [[gerrit:1191414|fix: prevent type-error from outdated serialization (T405511)]] [18:24:43] T405514: InvalidArgumentException: 'type' parameter is mandatory - https://phabricator.wikimedia.org/T405514 [18:24:43] T405511: TypeError: GrowthExperiments\NewcomerTasks\Task\TaskSet::__construct(): Argument #4 ($filters) must be of type GrowthExperiments\NewcomerTasks\Task\TaskSetFilters, array given - https://phabricator.wikimedia.org/T405511 [18:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:25:13] yea, /srv/patches is handled by puppet for sure. I once added that. [18:25:20] rsync-patches_module.timer [18:25:38] systemctl cat rsync-patches_module.service [18:26:15] cat /usr/local/sbin/sync-patches_module [18:26:28] it has --delete [18:26:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216519 (10BTracy-WMF) [18:26:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216524 (10BTracy-WMF) I've read the guidelines and updated this request to reflect. Thanks, again. 
[18:27:59] The scap-sync-master script looks right too [18:28:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647 (10RobH) 03NEW [18:29:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216543 (10Dzahn) Great. Thanks! I have uploaded a code change to make it happen and it's in review now. [18:30:23] hrm, crap. maybe these didn't get deployed. e.g. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1191100 wasn't deployed with `scap backport`. [18:31:56] maryum: Are you around? [18:32:01] < jinxer-wm> FIRING: SystemdUnitFailed: rsync-srv-patches-releases2003.codfw.wmnet.service on releases2003:9100 [18:32:12] releases hosts are also pulling srv/patches from deployment [18:32:15] the other one is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1191427 [18:32:35] and somehow that deployment_server switch and timing with puppet runs or whatever made it fail [18:32:47] oh wait, both for beta and not synced, perhaps? [18:33:23] brennen: The protocol is that even beta-only changes should be backported (where scap backport itself will short-circuit the process).. but I don't know if that protocol was followed here. [18:33:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11216609 (10RobH) a:03BCornwall [18:33:31] Doesn't seem so [18:34:05] 1191100 was +2'd directly after the "do not deploy anything, deployment server is being switched over" message :/ [18:34:07] yeah, beta-only change not using scap backport would explain the 1191427 one. [18:34:13] Two unusual config deployments at a particularly weird time. 
[18:34:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216626 (10Dzahn) 05Open→03In progress [18:34:54] 1191100 not so much. [18:34:59] probably best to just revert? [18:35:12] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11216632 (10Dzahn) a:03Dzahn [18:36:13] (03CR) 10Majavah: "Hi, please note that patches need to be pulled down to the deployment server (either with `scap backport` or manually) even if they only t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [18:37:10] I'm here! [18:37:12] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216658 (10Dzahn) [18:37:22] (03PS4) 10Daniel Kinzler: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 [18:37:27] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216659 (10Dzahn) p:05Triage→03High [18:37:40] maryum: happen to know if that config change had already been deployed? [18:37:43] (03PS4) 10Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [18:37:56] no it hadn't, I forgot I needed to schedule a backport window [18:38:01] apologies for breaking things [18:38:27] not a huge problem - we can either stop this deploy and do a revert, or go ahead with it if you're ok with it being deployed now [18:38:29] how can I help? 
[18:38:39] if you can deploy it now that would be great [18:38:40] i see it's just changing a percentage that was already at 10 so it doesn't _seem_ super risky to me? [18:38:44] it's not risky [18:38:51] ok, let's go ahead with it, if you don't mind waiting around a bit just in case. [18:38:55] yeah I'm here [18:39:02] cool, thanks all. [18:39:39] (03CR) 10CI reject: [V:04-1] apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [18:40:12] (sorry for the red herrings re: deployment server switchover.) [18:40:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) (owner: 10Dzahn) [18:40:52] Crisis averted! [18:44:58] :) [18:46:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:46:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:10] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216717 (10Dzahn) NDA confirmed. Taking care of the LDAP group memberships. @WMDE-leszek I am not familiar with Superset SQL lab and the `analytics-privatedata-users` group can be confi... [18:54:17] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1191483 [18:54:22] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216719 (10Dzahn) I am starting out with the lower level of access. .this should take care of dashboards and web logins that also work for other WMDE staff. We can go from there. 
[18:54:44] (03CR) 10CI reject: [V:04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis) [18:55:47] !log LDAP - added member: uid=elishacohenwmde,ou=people,dc=wikimedia,dc=org [18:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:05] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1191483 [18:56:10] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis) [18:56:22] !log LDAP - added uid=elishacohenwmde to 'wmde' and 'nda' T404359 [18:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:28] T404359: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359 [18:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (97.36%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [19:00:39] (03PS1) 10Dzahn: admin: add elishacohenwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1191485 (https://phabricator.wikimedia.org/T404359) [19:02:28] (03PS1) 10Dduvall: gitlab runners: Allow new buildkit-syntax-forwarder gateway [puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651) [19:03:48] (03PS3) 10CDanis: Export Prometheus metrics for MW primary DC & read only [puppet] - 10https://gerrit.wikimedia.org/r/1191483 [19:04:41] !log brennen@deploy2002 sgimeno, migr, brennen: Backport for [[gerrit:1191334|fix: provide a eventType fallback for already 
scheduled jobs (T405514)]], [[gerrit:1191414|fix: prevent type-error from outdated serialization (T405511)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:04:50] T405514: InvalidArgumentException: 'type' parameter is mandatory - https://phabricator.wikimedia.org/T405514 [19:04:50] T405511: TypeError: GrowthExperiments\NewcomerTasks\Task\TaskSet::__construct(): Argument #4 ($filters) must be of type GrowthExperiments\NewcomerTasks\Task\TaskSetFilters, array given - https://phabricator.wikimedia.org/T405511 [19:04:58] (03PS4) 10CDanis: Export Prometheus metrics for MW primary DC & read only [puppet] - 10https://gerrit.wikimedia.org/r/1191483 [19:05:29] !log brennen@deploy2002 sgimeno, migr, brennen: Continuing with sync [19:09:31] (03PS2) 10Dduvall: gitlab runners: Allow new buildkit-syntax-forwarder gateway [puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651) [19:09:59] 10ops-codfw, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405654 (10phaultfinder) 03NEW [19:11:41] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [19:11:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.05 - 2025.09.26): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11216802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with O... [19:12:01] hrm. this is certainly taking its own sweet time. [19:12:24] (03CR) 10Dzahn: [C:03+2] "self-merging this because I already did the LDAP groups.. 
follow-up with an upgrade to analytics-privatedata-users tbd" [puppet] - 10https://gerrit.wikimedia.org/r/1191485 (https://phabricator.wikimedia.org/T404359) (owner: 10Dzahn) [19:15:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11216808 (10Dzahn) @WMDE-leszek @ECohen_WMDE All the things tied to the "wmde"/"nda" LDAP groups (ability to merge in WMDE repos, web logins,..) should already wor... [19:15:47] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530#11216812 (10Jhancock.wm) 05Open→03Resolved [19:16:11] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11216814 (10Jhancock.wm) 05Open→03Resolved [19:18:43] 10ops-codfw, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405654#11216834 (10Jhancock.wm) 05Open→03Invalid [19:19:03] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191334|fix: provide a eventType fallback for already scheduled jobs (T405514)]], [[gerrit:1191414|fix: prevent type-error from outdated serialization (T405511)]] (duration: 54m 29s) [19:19:13] T405514: InvalidArgumentException: 'type' parameter is mandatory - https://phabricator.wikimedia.org/T405514 [19:19:14] T405511: TypeError: GrowthExperiments\NewcomerTasks\Task\TaskSet::__construct(): Argument #4 ($filters) must be of type GrowthExperiments\NewcomerTasks\Task\TaskSetFilters, array given - https://phabricator.wikimedia.org/T405511 [19:19:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216840 (10Dzahn) a:03Dzahn [19:19:46] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. 
- https://phabricator.wikimedia.org/T405124#11216842 (10EBomani) Thank you so much, @thcipriani, @Dzahn and @Aklapper! Also about changing the email, I went to the link (and even through my settings) but was unable to update it. My ema... [19:20:22] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191491 (https://phabricator.wikimedia.org/T396381) [19:20:25] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191491 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot) [19:21:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216865 (10EBomani) Actually, might also be the case that my other username on here (accidentally created when I was no longer a contractor) is the issue. [19:21:24] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191491 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot) [19:24:22] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [19:24:46] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216875 (10Dzahn) Ah, there are 2 users indeed! ` MariaDB [phabricator_user]> SELECT u.userName, ue.address, ue.isPrimary FROM phabricator_user.user u JOIN phabricator_user.user_email ue WH... 
[19:29:49] jhathaway@cumin2002 reimage (PID 549667) is awaiting input [19:30:21] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [19:31:21] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [19:31:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11216902 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm executed with errors: -... [19:33:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:35:14] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.20 refs T396381 [19:35:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. 
- https://phabricator.wikimedia.org/T405124#11216920 (10Dzahn) [19:35:20] T396381: 1.45.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T396381 [19:36:23] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:36:36] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:42:00] (03PS4) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [19:42:09] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:43:34] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11216931 (10Dzahn) Hey @EBomani For a moment, don't worry about the 2 Phabricator users. We can deal with this but treat the actual deployment access separately. Most boxes are checked. You h... [19:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:45:32] (03PS5) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [19:46:50] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:48:10] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. 
- https://phabricator.wikimedia.org/T405124#11216953 (10Dzahn) @EBomani I am sending you an email to your new, non-contractor email account. Please take a look. [19:49:57] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:52:28] (03CR) 10Dzahn: [C:03+2] Phabricator: Update recipients of weekly Tech News mail [puppet] - 10https://gerrit.wikimedia.org/r/1191468 (https://phabricator.wikimedia.org/T405638) (owner: 10Aklapper) [19:53:29] (03PS6) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [19:54:49] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:55:14] jouncebot now [19:55:14] For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T1800) [19:56:12] train ops have wrapped up and things seem stable. [19:56:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:56:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:57:12] Mind if we start the backport window a few minutes early? 
[19:57:51] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve matching for users renamed multiple times [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) [19:59:27] (03CR) 10Dzahn: "rebasing so it can get done without waiting for the other request that still needs approval" [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T2000). [20:00:05] danisztls, cjming, and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191495 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:00:45] (03PS7) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [20:00:48] o/ I can self-deploy [20:01:24] danisztls do you mind if I start, I know you're before me in the queue but I have a bit of an urgent situation? [20:01:25] (03CR) 10Ottomata: [C:03+1] "VERY COOL." [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis) [20:01:27] cjming: you can go ahead of me, I'm doing a small change on one of my patches [20:01:33] stephanebisson: I don't mind [20:01:43] danisztls thanks!
[20:02:09] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:03:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson) [20:03:49] hi ! cool thanks! [20:03:59] (03Merged) 10jenkins-bot: Special:Contribute: configure new page target title for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191377 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson) [20:04:19] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1191377|Special:Contribute: configure new page target title for enwiki (T327063)]] [20:04:26] T327063: Adjust "New page" option of the Contribute options to point to a community page when it exists - https://phabricator.wikimedia.org/T327063 [20:04:45] i added a maintenance script run to the window, i'd appreciate if someone could start it for me once you're done with the more important deployments [20:04:50] FIRING: DiskSpace: Disk space deploy1003:9100:/srv 2.826% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:57] (03PS2) 10Dzahn: admin: add user btracy with privateadata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) [20:05:18] the Disk space alert on deploy1003 will also be 
somehow related to the deployment server switch [20:05:31] (03PS3) 10DDesouza: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) [20:05:41] danisztls: lmk when you're done after stephanebisson finishes [20:06:28] and if you're not ready by then, i can do my backport [20:06:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:06:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:07:52] brennen: jnuche: deploy1003 is somehow at 98% usage on /srv. so the deployment server that is not active anymore since today.. it's using more than twice the space of deploy2002 [20:08:44] suddenly alerted while deploy2002 is in use.. dunno yet if just gradual build up or something happened just now [20:09:34] scap-master-sync possibly related since as you said earlier it runs each time with scap ? [20:09:44] mutante: well that's... weird. [20:10:28] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1191377|Special:Contribute: configure new page target title for enwiki (T327063)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:10:34] T327063: Adjust "New page" option of the Contribute options to point to a community page when it exists - https://phabricator.wikimedia.org/T327063 [20:10:49] !log sbisson@deploy2002 sbisson: Continuing with sync [20:12:15] (03PS1) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [20:12:49] (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [20:14:55] brennen: it's all about /srv/docker [20:14:57] (03PS8) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [20:15:14] that was my guess [20:15:50] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191377|Special:Contribute: configure new page target title for enwiki (T327063)]] (duration: 11m 31s) [20:15:57] T327063: Adjust "New page" option of the Contribute options to point to a community page when it exists - https://phabricator.wikimedia.org/T327063 [20:15:58] https://phabricator.wikimedia.org/T401647 [20:16:19] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis) [20:16:20] (03PS2) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [20:16:21] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:16:32] !log deploy1003 alerted because /srv/ is at 98% - T401647 [20:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:38] T401647: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647 [20:16:43] danisztls, cjming I'm done. Sorry for jumping the queue [20:16:49] (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [20:16:56] stephanebisson: no problem, I already started my deploy [20:17:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:17:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:17:38] stephanebisson: thanks! [20:17:46] danisztls: do you want to go next? 
[20:17:50] (03PS9) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [20:17:53] (03Merged) 10jenkins-bot: Pre-deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191454 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:18:02] (03Merged) 10jenkins-bot: Deploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191425 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:18:25] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1191454|Pre-deploy reader foundational survey on enwiki (T405410)]], [[gerrit:1191425|Deploy Design Research participant recruitment survey on jawiki (T405577)]] [20:18:33] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:18:33] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:18:45] (03CR) 10CDanis: [C:03+2] Export Prometheus metrics for MW primary DC & read only [puppet] - 10https://gerrit.wikimedia.org/r/1191483 (owner: 10CDanis) [20:20:09] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11217099 (10Dzahn) The alert happened during one of the first deploys after `deploy2002` became the active deployment server today. Seeing the other deployment server alert suddenly made me look. Thinking it wa...
[20:20:37] (03PS2) 10BCornwall: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [20:21:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11217124 (10Papaul) [20:24:34] !log dani@deploy2002 dani: Backport for [[gerrit:1191454|Pre-deploy reader foundational survey on enwiki (T405410)]], [[gerrit:1191425|Deploy Design Research participant recruitment survey on jawiki (T405577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:24:43] 06SRE, 06serviceops-radar, 06Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#11217157 (10Dzahn) We got a new alert today. It was deploy1003 at 98% on /srv/. Still about /srv/docker. 
[20:24:43] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:24:43] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:26:02] !log dani@deploy2002 dani: Continuing with sync [20:26:03] (03PS1) 10CDanis: Mediawiki Etcd Prometheus: use datacenter label [puppet] - 10https://gerrit.wikimedia.org/r/1191501 [20:26:34] (03CR) 10CDanis: [C:03+2] Mediawiki Etcd Prometheus: use datacenter label [puppet] - 10https://gerrit.wikimedia.org/r/1191501 (owner: 10CDanis) [20:28:49] (03PS3) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [20:28:53] cjming: all yours [20:29:16] (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [20:29:58] (03PS1) 10Krinkle: [WIP] varnish: No-op for CI [puppet] - 10https://gerrit.wikimedia.org/r/1191502 [20:31:00] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191454|Pre-deploy reader foundational survey on enwiki (T405410)]], [[gerrit:1191425|Deploy Design Research participant recruitment survey on jawiki (T405577)]] (duration: 12m 35s) [20:31:09] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:31:09] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:31:44] ty! 
[20:32:19] (03PS2) 10Krinkle: [WIP] varnish: No-op for CI [puppet] - 10https://gerrit.wikimedia.org/r/1191502 [20:32:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447 (owner: 10Clare Ming) [20:34:37] !log [releases2003:~] $ sudo systemctl reset-failed - monitoring alerted about failed rsync from deploy1003 after active deployment server switched to deploy2002 today - T405646 [20:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:43] T405646: SystemdUnitFailed - rsync on releases2003 - https://phabricator.wikimedia.org/T405646 [20:35:13] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [20:35:49] (03CR) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [20:36:33] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 [20:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:51] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 [20:38:55] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [20:39:03] (03CR) 10Dzahn: [C:03+2] admin: add user btracy with privateadata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191472 (https://phabricator.wikimedia.org/T405366) (owner: 10Dzahn) [20:40:43] (03Merged) 10jenkins-bot: 
xLab: instrument page visits with delayed events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191447 (owner: 10Clare Ming) [20:40:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11217235 (10Dzahn) Hey @BTracy-WMF You have been added to the `analytics-privatedata-users` group as requested. Let us know if you ne... [20:41:03] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1191447|xLab: instrument page visits with delayed events]] [20:43:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11217244 (10Dzahn) 05In progress→03Resolved You can try out Superset now. [20:44:05] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [20:44:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.499s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:44:29] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [20:44:54] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11217247 (10Dzahn) [20:46:35] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [20:47:00] (03PS1) 10TChin: [eventgate_*] Bump eventgate to v1.25.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191506 
(https://phabricator.wikimedia.org/T403169) [20:47:03] !log cjming@deploy2002 cjming: Backport for [[gerrit:1191447|xLab: instrument page visits with delayed events]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:47:19] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [20:47:27] !log cjming@deploy2002 cjming: Continuing with sync [20:48:36] (03Abandoned) 10Krinkle: [WIP] varnish: No-op for CI [puppet] - 10https://gerrit.wikimedia.org/r/1191502 (owner: 10Krinkle) [20:48:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11217261 (10Dzahn) a:05Dzahn→03cmadeo Hello @cmadeo Dayforce says you are the manager of @TLessa-WMF and to complete this acc... [20:49:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-int releases routed via main (k8s) 1.04s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:49:47] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217265 (10Dzahn) a:05Dzahn→03EBomani [20:51:09] (03PS4) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [20:52:20] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191447|xLab: instrument page visits with delayed events]] (duration: 11m 17s) [20:52:49] MatmaRex: all yours! [20:53:06] (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [20:53:27] thanks. 
i don't have deployment access, is anyone around who could deploy my thing? [20:53:36] eh, i guess it's almost the end of the window already [20:53:39] (03PS1) 10Dzahn: admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359) [20:53:41] i'll schedule it for monday :) [20:53:44] oh! i can do it [20:53:55] (03CR) 10CI reject: [V:04-1] admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359) (owner: 10Dzahn) [20:54:24] MatmaRex: want me to deploy your stuff? [20:54:27] cjming: thanks, i think i'll reschedule it for monday, i don't want to sit here until late myself either [20:54:32] (03PS2) 10Dzahn: admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359) [20:54:41] alrighty [20:54:57] thanks for the offer [20:55:36] (03PS3) 10Dzahn: admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359) [20:55:52] !log end of UTC late backport window [20:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250925T2100) [21:01:53] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11217325 (10jhathaway) >>! In T404356#11209467, @MatthewVernon wrote: > Does `/boot` even need to be on a separate partition for UEFI... 
[21:02:42] (03PS3) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 [21:04:54] (03PS4) 10Krinkle: Disable wmgUseMdotRouting on misc *.wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) [21:05:03] (03PS5) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) [21:05:43] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11217341 (10jhathaway) >>! In T404356#11184299, @elukey wrote: > The host doesn't PXE/HTTP boot for some reason, I reopened the provi... [21:07:45] (03CR) 10DDesouza: [C:03+2] "That makes sense. Sorry about that. I was under the wrong impression that patches to labs bypassed the normal backporting process. 
Next ti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191427 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [21:09:48] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [21:10:40] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Docker [21:12:30] (03PS1) 10DDesouza: Deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410) [21:18:20] (03CR) 10RLazarus: [C:03+2] wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) (owner: 10RLazarus) [21:20:01] (03Merged) 10jenkins-bot: wikifeeds: Remove envoy image_version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191203 (https://phabricator.wikimedia.org/T368366) (owner: 10RLazarus) [21:21:44] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [21:21:53] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [21:23:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.566s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:24:31] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [21:24:50] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [21:24:50] (03PS2) 10DDesouza: Deploy reader foundational survey on enwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1191510 (https://phabricator.wikimedia.org/T405410) [21:25:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:27:51] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772 [21:27:57] T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772 [21:28:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.566s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:29:15] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [21:29:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.625s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:29:42] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [21:32:54] (03PS1) 10Ryan Kemper: wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) [21:33:30] (03PS2) 10Ryan Kemper: wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 
(https://phabricator.wikimedia.org/T395772) [21:33:46] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:34:09] (03CR) 10Bking: [C:03+1] wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:34:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.625s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:34:39] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11217377 (10RLazarus) 05Open→03Resolved 1.23 is gone. 
🎉 [21:37:19] (03PS3) 10Ryan Kemper: wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) [21:39:25] (03PS1) 10Krinkle: Disable inert MobileFrontend on misc wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) [21:40:08] (03CR) 10CI reject: [V:04-1] Disable inert MobileFrontend on misc wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle) [21:40:37] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:42:28] (03CR) 10Ryan Kemper: [C:03+2] wdqs: these hosts no longer in wdqs-public [puppet] - 10https://gerrit.wikimedia.org/r/1191513 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:43:16] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772 [21:43:23] T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772 [21:50:20] 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217414 (10Krinkle) [21:50:28] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101#11217420 (10RLazarus) 05Open→03Resolved [21:50:57] (03PS2) 10Krinkle: Disable inert MobileFrontend on misc wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) [21:52:05] (03PS3) 10Krinkle: Disable inert MobileFrontend on wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 
(https://phabricator.wikimedia.org/T152882) [21:53:23] (03PS4) 10Krinkle: Disable inert MobileFrontend on wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) [21:54:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle) [21:55:35] (03Merged) 10jenkins-bot: Disable inert MobileFrontend on wikimedia.org wikis that lack DNS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle) [21:55:57] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1191514|Disable inert MobileFrontend on wikimedia.org wikis that lack DNS (T152882)]] [21:56:03] T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 [22:02:09] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1191514|Disable inert MobileFrontend on wikimedia.org wikis that lack DNS (T152882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:02:15] T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 [22:04:41] !log krinkle@deploy2002 krinkle: Continuing with sync [22:05:09] (03PS6) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) [22:07:56] (03PS1) 10RLazarus: mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663) [22:07:59] (03PS1) 10RLazarus: mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663) [22:09:49] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191514|Disable inert MobileFrontend on wikimedia.org wikis that lack DNS (T152882)]] (duration: 13m 52s) [22:09:56] T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 [22:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [22:16:20] (03PS1) 10Ryan Kemper: wdqs: shift old full graph hosts to wdqs-main [puppet] - 10https://gerrit.wikimedia.org/r/1191525 (https://phabricator.wikimedia.org/T395772) [22:17:45] (03PS1) 10RLazarus: kubernetes: Set default Envoy version to 1.29.12 [puppet] - 10https://gerrit.wikimedia.org/r/1191526 (https://phabricator.wikimedia.org/T403663) [22:18:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11217559 (10TLessa-WMF) @Dzahn document signed, thank you so much for your help! @cmadeo for context here, I am trying to be ab... 
[22:21:55] (CR) RLazarus: "(Stacking this up for after the two MW patches in the charts repo, per Depends-On.)" [puppet] - https://gerrit.wikimedia.org/r/1191526 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus) [22:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:49:02] (PS7) Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) [22:52:06] SRE, DNS, Traffic, Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217641 (Krinkle) [22:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:25] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (98%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [23:02:14] SRE, DNS, MediaWiki-Platform-Team, Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217659 (Krinkle) I was originally going to enable unified mobile routing on login.wikimedia.org today, as part of the misc wikimedia.org batch at T403510. Ho... 
[23:03:54] (CR) RLazarus: [C:+2] api-gateway: Update configuration for Envoy 1.29 field deprecations [deployment-charts] - https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus) [23:05:46] (Merged) jenkins-bot: api-gateway: Update configuration for Envoy 1.29 field deprecations [deployment-charts] - https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus) [23:08:12] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [23:08:25] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [23:10:03] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [23:10:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [23:25:44] SRE, SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217708 (EBomani) Hello @Dzahn, got your Email and sent over a verification response. Thanks for getting to that so swiftly :)) @thcipriani and I are going to meet next week for the next... [23:29:08] !log releases2003 - re-enabling puppet which was disabled for debugging T405352 - then the deployment server failover happened and this server didn't get the update what the active deployment server was.. 
which subsequently caused T405646 [23:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:17] T405352: APT error when installing Jenkins package in releases instances - https://phabricator.wikimedia.org/T405352 [23:29:17] T405646: SystemdUnitFailed - rsync on releases2003 - https://phabricator.wikimedia.org/T405646 [23:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1191537 [23:38:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1191537 (owner: TrainBranchBot) [23:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:57:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1191537 (owner: TrainBranchBot) [23:59:20] SRE, SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217753 (Dzahn) @EBomani Thank you. I received your response. We can check that box as well :) I will make a patch tomorrow and next week's clinic duty person can merge it. [23:59:38] SRE, SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217754 (Dzahn) [23:59:47] SRE, SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11217755 (Dzahn) a:EBomani→Dzahn