[17:07:20] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:07:41] (03CR) 10Krinkle: kubernetes: Set default Envoy version to 1.26.8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185984 (https://phabricator.wikimedia.org/T402854) (owner: 10RLazarus) [17:08:18] (03CR) 10JHathaway: [C:03+2] dell: add iDRAC hw_model [software/spicerack] - 10https://gerrit.wikimedia.org/r/1190364 (owner: 10JHathaway) [17:09:08] (03CR) 10RLazarus: [C:03+2] kubernetes: Set default Envoy version to 1.26.8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185984 (https://phabricator.wikimedia.org/T402854) (owner: 10RLazarus) [17:12:20] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:13:23] update: services and traffic have been moved to codfw, we'll be monitoring all the things but do feel free to let us know if you notice anything unusual [17:13:38] (03CR) 10JHathaway: [V:03+2 C:03+2] dell: add iDRAC hw_model [software/spicerack] - 10https://gerrit.wikimedia.org/r/1190364 (owner: 10JHathaway) [17:14:03] (03CR) 10JHathaway: [C:03+2] dell: replace self.idrac_10_min_gen with self.hw_model [software/spicerack] - 10https://gerrit.wikimedia.org/r/1190365 (owner: 10JHathaway) [17:14:19] (03CR) 10JHathaway: redfish: improve log_entries for idrac 10 (031 comment) [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1189518 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [17:18:14] (03CR) 10JHathaway: [V:03+2 C:03+2] dell: replace self.idrac_10_min_gen with self.hw_model [software/spicerack] - 10https://gerrit.wikimedia.org/r/1190365 (owner: 10JHathaway) [17:18:23] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [17:18:43] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [17:20:21] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage [17:22:20] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:24:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage [17:25:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:30:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh 
[17:32:02] (03CR) 10Scott French: "Oh, that's awesome! If there's a script that can do this automatically in the future, that would be a lot more pleasant than doing it by h" [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [17:32:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:32:51] (03PS2) 10Krinkle: varnish: Expand samsung-related test fixtures [puppet] - 10https://gerrit.wikimedia.org/r/1190361 (https://phabricator.wikimedia.org/T405279) [17:33:53] (03CR) 10Scott French: "Thanks, Amir!" [dns] - 10https://gerrit.wikimedia.org/r/1189939 (https://phabricator.wikimedia.org/T399891) (owner: 10Gerrit maintenance bot) [17:34:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:35] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:37:50] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:38:05] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:38:30] (03PS1) 10Majavah: P:toolforge: legacy_redirector: Update redirects for toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1190726 (https://phabricator.wikimedia.org/T271862) [17:39:26] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:41:35] (03CR) 10Cathal Mooney: [C:03+2] Nokia: extract system loopbacks IPs in nokia_asw.py for later [homer/public] - 10https://gerrit.wikimedia.org/r/1190637 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [17:41:35] (03PS2) 10Krinkle: varnish: Avoid "samsung" device token matching "SamsungBrowser" [puppet] - 10https://gerrit.wikimedia.org/r/1190362 
(https://phabricator.wikimedia.org/T405279) [17:42:35] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:42:50] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:42:56] (03Merged) 10jenkins-bot: Nokia: extract system loopbacks IPs in nokia_asw.py for later [homer/public] - 10https://gerrit.wikimedia.org/r/1190637 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [17:43:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bookworm [17:47:30] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [17:47:35] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:49:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11207206 (10phaultfinder) 
[17:52:35] RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [17:52:50] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:54:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:59:26] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high 
(mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:59:37] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:00:05] brennen and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250923T1800). nyaa~ [18:00:11] o/ [18:03:07] !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [18:03:54] 14SRE-Sprint-Week-Sustainability-March2023, 10Sustainability (Incident Followup): Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810#11207294 (10taavi) [18:04:20] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:04:33] (03PS1) 10Cathal Mooney: Nokia: Add DHCP relay and IPv6 RA generation on IRB interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1190732 (https://phabricator.wikimedia.org/T402577) [18:05:00] !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet [18:05:17] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with 
OS bookworm [18:08:01] !log bking@cumin1002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [18:08:03] (03PS2) 10Cathal Mooney: Nokia: Add DHCP relay and IPv6 RA generation on IRB interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1190732 (https://phabricator.wikimedia.org/T402577) [18:08:04] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [18:09:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:09:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11207327 (10phaultfinder) [18:10:03] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[18:14:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:14:26] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:16:28] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:17:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:18:49] (03CR) 10BCornwall: [V:03+2 C:03+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1190361 (https://phabricator.wikimedia.org/T405279) (owner: 10Krinkle) [18:18:50] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:19:09] FIRING: [12x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:09] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:19:17] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:19:28] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:21:04] (03PS1) 10Ebernhardson: cirrus: Send more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190737 [18:22:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:23:50] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:24:50] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate kibana.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:25:30] (03CR) 10CDanis: [C:03+1] cirrus: Send more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190737 (owner: 10Ebernhardson) [18:26:15] (03PS2) 10Ebernhardson: cirrus: Send more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190737 (https://phabricator.wikimedia.org/T405394) [18:26:32] (03PS1) 10Gergő Tisza: Revert "User: Reduce locking severity of ::getInstanceForUpdate()" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1190738 [18:26:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190737 (https://phabricator.wikimedia.org/T405394) (owner: 10Ebernhardson) [18:27:12] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage [18:27:51] (03Merged) 10jenkins-bot: cirrus: Send more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190737 (https://phabricator.wikimedia.org/T405394) (owner: 10Ebernhardson) [18:28:21] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1190737|cirrus: Send more_like traffic to eqiad (T405394)]] [18:28:27] T405394: Point cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394 [18:29:50] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:13] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11207406 (10ATsay-WMF) I approve this request! [18:31:34] (03PS3) 10Krinkle: varnish: Avoid "samsung" device token matching "SamsungBrowser" [puppet] - 10https://gerrit.wikimedia.org/r/1190362 (https://phabricator.wikimedia.org/T405279) [18:32:33] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1190737|cirrus: Send more_like traffic to eqiad (T405394)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:32:44] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage [18:33:56] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [18:34:30] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[18:37:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:38:55] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190737|cirrus: Send more_like traffic to eqiad (T405394)]] (duration: 10m 34s) [18:39:02] T405394: Point cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394 [18:41:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:42:09] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1038 -> bookworm + ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1190708 (owner: 10Andrew Bogott) [18:43:52] (03PS1) 10Aaron Schulz: Switch wgRestSandboxSpecs wmf-restbase entry on testwiki to the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) [18:43:54] (03PS1) 10Aaron Schulz: Switch wgRestSandboxSpecs wmf-restbase entry to the static specs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190743 (https://phabricator.wikimedia.org/T396805) [18:44:39] (03PS2) 10Aaron Schulz: [DNM] Switch wgRestSandboxSpecs wmf-restbase entry on testwiki to the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) [18:49:10] (03CR) 10BCornwall: [V:03+2 C:03+1] "0 tests failed, 0 tests skipped, 40 tests 
passed" [puppet] - 10https://gerrit.wikimedia.org/r/1190362 (https://phabricator.wikimedia.org/T405279) (owner: 10Krinkle) [18:49:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1190738 (owner: 10Gergő Tisza) [18:53:22] 10SRE-tools, 10Spicerack: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397 (10Scott_French) 03NEW [18:53:33] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1038.eqiad.wmnet with OS bookworm [18:54:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:54:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11207493 (10phaultfinder) [18:55:20] (03Merged) 10jenkins-bot: Revert "User: Reduce locking severity of ::getInstanceForUpdate()" [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1190738 (owner: 10Gergő Tisza) [18:55:44] (03CR) 10CDanis: [C:03+1] pyrra: add the Charts SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1190620 (https://phabricator.wikimedia.org/T399613) (owner: 10Elukey) [18:55:46] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1190738|Revert "User: Reduce locking severity of ::getInstanceForUpdate()"]] [18:57:01] FIRING: [2x] ProbeDown: Service wdqs2018:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2018:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:57:29] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bookworm [18:59:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:01:50] !log brennen@deploy1003 brennen, tgr: Backport for [[gerrit:1190738|Revert "User: Reduce locking severity of ::getInstanceForUpdate()"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:02:24] !log brennen@deploy1003 brennen, tgr: Continuing with sync [19:03:37] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11207520 (10Scott_French) For completeness, another option might be to do pretty much exactly what https://gerrit.wiki... 
[19:04:09] (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Expand samsung-related test fixtures [puppet] - 10https://gerrit.wikimedia.org/r/1190361 (https://phabricator.wikimedia.org/T405279) (owner: 10Krinkle) [19:04:11] (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Avoid "samsung" device token matching "SamsungBrowser" [puppet] - 10https://gerrit.wikimedia.org/r/1190362 (https://phabricator.wikimedia.org/T405279) (owner: 10Krinkle) [19:07:12] (03PS1) 10Aaron Schulz: [DNM] rest-gateway: map restbase sandbox URLs to Special:RestSandbox/wmf-restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190753 (https://phabricator.wikimedia.org/T396807) [19:07:22] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190738|Revert "User: Reduce locking severity of ::getInstanceForUpdate()"]] (duration: 11m 36s) [19:08:47] (03PS1) 10Aaron Schulz: [DNM] rest-gateway: migrate /api/rest_v1/ (sandbox) to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) [19:09:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:09:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11207555 (10phaultfinder) [19:14:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:14:37] (03PS3) 10Aaron Schulz: [DNM] Switch wgRestSandboxSpecs wmf-restbase entry on testwiki to the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) [19:14:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11207564 (10phaultfinder) [19:15:00] (03CR) 10CI reject: [V:04-1] [DNM] Switch wgRestSandboxSpecs wmf-restbase entry on testwiki to the static specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [19:18:50] !log brennen Deployed security patch for T405112 [19:19:41] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage [19:23:21] (03PS3) 10Aaron Schulz: [DNM] Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) [19:24:13] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage [19:24:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:29:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:29:50] FIRING: [3x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:29:52] 10ops-codfw, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402 (10phaultfinder) 03NEW [19:32:28] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1039 -> bookworm + ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1190709 (owner: 10Andrew Bogott) [19:34:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11207672 (10phaultfinder) [19:35:23] !log brennen Deployed security patch for T405112 [19:36:31] (03PS1) 10Bking: cirrussearch: re-enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1190756 (https://phabricator.wikimedia.org/T386860) [19:38:14] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190757 (https://phabricator.wikimedia.org/T396381) [19:38:16] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190757 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot) [19:39:09] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190757 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot) [19:42:49] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1039.eqiad.wmnet with OS bookworm [19:44:10] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudcephosd1040.eqiad.wmnet with OS bookworm
[19:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:44:53] ops-codfw, SRE, DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11207695 (phaultfinder)
[19:44:59] (PS1) Krinkle: varnish: Enable unified mobile routing on Wikibooks and Wikiquote [puppet] - https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510)
[19:46:57] (PS1) Andrew Bogott: cloudcephosd1040 -> reef and bookworm [puppet] - https://gerrit.wikimedia.org/r/1190760
[19:47:26] (PS1) Krinkle: Disable wmgUseMdotRouting on Wikibooks and Wikiquote (group1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1190761 (https://phabricator.wikimedia.org/T403510)
[19:47:32] (CR) Andrew Bogott: [C:+2] cloudcephosd1040 -> reef and bookworm [puppet] - https://gerrit.wikimedia.org/r/1190760 (owner: Andrew Bogott)
[19:50:14] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.20 refs T396381
[19:50:20] T396381: 1.45.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T396381
[19:50:43] (PS4) Vgutierrez: thanos: Add recording rules for xlab SLOs [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869)
[19:53:02] (PS1) Andrew Bogott: cloudcephosd1045 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190763
[19:53:02] (PS1) Andrew Bogott: cloudcephosd1046 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190764
[19:53:02] (PS1) Andrew Bogott: cloudcephosd1047 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190765
[19:53:03] (PS1) Andrew Bogott: cloudcephosd1048 ->
bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190766
[19:53:04] (PS1) Andrew Bogott: cloudcephosd1049 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190767
[19:53:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[19:54:51] ops-codfw, DC-Ops: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405403 (phaultfinder) NEW
[19:54:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11207713 (phaultfinder)
[20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250923T2000).
[20:00:05] tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:03:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:05:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:06:20] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage
[20:07:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:08:44] I'll self-deploy
[20:09:27] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage
[20:10:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:11:25] (PS5) Dr0ptp4kt: thanos: Add recording rules for
xlab SLOs [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez)
[20:12:00] RESOLVED: [2x] ProbeDown: Service wdqs2018:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:12:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:13:24] (Abandoned) Andrew Bogott: cloudcephosd1038: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167917 (https://phabricator.wikimedia.org/T396651) (owner: Andrew Bogott)
[20:13:30] (Abandoned) Andrew Bogott: cloudcephosd1039: update nic names for Bookworm.
[puppet] - https://gerrit.wikimedia.org/r/1167918 (https://phabricator.wikimedia.org/T396651) (owner: Andrew Bogott)
[20:13:58] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1190712 (https://phabricator.wikimedia.org/T399243) (owner: Gergő Tisza)
[20:13:59] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1190713 (https://phabricator.wikimedia.org/T399243) (owner: Gergő Tisza)
[20:14:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11207803 (phaultfinder)
[20:15:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:16:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:20:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:25:11] ops-codfw, SRE, DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit -
https://phabricator.wikimedia.org/T405376#11207811 (phaultfinder)
[20:27:14] (Merged) jenkins-bot: session: Fix date handling for JWT cookies [core] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1190712 (https://phabricator.wikimedia.org/T399243) (owner: Gergő Tisza)
[20:27:21] (Merged) jenkins-bot: session: Fix date handling for JWT cookies [core] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1190713 (https://phabricator.wikimedia.org/T399243) (owner: Gergő Tisza)
[20:28:17] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1190712|session: Fix date handling for JWT cookies (T399243 T399200)]], [[gerrit:1190713|session: Fix date handling for JWT cookies (T399243 T399200)]]
[20:28:32] T399243: Support JWT generation for session tokens in MediaWiki core - https://phabricator.wikimedia.org/T399243
[20:28:33] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200
[20:29:16] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bookworm
[20:31:27] ops-codfw, Data-Platform-SRE, DC-Ops: Q2:rack/setup/install dse-k8s-worker200[4-5] - https://phabricator.wikimedia.org/T405406 (RobH) NEW
[20:31:49] ops-codfw, Data-Platform-SRE, DC-Ops: Q2:rack/setup/install dse-k8s-worker200[4-5] - https://phabricator.wikimedia.org/T405406#11207844 (RobH)
[20:31:57] ops-codfw, Data-Platform-SRE, DC-Ops: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11207845 (RobH)
[20:32:02] (CR) Andrew Bogott: [C:+2] cloudcephosd1045 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190763 (owner: Andrew Bogott)
[20:32:24] ops-codfw, Data-Platform-SRE, DC-Ops: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11207849 (RobH)
[20:32:52] ops-codfw, Data-Platform-SRE, DC-Ops:
Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11207854 (RobH) a:BTullis @btullis, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operation...
[20:34:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11207863 (phaultfinder)
[20:35:07] andrew@cumin2002 reimage (PID 3337329) is awaiting input
[20:35:45] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[20:36:24] (CR) BCornwall: varnish: Enable unified mobile routing on Wikibooks and Wikiquote (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[20:38:12] (CR) Scott French: [C:+1] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1190376 (https://phabricator.wikimedia.org/T402584) (owner: RLazarus)
[20:39:20] SRE-OnFire, Cloud-VPS, cloud-services-team (FY2025/26-Q1), Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11207876 (Andrew) In progress→Resolved I am now upgrading the cluster to Bookworm + Reef (18.x) and that...
[20:39:36] (CR) Krinkle: varnish: Enable unified mobile routing on Wikibooks and Wikiquote (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[20:41:43] (CR) BCornwall: varnish: Enable unified mobile routing on Wikibooks and Wikiquote (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[20:44:06] (PS6) Dr0ptp4kt: thanos: Add recording rules for xlab SLOs [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez)
[20:44:55] (PS2) RLazarus: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1190376 (https://phabricator.wikimedia.org/T403663)
[20:44:57] (PS2) RLazarus: api-gateway: Update configuration for Envoy 1.29 field deprecations [deployment-charts] - https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036)
[20:45:08] (CR) CI reject: [V:-1] api-gateway: Update configuration for Envoy 1.29 field deprecations [deployment-charts] - https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[20:47:00] (CR) Krinkle: varnish: Enable unified mobile routing on Wikibooks and Wikiquote (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[20:47:11] (PS2) Scott French: dnsdisc: set a timeout on udp_with_fallback [software/spicerack] - https://gerrit.wikimedia.org/r/1190770 (https://phabricator.wikimedia.org/T405397)
[20:47:31] (PS2) Krinkle: varnish: Enable unified mobile routing on Wikibooks and Wikiquote [puppet] - https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510)
[20:49:47] (PS3) Krinkle: varnish: Enable unified mobile routing on Wikibooks and Wikiquote [puppet] -
https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510)
[20:51:32] (CR) BCornwall: [V:+2 C:+2] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - https://gerrit.wikimedia.org/r/1190759 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[20:52:04] (CR) Volans: [C:+1] "LGTM, couple of typos in comments/docs inline" [software/spicerack] - https://gerrit.wikimedia.org/r/1190770 (https://phabricator.wikimedia.org/T405397) (owner: Scott French)
[20:53:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:55:42] !log tgr@deploy1003 tgr: Backport for [[gerrit:1190712|session: Fix date handling for JWT cookies (T399243 T399200)]], [[gerrit:1190713|session: Fix date handling for JWT cookies (T399243 T399200)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:55:49] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1045.eqiad.wmnet with reason: host reimage
[20:55:50] T399243: Support JWT generation for session tokens in MediaWiki core - https://phabricator.wikimedia.org/T399243
[20:55:51] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200
[20:57:01] SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11207972 (JVanderhoop-WMF)
[20:57:09] (CR) RLazarus: "Hm, I was going to go ahead and deploy this since it's a no-op for the ongoing T401396 work -- but since those patches are merged but not " [deployment-charts] - https://gerrit.wikimedia.org/r/1190376 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[20:57:16] SRE, DNS, FR-donorrelations, Traffic: Custom URL for survey pop-up - https://phabricator.wikimedia.org/T400278#11207973 (BCornwall) Open→Stalled
[20:57:22] !log tgr@deploy1003 tgr: Continuing with sync
[20:59:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1045.eqiad.wmnet with reason: host reimage
[21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250923T2100)
[21:03:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:06:07] (PS3) Scott French: dnsdisc: set a timeout on udp_with_fallback [software/spicerack] - https://gerrit.wikimedia.org/r/1190770 (https://phabricator.wikimedia.org/T405397)
[21:06:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th
percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[21:06:49] (CR) Volans: [C:+1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/1190770 (https://phabricator.wikimedia.org/T405397) (owner: Scott French)
[21:07:00] (CR) Scott French: "Thank you very much! Not sure how I managed to make quite that many typos in such a small patch :)" [software/spicerack] - https://gerrit.wikimedia.org/r/1190770 (https://phabricator.wikimedia.org/T405397) (owner: Scott French)
[21:08:15] (CR) RLazarus: "Meant to say, merged but not deployed *to api-gateway. (The server_header_transformation change is the only one that appears in the diff, " [deployment-charts] - https://gerrit.wikimedia.org/r/1190376 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[21:09:53] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11207997 (phaultfinder)
[21:10:09] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190712|session: Fix date handling for JWT cookies (T399243 T399200)]], [[gerrit:1190713|session: Fix date handling for JWT cookies (T399243 T399200)]] (duration: 41m 51s)
[21:10:17] T399243: Support JWT generation for session tokens in MediaWiki core - https://phabricator.wikimedia.org/T399243
[21:10:18] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200
[21:11:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring -
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[21:11:47] !log UTC late deploys done
[21:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:31] (PS3) Bking: cirrussearch: re-enable CPU performance governor [puppet] - https://gerrit.wikimedia.org/r/1190756 (https://phabricator.wikimedia.org/T386860)
[21:15:05] ops-codfw, SRE, DC-Ops: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405403#11208009 (phaultfinder)
[21:17:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[21:18:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1046.eqiad.wmnet with OS bookworm
[21:19:02] (CR) Andrew Bogott: [C:+2] cloudcephosd1046 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190764 (owner: Andrew Bogott)
[21:20:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[21:22:05] SRE, DNS, Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11208025 (CDobbins) @MoritzMuehlenhoff I did some digging into what implementing this would entail. There are a couple of concerns, I think: * The [[ https://doc.powerdns.co...
[21:25:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[21:37:25] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11208037 (Scott_French) While we should not need this for Day 2 of the switchover tomorrow (se...
[21:38:05] SRE-tools, Infrastructure-Foundations, serviceops, Spicerack, Patch-For-Review: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11208038 (Scott_French) p:Triage→Medium
[21:38:38] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1046.eqiad.wmnet with reason: host reimage
[21:40:00] ops-codfw, SRE, DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11208040 (phaultfinder)
[21:43:00] (CR) Ryan Kemper: [C:+1] cirrussearch: re-enable CPU performance governor [puppet] - https://gerrit.wikimedia.org/r/1190756 (https://phabricator.wikimedia.org/T386860) (owner: Bking)
[21:43:09] (CR) Bking: [C:+2] cirrussearch: re-enable CPU performance governor [puppet] - https://gerrit.wikimedia.org/r/1190756 (https://phabricator.wikimedia.org/T386860) (owner: Bking)
[21:44:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1046.eqiad.wmnet with reason: host reimage
[21:44:09] FIRING: [3x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad -
https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[21:46:52] PROBLEM - ensure kvm processes are running on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:47:52] RECOVERY - ensure kvm processes are running on cloudvirt1057 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:49:06] (CR) Ladsgroup: "x2 doesn't exist anymore. We should fully remove it." [dns] - https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: Jasmine)
[21:54:53] ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11208069 (phaultfinder)
[21:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:00] ops-codfw, SRE, DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11208070 (phaultfinder)
[22:03:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1046.eqiad.wmnet with OS bookworm
[22:09:59] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208096 (phaultfinder)
[22:14:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11208101 (phaultfinder)
[22:16:28] FIRING: [2x] NodeTextfileStale: Stale
textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:19:09] FIRING: [12x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:19:28] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:19:50] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:24:50] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate kibana.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:29:50] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet -
PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208153 (phaultfinder)
[22:32:40] (CR) Herron: [C:+1] pyrra: add the Charts SLOs [puppet] - https://gerrit.wikimedia.org/r/1190620 (https://phabricator.wikimedia.org/T399613) (owner: Elukey)
[22:37:33] (PS13) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[22:38:00] (CR) CI reject: [V:-1] Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[22:38:58] SRE, LDAP-Access-Requests: Grant Access to wmf group for LMorgantini - https://phabricator.wikimedia.org/T405405#11208173 (Aklapper) Open→Invalid Hi and welcome! I am curious which docs you followed to end up filing a task. They may welcome updating. :) Please see https://phabricator.wikimedi...
[22:41:29] (PS14) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[22:43:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:48:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:50:34] (PS1) RLazarus: envoyproxy, services_proxy: Upgrade configuration for Envoy 1.29 [puppet] - https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036)
[22:52:35] (CR) CI
reject: [V:-1] envoyproxy, services_proxy: Upgrade configuration for Envoy 1.29 [puppet] - https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[22:54:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:56:47] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Revision-deletion, and 2 others: Revision deletion on image files is excessively slow - https://phabricator.wikimedia.org/T403572#11208206 (HCoplin-WMF) p:Triage→Low You're right that this excessively slow. That being said,...
[22:57:46] (PS2) RLazarus: envoyproxy, services_proxy: Upgrade configuration for Envoy 1.29 [puppet] - https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036)
[22:59:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:04:39] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[23:04:43] (CR) Andrew Bogott: [C:+2] cloudcephosd1047 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190765 (owner: Andrew Bogott)
[23:06:59] (CR) RLazarus: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[23:08:03] (CR)
Jasmine: "Hi folks, much apologies about this, but we may need to remove the backport window directly adjacent to the Mediawiki switchover today. Ca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: Jforrester)
[23:09:25] (CR) Jasmine: "Hi folks, much apologies about this, but we may need to remove the backport window directly adjacent to the Mediawiki switchover today. Ca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1190684 (https://phabricator.wikimedia.org/T403613) (owner: Matthias Mullie)
[23:10:04] (CR) Jasmine: "Hi folks, much apologies about this, but we may need to remove the backport window directly adjacent to the Mediawiki switchover today. Ca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: Marco Fossati)
[23:24:24] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage
[23:27:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage
[23:30:39] (PS1) Krinkle: beta: Pool deployment-poolcounter07 [mediawiki-config] - https://gerrit.wikimedia.org/r/1190796 (https://phabricator.wikimedia.org/T380881)
[23:33:27] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1190796 (https://phabricator.wikimedia.org/T380881) (owner: Krinkle)
[23:34:18] (Merged) jenkins-bot: beta: Pool deployment-poolcounter07 [mediawiki-config] - https://gerrit.wikimedia.org/r/1190796 (https://phabricator.wikimedia.org/T380881) (owner: Krinkle)
[23:35:04] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208385 (phaultfinder)
[23:37:51] (CR) TrainBranchBot:
[C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190761 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:38:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1190799 [23:38:18] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1190799 (owner: 10TrainBranchBot) [23:38:42] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on Wikibooks and Wikiquote (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190761 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [23:39:10] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1190761|Disable wmgUseMdotRouting on Wikibooks and Wikiquote (group1) (T403510)]] [23:39:16] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:45:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1047.eqiad.wmnet with OS bookworm [23:47:10] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1190761|Disable wmgUseMdotRouting on Wikibooks and Wikiquote (group1) (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:47:16] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [23:47:52] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bookworm [23:48:06] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1048 -> bookworm + ceph 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1190766 (owner: 10Andrew Bogott) [23:49:01] !log krinkle@deploy1003 krinkle: Continuing with sync [23:54:24] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1190799 (owner: 10TrainBranchBot) [23:55:49] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190761|Disable wmgUseMdotRouting on Wikibooks and Wikiquote (group1) (T403510)]] (duration: 16m 39s) [23:55:55] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510