[00:04:09] FIRING: [12x] SystemdUnitFailed: wmf_auto_restart_rsyslog.service on cirrussearch1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1190803
[00:08:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1190803 (owner: TrainBranchBot)
[00:16:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:31:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1190803 (owner: TrainBranchBot)
[00:32:39] (PS3) RLazarus: envoyproxy, services_proxy: Update configuration for Envoy 1.29 [puppet] - https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036)
[00:33:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:40:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 208.80.153.205 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:40:56] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[00:43:27] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:44:18] (CR) RLazarus: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7030/co" [puppet] - https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:44:51] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[00:45:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 208.80.153.205 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:58:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bookworm
[01:00:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:00:56] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:01:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[01:05:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[01:05:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:06:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[01:10:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:12:55] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 58s)
[01:15:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[01:17:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage
for host cloudcephosd1049.eqiad.wmnet with OS bookworm
[01:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:38] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
[01:40:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
[01:44:50] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[01:50:26] (CR) Andrew Bogott: [C:+2] cloudcephosd1049 -> bookworm + ceph 'reef' [puppet] - https://gerrit.wikimedia.org/r/1190767 (owner: Andrew Bogott)
[01:51:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:57:21] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bookworm
[01:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:15:00] ops-codfw, SRE, DC-Ops: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405403#11208734 (phaultfinder)
[02:16:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:19:43] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:19:58] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:23:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:24:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[02:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:26:11] (PS3) RLazarus: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1190376 (https://phabricator.wikimedia.org/T403663)
[02:26:11] (PS3) RLazarus: api-gateway: Update configuration for Envoy 1.29 field deprecations [deployment-charts] - https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036)
[02:29:50] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:30:00] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208741 (phaultfinder)
[02:38:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:39:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[02:49:59] ops-codfw, SRE, DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit -
https://phabricator.wikimedia.org/T405380#11208756 (phaultfinder)
[02:53:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:58:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[03:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[03:09:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[03:13:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[03:23:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[03:24:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[03:35:04] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208782 (phaultfinder)
[03:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:07:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[04:08:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[04:12:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[04:14:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:14:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208801 (phaultfinder)
[04:19:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:32:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch
comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[04:33:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[04:35:10] ops-codfw, SRE, DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11208805 (phaultfinder)
[04:35:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:52] ops-codfw, SRE, DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11208807 (phaultfinder)
[04:40:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:41:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:51:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11208821 (phaultfinder)
[04:54:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:57:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:58:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[04:59:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[04:59:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208836 (phaultfinder)
[04:59:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:02:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:03:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[05:09:09] FIRING: [2x]
JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:23:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[05:24:53] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208846 (phaultfinder)
[05:28:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[05:29:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[05:39:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:57] ops-codfw, SRE, DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11208850 (phaultfinder)
[05:44:50] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[05:44:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208861 (phaultfinder)
[05:50:00] ops-codfw, SRE, DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11208874 (phaultfinder)
[05:50:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[05:54:52] ops-codfw, SRE, DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11208875 (phaultfinder)
[05:55:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[05:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:00] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208877 (phaultfinder)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T0600)
[06:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.72% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:10:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[06:12:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[06:13:08] (CR) Filippo Giunchedi: [C:+1] P:toolforge: legacy_redirector: Update redirects for toolforge.org [puppet] - https://gerrit.wikimedia.org/r/1190726 (https://phabricator.wikimedia.org/T271862) (owner: Majavah)
[06:13:23] (CR) Filippo Giunchedi: [C:+2] cloudnfs: install python3-netifaces [puppet] - https://gerrit.wikimedia.org/r/1190662 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[06:13:45] (CR) Filippo Giunchedi: [C:+2] cloudnfs: add Trixie support [puppet] - https://gerrit.wikimedia.org/r/1190666 (https://phabricator.wikimedia.org/T404584) (owner: Filippo Giunchedi)
[06:16:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:16:45] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:17:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[06:19:43] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:19:51] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:19:58] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:24:05] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[06:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:28:50] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[06:29:50] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[06:33:50] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[06:34:32] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1190684 (https://phabricator.wikimedia.org/T403613) (owner: Matthias Mullie)
[06:36:00] SRE, Infrastructure-Foundations, Mail, Wikimedia-Mailing-lists: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#11208956 (ABran-WMF)
[06:37:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[06:39:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[06:46:51] (PS1) Muehlenhoff: Enable OSM and waterlines sync on maps1011 [puppet] - https://gerrit.wikimedia.org/r/1190826 (https://phabricator.wikimedia.org/T381565)
[06:48:27] jouncebot: nowandnext
[06:48:27] For the next 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T0600)
[06:48:27] In 0 hour(s) and 11 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T0700)
[06:49:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet
[06:49:57] ops-codfw, SRE, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11208983 (phaultfinder)
[06:50:21] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1190184 (https://phabricator.wikimedia.org/T402366) (owner: Kosta Harlan)
[06:51:14] (Merged) jenkins-bot: hCaptcha: Enable account creation trial on phase 2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1190184 (https://phabricator.wikimedia.org/T402366) (owner: Kosta Harlan)
[06:51:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet
[06:52:00] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1190184|hCaptcha: Enable account creation trial on phase 2 wikis (T402366)]]
[06:52:06] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366
[06:58:41] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1190184|hCaptcha: Enable account creation trial on phase 2 wikis (T402366)]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [06:58:47] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:00:06] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T0700). [07:00:06] kostajh, sergi0, and matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] hi, I'm deploying my config patch at the moment [07:01:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [07:02:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet [07:04:26] o/ [07:04:51] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:04:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11208989 (10phaultfinder) [07:06:16] !log kharlan@deploy1003 kharlan: Continuing with sync [07:07:12] (03PS1) 10Muehlenhoff: Add maps2012-maps2014 as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1190923 (https://phabricator.wikimedia.org/T381565) [07:08:50] Oh I was wondering - can we deploy patches pretty much any time with spiderpig now, or is it still preferred to stick to these backport windows? 
[07:11:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [07:11:07] matthiasmullie: aiui, as long as no one else is deploying, you can use time outside of these windows [07:11:26] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190184|hCaptcha: Enable account creation trial on phase 2 wikis (T402366)]] (duration: 19m 26s) [07:11:32] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:13:49] @kostajh right; pretty much same as it was then - thanks! 
[07:14:01] matthiasmullie: I'm done syncing, so over to you [07:15:29] (03PS2) 10Elukey: redfish: improve log_entries for idrac 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189518 (https://phabricator.wikimedia.org/T392851) [07:15:33] Thanks - starting [07:15:39] (03CR) 10Elukey: redfish: improve log_entries for idrac 10 (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189518 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [07:15:45] (03PS2) 10Matthias Mullie: Add MediaSearch custommatch:linked_from keyword [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190684 (https://phabricator.wikimedia.org/T403613) [07:15:52] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190684 (https://phabricator.wikimedia.org/T403613) (owner: 10Matthias Mullie) [07:16:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [07:16:59] (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: add the Charts SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1190620 (https://phabricator.wikimedia.org/T399613) (owner: 10Elukey) [07:17:09] (03CR) 10TrainBranchBot: "Approved by mlitn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190684 (https://phabricator.wikimedia.org/T403613) (owner: 10Matthias Mullie) [07:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:17:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [07:18:01] (03Merged) 10jenkins-bot: Add MediaSearch custommatch:linked_from keyword [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190684 (https://phabricator.wikimedia.org/T403613) (owner: 10Matthias Mullie) [07:18:32] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1190684|Add MediaSearch custommatch:linked_from keyword (T403613)]] [07:18:39] T403613: Image Browsing: build the API endpoint for images from other wikis - https://phabricator.wikimedia.org/T403613 [07:18:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:19:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [07:20:36] (03CR) 10Majavah: [C:03+2] P:toolforge: legacy_redirector: Update redirects for toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/1190726 
(https://phabricator.wikimedia.org/T271862) (owner: 10Majavah) [07:23:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:25:49] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1190684|Add MediaSearch custommatch:linked_from keyword (T403613)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:25:55] T403613: Image Browsing: build the API endpoint for images from other wikis - https://phabricator.wikimedia.org/T403613 [07:26:46] !log mlitn@deploy1003 mlitn: Continuing with sync [07:28:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:31:37] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190684|Add MediaSearch custommatch:linked_from keyword (T403613)]] (duration: 13m 04s) [07:31:42] T403613: Image Browsing: build the API endpoint for images from other wikis - https://phabricator.wikimedia.org/T403613 [07:31:59] sergi0: I'm done syncing - over to you [07:33:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet [07:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:35:10] (03PS1) 10Stevemunene: airflow: Setup an instance for wikidata platform team [puppet] - 10https://gerrit.wikimedia.org/r/1190968 (https://phabricator.wikimedia.org/T404073) [07:37:48] (03CR) 10Elukey: [C:03+1] Add maps2012-maps2014 as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1190923 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:38:45] jmm@cumin2002 drain-node (PID 3670932) is awaiting input [07:41:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet [07:41:32] (03CR) 10Elukey: [C:03+1] Nokia: Add DHCP relay and IPv6 RA generation on IRB interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1190732 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [07:41:47] (03CR) 10Elukey: [C:03+2] redfish: improve log_entries for idrac 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189518 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [07:41:48] (03CR) 10Muehlenhoff: [C:03+2] Enable OSM and waterlines sync on maps1011 [puppet] - 10https://gerrit.wikimedia.org/r/1190826 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:42:22] (03CR) 10Arnaudb: [C:03+1] devtools: clean gitlab hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1190690 (https://phabricator.wikimedia.org/T390948) (owner: 10Jelto) [07:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:44:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11209081 (10phaultfinder) [07:44:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11209080 (10phaultfinder) [07:46:09] (03CR) 10Arnaudb: [C:03+1] gitlab: bump gitlab-settings to v1.12.0 [puppet] - 10https://gerrit.wikimedia.org/r/1190701 (owner: 10Jelto) [07:47:28] (03CR) 10Elukey: [C:03+2] profile::amd_gpu: refactor the class to allow more use cases [puppet] - 10https://gerrit.wikimedia.org/r/1190201 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [07:49:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet [07:49:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet [07:50:45] (03PS1) 10Stevemunene: Add airflow-wikidata namespace in admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190974 (https://phabricator.wikimedia.org/T404073) [07:50:47] (03PS1) 10Stevemunene: Define airflow-wikidata PG cluster and airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) [07:51:50] (03PS3) 10Slyngshede: P:netbox new read-only group for Netbox-Next [puppet] - 10https://gerrit.wikimedia.org/r/1190575 (https://phabricator.wikimedia.org/T404494) [07:55:03] (03PS1) 10Stevemunene: dns: provision airflow-wikidata domain [dns] - 10https://gerrit.wikimedia.org/r/1190977 (https://phabricator.wikimedia.org/T404073) [07:58:39] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [07:58:45] (03PS1) 
10Stevemunene: idp: Register airflow-wikidata IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1190979 (https://phabricator.wikimedia.org/T404073) [08:03:43] fceratto@cumin1002 sanitize-wiki (PID 576367) is awaiting input [08:04:29] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:04:49] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209117 (10phaultfinder) [08:05:02] (03CR) 10Jelto: [C:03+2] gitlab: bump gitlab-settings to v1.12.0 [puppet] - 10https://gerrit.wikimedia.org/r/1190701 (owner: 10Jelto) [08:08:59] (03CR) 10Jelto: [C:03+2] devtools: clean gitlab hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1190690 (https://phabricator.wikimedia.org/T390948) (owner: 10Jelto) [08:09:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet [08:10:42] (03PS2) 10Elukey: Decrease the Pyrra window to 4w for AW's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) [08:11:09] (03CR) 10Brouberol: [C:03+1] airflow: Setup an instance for wikidata platform team [puppet] - 10https://gerrit.wikimedia.org/r/1190968 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:11:19] (03CR) 10Brouberol: [C:03+1] dns: provision airflow-wikidata domain [dns] - 
10https://gerrit.wikimedia.org/r/1190977 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:11:25] (03PS3) 10Elukey: Decrease the Pyrra window to 4w for AW's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) [08:11:32] (03CR) 10Brouberol: [C:03+1] Add airflow-wikidata namespace in admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190974 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:12:26] (03CR) 10Brouberol: [C:03+1] "Don't forget to provision the associated secret, as mentioned in https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kuberne" [puppet] - 10https://gerrit.wikimedia.org/r/1190979 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:13:35] jmm@cumin2002 drain-node (PID 3688067) is awaiting input [08:13:38] !log VACUUM large container dbs on ms-be1066 T377827 [08:13:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet [08:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:44] T377827: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827 [08:14:14] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum [08:14:29] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:16:45] (03CR) 10Elukey: [C:03+2] Add a new insetup role for ML GPU hosts [puppet] - 10https://gerrit.wikimedia.org/r/1190585 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [08:16:52] (03CR) 10Elukey: [V:03+1 C:03+2] Enable ROCm 6.4.3 amd-smi on ml-serve{1012,1013} [puppet] - 10https://gerrit.wikimedia.org/r/1189816 
(https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [08:16:52] (03CR) 10Brouberol: Define airflow-wikidata PG cluster and airflow instance (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:21:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet [08:21:40] (03CR) 10Jforrester: [C:03+1] Decrease the Pyrra window to 4w for AW's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) (owner: 10Elukey) [08:21:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet [08:23:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [08:23:32] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [08:23:45] (03PS4) 10Slyngshede: P:netbox new read-only group for Netbox-Next [puppet] - 10https://gerrit.wikimedia.org/r/1190575 (https://phabricator.wikimedia.org/T404494) [08:24:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3008.esams.wmnet [08:24:44] (03PS5) 10Slyngshede: P:netbox new read-only group for Netbox-Next [puppet] - 10https://gerrit.wikimedia.org/r/1190575 
(https://phabricator.wikimedia.org/T404494) [08:24:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11209152 (10phaultfinder) [08:24:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11209153 (10phaultfinder) [08:26:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet [08:29:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209164 (10phaultfinder) [08:33:49] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS trixie [08:36:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet [08:37:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3008.esams.wmnet [08:38:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [08:38:26] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [08:40:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11209195 (10phaultfinder) [08:40:42] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11209198 (10MatthewVernon) ms-be1066 alerted again today, with `/dev/sda3` 96% full. Ran the previous VACUUM rune again, now back to 89... [08:40:56] !log failover Ganeti master in magru to ganeti3005 [08:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:03] !log failover Ganeti master in esams to ganeti3005 [08:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [08:44:22] PROBLEM - ganeti-wconfd running on ganeti3006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:44:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11209228 (10phaultfinder) [08:45:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [08:46:29] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [08:51:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [08:52:12] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11209231 (10elukey) Hi folks! I created two dashboards for charts: Rolling windows: https://slo.wikimedia.org/?search=charts Calendar windows: * https://grafana-rw.wikimedia.org/d/ccssRIenz/slo-quarter... [08:52:14] (03CR) 10Cathal Mooney: [C:03+2] Nokia: Add DHCP relay and IPv6 RA generation on IRB interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1190732 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:53:25] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:34] (03Merged) 10jenkins-bot: Nokia: Add DHCP relay and IPv6 RA generation on IRB interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1190732 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [08:54:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [08:56:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet [08:57:49] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [08:59:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1095 [08:59:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1095 [09:00:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1190575 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [09:00:42] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [09:01:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [09:04:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:04:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Decommission the 46 hadoop workers and 2 namenode servers that were planned for the hadoop-backup cluster - https://phabricator.wikimedia.org/T404970#11209249 (10Jclark-ctr) [09:05:07] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor 
over limit - https://phabricator.wikimedia.org/T405377#11209252 (10phaultfinder) [09:05:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [09:05:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Decommission the 46 hadoop workers and 2 namenode servers that were planned for the hadoop-backup cluster - https://phabricator.wikimedia.org/T404970#11209253 (10Jclark-ctr) 05Open→03Resolved [09:06:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [09:06:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet [09:06:28] jclark@cumin1002 netbox (PID 690928) is awaiting input [09:06:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:08:38] 10ops-eqiad, 06SRE, 10Ceph, 06cloud-services-team, and 2 others: cloudcephosd1025 won't reimage - https://phabricator.wikimedia.org/T405258#11209262 (10Jclark-ctr) 05Open→03Resolved a:05dcaro→03Jclark-ctr Server completed Reimage by andrew [09:10:08] !log installing qemu security updates [09:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:59] (03PS6) 10Slyngshede: P:netbox new read-only group for Netbox-Next [puppet] - 10https://gerrit.wikimedia.org/r/1190575 (https://phabricator.wikimedia.org/T404494) [09:13:25] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:52] (03CR) 10Slyngshede: [C:03+2] P:netbox new read-only group for Netbox-Next [puppet] - 10https://gerrit.wikimedia.org/r/1190575 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [09:15:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. After merging, please also add the group at https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access" [puppet] - 10https://gerrit.wikimedia.org/r/1190575 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [09:15:55] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1013.eqiad.wmnet with OS trixie [09:17:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [09:17:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:18:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:18:55] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [09:19:56] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[09:22:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [09:29:13] (03PS7) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [09:29:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209290 (10phaultfinder) [09:30:48] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11209293 (10MoritzMuehlenhoff) [09:33:17] !log upgrading perf on bookworm nodes to 6.1.153 [09:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:38] (03PS1) 10Cathal Mooney: Nokia: ESI-LAG configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) [09:35:13] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11209303 (10phaultfinder) [09:38:38] (03PS2) 10Cathal Mooney: Nokia: ESI-LAG configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) [09:41:25] (03PS3) 10Cathal Mooney: Nokia: ESI-LAG configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) [09:42:47] (03PS4) 10Cathal Mooney: Nokia: ESI-LAG configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) [09:44:50] FIRING: [2x] OsmSynchronisationLag: Maps - OSM 
synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [09:45:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet [09:46:24] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209323 (10fgiunchedi) I can confirm this is still the case, `Profil... [09:49:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209341 (10phaultfinder) [09:50:01] jmm@cumin2002 drain-node (PID 3744199) is awaiting input [09:52:25] ml-staging-etcd2003 will go down for a Ganeti node reboot [09:52:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet [09:53:56] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [09:54:42] PROBLEM - Host ml-staging-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:56:10] RECOVERY - Host ml-staging-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.80 ms [09:57:36] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [09:57:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet [09:58:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet [09:58:07] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [09:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T1000) [10:00:14] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:00:41] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [10:00:52] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:01:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [10:01:42] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:01:45] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209379 (10taavi) My understanding is that the reason for this is th... 
[10:02:07] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:03:57] (03PS1) 10Elukey: profile::amd_gpu: add libdrm-amdgpu1 when running Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1190989 (https://phabricator.wikimedia.org/T403697) [10:04:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [10:04:42] (03CR) 10Klausman: [C:03+1] profile::amd_gpu: add libdrm-amdgpu1 when running Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1190989 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [10:05:17] (03CR) 10Clément Goubert: [C:03+2] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190376 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [10:05:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [10:07:07] (03Merged) 10jenkins-bot: {api,rest}-gateway: Upgrade to Envoy 1.29.12 in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190376 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [10:07:48] (03CR) 10Elukey: [C:03+2] profile::amd_gpu: add libdrm-amdgpu1 when running Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1190989 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [10:09:55] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:10:09] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:10:45] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:10:55] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [10:11:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [10:11:06] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:12:37] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:12:57] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:14:25] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:14:41] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:16:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:17:23] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:17:33] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:19:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:19:54] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:20:39] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:21:02] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:22:19] (03PS1) 10Kosta Harlan: WIP hCaptcha: Define configuration for A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) [10:22:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [10:22:32] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [10:23:26] (03CR) 10Kosta Harlan: "@phuedx@wikimedia.org @mpopov@wikimedia.org does the ExperimentManager code look correct to you?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [10:23:34] !log Upgraded envoy to v1.29.12 on api-gateway and rest-gateway - T403663 [10:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:40] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [10:24:42] (03CR) 10Clément Goubert: [C:03+1] "Will deploy after the switchover" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190377 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [10:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:27:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [10:27:32] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [10:28:12] (03CR) 10Marco Fossati: "Thanks for the heads-up @jkilani@wikimedia.org. 
Would it be fine if I deployed outside the calendar instead, i.e., when no other deploymen" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [10:28:15] (03CR) 10Reedy: "I know this is WIP, but frwiki also needs to be in wmgEnableHCaptcha so that the hCaptcha extension is loaded (parent patch or in this one" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [10:29:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11209437 (10phaultfinder) [10:29:54] (03CR) 10Kosta Harlan: "Thanks, yes I just wanted to illustrate how the Metrics Platform code would integrate with the new hook for ConfirmEdit, but might as well" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [10:29:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209438 (10phaultfinder) [10:30:22] (03PS1) 10Esanders: DbFactory: Use primary DB when running maintenance scripts [extensions/Flow] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190994 (https://phabricator.wikimedia.org/T405080) [10:30:37] (03PS1) 10Esanders: DbFactory: Use primary DB when running maintenance scripts [extensions/Flow] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1190995 (https://phabricator.wikimedia.org/T405080) [10:31:06] (03PS2) 10Kosta Harlan: WIP hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) [10:31:09] (03CR) 10Elukey: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 
10Vgutierrez) [10:32:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Flow] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190994 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [10:33:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Flow] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1190995 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [10:34:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [10:35:07] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11209446 (10phaultfinder) [10:35:47] Hi folks, just wanted to confirm whether it's fine if I deploy a config patch (https://gerrit.wikimedia.org/r/1187413) today when no other deployments are happening. It was scheduled for today's UTC afternoon backport window, which got canceled. [10:42:37] (03CR) 10Ozge: [C:03+1] modules/admin: Add Özge and Bartosz to the ml-lab-users group [puppet] - 10https://gerrit.wikimedia.org/r/1190682 (owner: 10Klausman) [10:43:21] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11209467 (10MatthewVernon) Does `/boot` even need to be on a separate partition for UEFI booting? 
[10:44:27] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [10:44:36] jouncebot: nowandnext [10:44:36] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T1000) [10:44:36] In 0 hour(s) and 15 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T1100) [10:45:24] mfossati: Yeah, it's usually not an issue if there are SREs around [10:45:52] cool, thanks! [10:46:48] (03CR) 10Clément Goubert: [C:03+1] trafficserver: set test2wiki to be 50% served by the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1190696 (https://phabricator.wikimedia.org/T405367) (owner: 10Hnowlan) [10:55:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11209491 (10phaultfinder) [10:55:31] @Reedy FYI I plan to deploy after Services and before DC Switchover. [10:56:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet [10:57:31] (03PS1) 10Slyngshede: P:idp swap nda for netbox-readonly-access in Netbox-OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1190997 (https://phabricator.wikimedia.org/T404494) [10:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209511 (10phaultfinder) [11:00:05] mvolz: #bothumor I � Unicode. 
All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T1100). [11:00:36] ml-staging-etcd2002 and dse-k8s-ctrl2001 will go down for a Ganeti node reboot [11:00:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet [11:01:44] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:32] PROBLEM - Host dse-k8s-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:58] (03PS2) 10Slyngshede: P:idp swap nda for netbox-readonly-access in Netbox-OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1190997 (https://phabricator.wikimedia.org/T404494) [11:05:30] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [11:05:33] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1190997 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [11:06:00] RECOVERY - Host dse-k8s-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms [11:06:03] (03CR) 10Hnowlan: [C:03+2] trafficserver: set test2wiki to be 50% served by the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1190696 (https://phabricator.wikimedia.org/T405367) (owner: 10Hnowlan) [11:06:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet [11:06:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet [11:06:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet [11:07:13] (03CR) 10Muehlenhoff: [C:03+2] Add maps2012-maps2014 as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1190923 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:07:33] (03CR) 10Slyngshede: [C:03+2] P:idp swap nda for netbox-readonly-access in Netbox-OIDC [puppet] - 10https://gerrit.wikimedia.org/r/1190997 
(https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [11:08:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:09:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11209538 (10phaultfinder) [11:11:53] jmm@cumin2002 drain-node (PID 3784135) is awaiting input [11:12:06] aux-k8s-etcd2005 will go down for a Ganeti node reboot [11:12:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet [11:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:13:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:14:22] PROBLEM - Host aux-k8s-etcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [11:14:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet [11:15:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) 
for draining ganeti node ganeti2038.codfw.wmnet [11:15:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet [11:15:50] RECOVERY - Host aux-k8s-etcd2005 is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms [11:18:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:18:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [11:18:31] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:20:03] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209562 (10phaultfinder) [11:20:09] (03PS2) 10Anzx: mswikiquote: set timezone, sitename and project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189617 (https://phabricator.wikimedia.org/T404700) [11:20:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 24 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189617 (https://phabricator.wikimedia.org/T404700) (owner: 10Anzx) [11:20:39] kubestagemaster2004 will go down for a Ganeti node reboot [11:20:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet [11:20:46] ack [11:20:47] (03PS1) 10Anzx: mswikiquote: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189616 (https://phabricator.wikimedia.org/T404700) [11:21:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189616 (https://phabricator.wikimedia.org/T404700) (owner: 10Anzx) [11:22:42] PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:23:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:26:00] RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 30.53 ms [11:26:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet [11:26:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2039.codfw.wmnet [11:28:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet [11:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.97% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:33:40] jmm@cumin2002 drain-node (PID 3795416) is awaiting input [11:34:32] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11209619 (10fgiunchedi) Ok if we have systemd-networkd everywhere the... [11:35:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet [11:38:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [11:38:32] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:39:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11209622 
(10phaultfinder) [11:39:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11209623 (10phaultfinder) [11:42:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet [11:42:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet [11:43:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet [11:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:46:22] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190288 (owner: 10PipelineBot) [11:47:14] jmm@cumin2002 drain-node (PID 3801579) is awaiting input [11:47:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:48:04] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:48:08] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190288 (owner: 10PipelineBot) [11:49:01] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:49:39] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:50:01] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:50:22] (03PS1) 
10Majavah: P:wmcs::nfsclient: Allow granular control of mounted volumes [puppet] - 10https://gerrit.wikimedia.org/r/1191005 (https://phabricator.wikimedia.org/T405462) [11:50:50] (03CR) 10CI reject: [V:04-1] P:wmcs::nfsclient: Allow granular control of mounted volumes [puppet] - 10https://gerrit.wikimedia.org/r/1191005 (https://phabricator.wikimedia.org/T405462) (owner: 10Majavah) [11:51:09] (03CR) 10Stevemunene: [C:03+2] airflow: Setup an instance for wikidata platform team [puppet] - 10https://gerrit.wikimedia.org/r/1190968 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:51:20] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:51:49] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:51:54] (03PS2) 10Majavah: P:wmcs::nfsclient: Allow granular control of mounted volumes [puppet] - 10https://gerrit.wikimedia.org/r/1191005 (https://phabricator.wikimedia.org/T405462) [11:52:50] (03CR) 10Stevemunene: [C:03+2] dns: provision airflow-wikidata domain [dns] - 10https://gerrit.wikimedia.org/r/1190977 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:53:13] !log stevemunene@dns1004 START - running authdns-update [11:53:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [11:53:32] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:54:37] !log stevemunene@dns1004 END - running authdns-update [11:55:34] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7031/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1191005 (https://phabricator.wikimedia.org/T405462) (owner: 10Majavah) [11:56:47] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis tokwiki in section s5 [11:57:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet [11:58:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [11:58:32] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [12:02:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2041.codfw.wmnet [12:03:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet [12:04:29] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:04:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2042.codfw.wmnet [12:04:50] FIRING: [11x] 
SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:08] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209705 (10phaultfinder) [12:05:17] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:08:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet [12:09:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11209722 (10phaultfinder) [12:10:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [12:13:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet [12:13:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [12:13:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2042.codfw.wmnet [12:14:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2043.codfw.wmnet [12:14:34] 
jouncebot: nowandnext [12:14:34] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [12:14:34] In 1 hour(s) and 45 minute(s): DC Switchover Day 2: Mediawiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T1400) [12:14:35] (03Abandoned) 10Slyngshede: P:idp remove NDA group access from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [12:15:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [12:15:25] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:01] dse-k8s-etcd2003 will go down for a Ganeti reboot [12:18:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet [12:19:48] PROBLEM - Host dse-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:23:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet [12:23:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2043.codfw.wmnet [12:23:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2044.codfw.wmnet [12:24:37] (03CR) 10Elukey: "I think we are very close to the final version, but I'd like to wait for the other change with the ad-hoc recording rules before proceedin" [puppet] - 10https://gerrit.wikimedia.org/r/1176343 
(https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [12:26:00] RECOVERY - Host dse-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.81 ms [12:28:54] jmm@cumin2002 drain-node (PID 3820648) is awaiting input [12:29:19] (03PS1) 10Fabfur: varnish: remove Host header normalization [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) [12:30:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11209761 (10phaultfinder) [12:32:38] (03CR) 10Klausman: [V:03+1 C:03+2] modules/admin: Add Özge and Bartosz to the ml-lab-users group [puppet] - 10https://gerrit.wikimedia.org/r/1190682 (owner: 10Klausman) [12:33:19] dse-k8s-ctrl2002, kubestagemaster2003 and ml-etcd2003 will go down for a Ganeti reboot [12:33:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet [12:33:36] (03PS1) 10Cathal Mooney: Validators: expand nokia pattern to allow routed sub-interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1191012 (https://phabricator.wikimedia.org/T371088) [12:34:01] (03Abandoned) 10Majavah: P:wmcs::nfsclient: Allow granular control of mounted volumes [puppet] - 10https://gerrit.wikimedia.org/r/1191005 (https://phabricator.wikimedia.org/T405462) (owner: 10Majavah) [12:35:12] PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:32] PROBLEM - Host dse-k8s-ctrl2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:42] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:36:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:42] Hi folks, I'm about to deploy this config 
patch: https://gerrit.wikimedia.org/r/c/1187413 [12:38:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [12:38:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet [12:39:03] (03Merged) 10jenkins-bot: ReaderExperiments' ImageBrowsing stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [12:39:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2044.codfw.wmnet [12:39:31] !log mfossati@deploy1003 Started scap sync-world: Backport for [[gerrit:1187413|ReaderExperiments' ImageBrowsing stream configuration (T403259)]] [12:39:32] FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:39:37] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [12:40:14] FIRING: [2x] ProbeDown: Service dse-k8s-ctrl2002:6443 has failed probes (http_dse_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#dse-k8s-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:34] RECOVERY - Host dse-k8s-ctrl2002 is UP: PING OK - Packet loss = 0%, RTA = 30.63 ms [12:40:41] !incidents [12:40:41] 6787 (ACKED) [2x] ProbeDown sre (dse-k8s-ctrl2002:6443 probes/custom codfw) [12:40:44] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.54 ms [12:41:00] RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms 
[12:41:34] hmm... got paged for dse-k8s-ctrl2002, but seems it's recovered? [12:41:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet [12:43:11] what's up with the page? [12:43:18] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [12:43:41] bblack: affected hosts seem to be ok, uptime on dse-k8s-ctrl2002 is only 3 mins so I suppose they will clear [12:43:51] not sure why it might have rebooted though [12:44:32] RESOLVED: [2x] KubernetesCalicoDown: dse-k8s-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:44:33] > dse-k8s-ctrl2002, kubestagemaster2003 and ml-etcd2003 will go down for a Ganeti reboot [12:44:38] topranks: bblack [12:44:42] jmm@cumin2002 drain-node (PID 3830282) is awaiting input [12:44:55] tappof: thanks yep I suspected as much [12:45:10] resolved the page [12:45:11] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11209820 (10phaultfinder) [12:45:14] RESOLVED: [2x] ProbeDown: Service dse-k8s-ctrl2002:6443 has failed probes (http_dse_k8s_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#dse-k8s-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:45:32] taking a peek at the host [12:45:45] oh I see, planned ganeti reboot [12:45:54] (03CR) 10Elukey: [C:03+1] Validators: expand nokia pattern to allow routed sub-interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1191012 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [12:45:56] !log mfossati@deploy1003 mfossati: Backport for 
[[gerrit:1187413|ReaderExperiments' ImageBrowsing stream configuration (T403259)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:46:03] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [12:46:09] yep and I think mor it z is doing others to be aware of [12:46:17] probably we should depool in a way that prevents paging, or downtime [12:46:51] yeah. normally VMs are moved off the ganeti host first, there are a small number that are hard to move because of local disks if I remember right [12:46:52] nobody wants to wake up for a planned reboot in our near-future 24/7 rota :P [12:46:59] but probably should depool [12:47:13] sorry downtime I mean [12:47:23] yeah whichever is easier for the situation I guess [12:49:41] (03PS2) 10Fabfur: varnish: remove Host header normalization [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) [12:51:24] (03CR) 10Slyngshede: [C:03+1] "Please also remember to add dummy secret in labs/private repo, as to not break PCC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1190979 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [12:51:52] jmm@cumin2002 drain-node (PID 3830282) is awaiting input [12:53:59] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [12:54:08] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [12:54:39] (03PS5) 10Dzahn: mariadb: replace legacy fact for memorysize [puppet] - 10https://gerrit.wikimedia.org/r/1180999 [12:55:58] (03CR) 10Ladsgroup: [C:03+2] mariadb: replace legacy fact for memorysize [puppet] - 10https://gerrit.wikimedia.org/r/1180999 (owner: 10Dzahn) [12:57:44] aux-k8s-etcd2003 and dse-k8s-etcd2001 are going down for a Ganeti reboot [12:57:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet [12:58:34] !log mfossati@deploy1003 mfossati: Continuing with sync [12:59:10] fceratto@cumin1002 sanitize-wiki (PID 1113457) is awaiting input [12:59:12] PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:42] PROBLEM - Host dse-k8s-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:50] RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.79 ms [13:01:00] RECOVERY - Host dse-k8s-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.60 ms [13:02:09] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Upgrade swift and ceph frontends to envoy v1.29.12 - https://phabricator.wikimedia.org/T405469 (10MatthewVernon) 03NEW [13:03:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet [13:03:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet [13:03:16] !log update envoyproxy to 1.29.12 on ms-fe nodes T405469 [13:03:23] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:24] !log mfossati@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187413|ReaderExperiments' ImageBrowsing stream configuration (T403259)]] (duration: 23m 53s) [13:03:24] T405469: Upgrade swift and ceph frontends to envoy v1.29.12 - https://phabricator.wikimedia.org/T405469 [13:03:30] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [13:04:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet [13:04:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209918 (10phaultfinder) [13:05:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [13:05:25] FIRING: SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2046.codfw.wmnet [13:06:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet [13:06:33] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis tokwiki in section s5 [13:07:09] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2046.codfw.wmnet [13:07:38] 
!log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2047.codfw.wmnet [13:08:23] (03CR) 10Cathal Mooney: [C:03+2] Validators: expand nokia pattern to allow routed sub-interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1191012 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:10:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [13:10:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:57] (03Merged) 10jenkins-bot: Validators: expand nokia pattern to allow routed sub-interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1191012 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:11:30] jmm@cumin2002 drain-node (PID 3843293) is awaiting input [13:12:12] (03PS1) 10Andrew Bogott: codfw1 ceph: clean up reef specification [puppet] - 10https://gerrit.wikimedia.org/r/1191017 [13:12:13] (03PS1) 10Andrew Bogott: ceph eqiad1: entire cluster to ceph version reef [puppet] - 10https://gerrit.wikimedia.org/r/1191018 [13:12:15] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:12:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:12:44] ml-staging-etcd2002 will go down for a Ganeti node reboot [13:12:50] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ganeti2047.codfw.wmnet [13:12:55] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:13:13] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191017 (owner: 10Andrew Bogott) [13:13:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191018 (owner: 10Andrew Bogott) [13:14:31] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:14:41] PROBLEM - Host ml-staging-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:57] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:15:25] RECOVERY - Host ml-staging-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [13:16:00] (03CR) 10Andrew Bogott: [C:03+2] codfw1 ceph: clean up reef specification [puppet] - 10https://gerrit.wikimedia.org/r/1191017 (owner: 10Andrew Bogott) [13:18:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet [13:18:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2047.codfw.wmnet [13:18:45] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns - cmooney@cumin1003" [13:19:13] fceratto@cumin1002 sanitize-wiki (PID 1151876) is awaiting input [13:20:25] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:51] (03PS2) 10Andrew Bogott: ceph eqiad1: entire cluster to ceph version reef [puppet] - 10https://gerrit.wikimedia.org/r/1191018 [13:21:09] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1191018 (owner: 10Andrew Bogott) [13:21:49] cmooney@cumin1003 netbox (PID 3988727) is awaiting input [13:22:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:22:09] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis tokwiki in section s5 [13:22:25] FIRING: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:22:36] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns - cmooney@cumin1003" [13:22:36] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:23] (03CR) 10Andrew Bogott: [C:03+2] ceph eqiad1: entire cluster to ceph version reef [puppet] - 10https://gerrit.wikimedia.org/r/1191018 (owner: 10Andrew Bogott) [13:25:15] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1005.eqiad.wmnet with OS bookworm [13:27:10] !log upgrade Envoy on puppet servers T403663 [13:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:15] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [13:27:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:29] !log update envoyproxy to 1.29.12 on thanos-fe nodes T405469 [13:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:35] T405469: Upgrade swift and ceph frontends 
to envoy v1.29.12 - https://phabricator.wikimedia.org/T405469 [13:28:05] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Upgrade swift and ceph frontends to envoy v1.29.12 - https://phabricator.wikimedia.org/T405469#11209972 (10MatthewVernon) [13:34:01] (03CR) 10CDanis: [C:03+1] P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:34:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11209993 (10phaultfinder) [13:37:00] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Upgrade swift and ceph frontends to envoy v1.29.12 - https://phabricator.wikimedia.org/T405469#11210009 (10MatthewVernon) [13:37:18] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:37:21] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [13:37:53] !log update envoyproxy to 1.29.12 on apus rgw nodes T405469 [13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:59] T405469: Upgrade swift and ceph frontends to envoy v1.29.12 - https://phabricator.wikimedia.org/T405469 [13:38:49] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:38:49] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/mw-reconcile-fixes on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:41:50] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate new snippet files for reverse range for 2620:0:861:fe17::/64 - cmooney@cumin1003" [13:41:54] 
(03PS1) 10Cathal Mooney: Reverse include file statement for 2620:0:861:fe17::/64 [dns] - 10https://gerrit.wikimedia.org/r/1191025 (https://phabricator.wikimedia.org/T396063) [13:43:23] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Upgrade swift and ceph frontends to envoy v1.29.12 - https://phabricator.wikimedia.org/T405469#11210027 (10MatthewVernon) 05Open→03Resolved [13:43:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-dev/mw-reconcile-fixes on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:44:09] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:44:50] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [13:44:54] cmooney@cumin1003 netbox (PID 3991698) is awaiting input [13:45:18] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generate new snippet files for reverse range for 2620:0:861:fe17::/64 - cmooney@cumin1003" [13:45:18] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:44] (03PS4) 10CDanis: Decrease the Pyrra window to 4w for AW's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) (owner: 10Elukey) [13:45:47] (03CR) 10CDanis: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) (owner: 10Elukey) [13:46:30] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1005.eqiad.wmnet with reason: host reimage [13:46:33] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephmon1005.eqiad.wmnet with reason: host reimage [13:46:49] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis tokwiki in section s5 [13:47:20] (03CR) 10CDanis: [C:03+1] "lgtm! https://puppet-compiler.wmflabs.org/output/1190284/5025/titan1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) (owner: 10Elukey) [13:47:54] (03CR) 10CI reject: [V:04-1] Decrease the Pyrra window to 4w for AW's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) (owner: 10Elukey) [13:48:17] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis tokwiki in section s5 [13:48:20] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:49:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:49:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11210078 (10phaultfinder) [13:49:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11210076 (10phaultfinder) [13:49:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device 
ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11210077 (10phaultfinder) [13:49:58] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis tokwiki in section s5 [13:51:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Review/cleanup content of /srv/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622#11210082 (10MoritzMuehlenhoff) >>! In T364622#11208716, @andrea.denisse wrote: > Hi @MoritzM... [13:53:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [13:53:37] (03PS1) 10Andrew Bogott: Update cloudceph specs to expect reef [puppet] - 10https://gerrit.wikimedia.org/r/1191027 (https://phabricator.wikimedia.org/T404249) [13:54:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11210096 (10Jclark-ctr) [13:55:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [13:55:59] (03CR) 10Andrew Bogott: [C:03+2] Update cloudceph specs to expect reef [puppet] - 10https://gerrit.wikimedia.org/r/1191027 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [13:56:40] (03CR) 10CDanis: [C:03+1] "recheck" [puppet] - 
10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) (owner: 10Elukey) [13:56:40] (03CR) 10David Caro: [C:03+1] "LGTM, feel free to reword the test names to something less confusing" [puppet] - 10https://gerrit.wikimedia.org/r/1191027 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [13:56:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis tokwiki in section s5 [13:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:08] (03CR) 10Elukey: [C:03+2] Decrease the Pyrra window to 4w for AW's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1190284 (https://phabricator.wikimedia.org/T394057) (owner: 10Elukey) [14:00:05] jasmine_, swfrench-wmf, and hnowlan: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for DC Switchover Day 2: Mediawiki. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T1400). [14:00:14] 06SRE, 10Wikimedia-Mailing-lists: Request for a mailing list for Moore Wikimedians - https://phabricator.wikimedia.org/T405164#11210158 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done: https://lists.wikimedia.org/postorius/lists/usergroup-moore.lists.wikimedia.org [14:00:43] <_joe_> o/ [14:00:45] o/ [14:00:52] /o/ [14:00:55] <_joe_> \0 [14:00:58] fceratto@cumin1002 sanitize-wiki (PID 1227150) is awaiting input [14:00:58] \o\ [14:01:02] o7 [14:01:06] <_joe_> heyy [14:01:40] (03PS1) 10Cathal Mooney: Nokia: support mixing of L2 and L3 subinterfaces on SR Linux [homer/public] - 10https://gerrit.wikimedia.org/r/1191036 (https://phabricator.wikimedia.org/T402577) [14:02:07] federico3: how long does sanitize-wiki usually take? 
[14:02:25] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:02:35] it's doing just a check, no changes, and takes a minute [14:02:39] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:02:39] (it just ended) [14:02:40] ok cool :) [14:02:45] Just checking [14:03:04] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis tokwiki in section s5 [14:03:23] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:03:36] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:04:13] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1005.eqiad.wmnet with OS bookworm [14:04:13] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:04:24] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:04:47] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[14:04:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11210181 (10phaultfinder) [14:05:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:39] FYI, we are targetting 1500UTC for the RO time [14:10:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11210189 (10phaultfinder) [14:10:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11210188 (10phaultfinder) [14:10:43] (03CR) 10Scott French: [C:03+1] "Sounds good - @jkilani@wikimedia.org, in the interest of keeping this patch focused on just the switchover, let's keep things as they are " [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [14:12:15] (03CR) 10Ladsgroup: [C:03+1] wmnet: update CNAME records for DB masters to codfw [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [14:13:14] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon1006.eqiad.wmnet with OS bookworm [14:15:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:19:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:19:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11210220 (10phaultfinder) [14:19:58] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:20:56] (actionable for later) those PDU sensors - we should improve that alert as I can't make out from either the alert or the phab task what sensor/value is actually over the limit and what it means (amps I assume?), and that could be pretty serious [14:21:39] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622#11210226 (10Dzahn) [14:22:32] (03CR) 10Ssingh: [C:03+1] Reverse include file statement for 2620:0:861:fe17::/64 [dns] - 10https://gerrit.wikimedia.org/r/1191025 (https://phabricator.wikimedia.org/T396063) (owner: 10Cathal Mooney) [14:22:49] (03CR) 10Dzahn: "thank you very much for merging this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1180999 (owner: 10Dzahn) [14:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:27:13] http://listen.hatnote.com/#ru,fr,de,pl,bg,pt,hu,sr,wikidata,en << This should cover every section except commons that doesn't have a toggle [14:27:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [14:27:44] (03CR) 10Giuseppe Lavagetto: [C:03+1] P:puppetserver::volatile Include XCheeseScore private repo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [14:29:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11210256 (10phaultfinder) [14:30:41] (03CR) 10Dzahn: "I can confirm this fixed the puppet error I got on trixie in a cloud VPS project where I used the mariadb class :)" [puppet] - 10https://gerrit.wikimedia.org/r/1180999 (owner: 10Dzahn) [14:31:56] (03PS1) 10Dzahn: wikistats: add support for PHP8.4 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1191065 [14:32:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [14:32:44] Hi folks, just a reminder that we'll be switching Mediawiki over to codfw shortly. Feel free to ping us if you notice anything [14:33:28] 👓 [14:34:18] (03CR) 10Muehlenhoff: "Rather at this to wmflib::debian_php_version and move wikistats to it" [puppet] - 10https://gerrit.wikimedia.org/r/1191065 (owner: 10Dzahn) [14:34:23] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1006.eqiad.wmnet with reason: host reimage [14:34:44] (03CR) 10Cathal Mooney: [C:03+2] Reverse include file statement for 2620:0:861:fe17::/64 [dns] - 10https://gerrit.wikimedia.org/r/1191025 (https://phabricator.wikimedia.org/T396063) (owner: 10Cathal Mooney) [14:34:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11210266 (10phaultfinder) [14:35:05] (03CR) 10Dzahn: "fair enough" [puppet] - 10https://gerrit.wikimedia.org/r/1191065 (owner: 10Dzahn) [14:35:25] FIRING: SystemdUnitFailed: postgresql@15-main.service on maps2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:40] !log jasmine@deploy1003 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891 [14:35:46] T399891: 🚀 Southward Datacenter Switchover (Sept. 
2025) - https://phabricator.wikimedia.org/T399891 [14:36:10] (03Abandoned) 10Dzahn: wikistats: add support for PHP8.4 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1191065 (owner: 10Dzahn) [14:38:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1006.eqiad.wmnet with reason: host reimage [14:38:32] (03PS1) 10Dzahn: wikistats: use wmflib::debian_php_version() to set PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1191066 [14:39:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191066 (owner: 10Dzahn) [14:39:56] (03PS1) 10Dzahn: wmflib: add trixie support to debian_php_version function [puppet] - 10https://gerrit.wikimedia.org/r/1191067 [14:40:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:27] (03CR) 10Dzahn: "well.. only after this, hah: https://gerrit.wikimedia.org/r/1191067" [puppet] - 10https://gerrit.wikimedia.org/r/1191066 (owner: 10Dzahn) [14:41:23] 06SRE, 10DNS, 10Domains, 06Traffic-Icebox, 07HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071#11210296 (10ssingh) There is work underway by Timo on unifying the mobile and desktop variants for Wikimedia projects; see T214998. There are no pl... 
[14:42:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [14:43:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [14:43:57] (03CR) 10Elukey: [C:03+1] "I have zero context on what ESI-LAG does, but it looks good from the py perspective :D" [homer/public] - 10https://gerrit.wikimedia.org/r/1190983 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [14:44:56] (03CR) 10RhinosF1: [C:03+1] wikistats: use wmflib::debian_php_version() to set PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1191066 (owner: 10Dzahn) [14:45:43] (03CR) 10RhinosF1: [C:03+1] wmflib: add trixie support to debian_php_version function [puppet] - 10https://gerrit.wikimedia.org/r/1191067 (owner: 10Dzahn) [14:47:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:47:11] (03CR) 10Dzahn: "RhinosF1: this being merged is also what fixed the puppet issue on wikistats-trixie and got us to the other unrelated fixes" [puppet] - 10https://gerrit.wikimedia.org/r/1180999 (owner: 10Dzahn) [14:47:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [14:49:53] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191069 [14:50:17] 06SRE, 06Traffic: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210334 (10ssingh) [14:51:07] We are starting to run the switchover cookbook cc topranks bblack arnoldokoth tappof [14:51:22] claime: good luck!! [14:51:25] gl all! [14:51:42] good luck 🍀 [14:51:43] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from eqiad to codfw [14:51:46] is there a shared tmux to watch as usual? 
[14:51:58] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from eqiad to codfw [14:52:01] taavi: Ah, you're not on -sre-private [14:52:09] taavi: giving you the command in query, just a sec [14:52:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [14:52:21] hmm, should I? (and thanks!) [14:52:29] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from eqiad to codfw [14:52:42] taavi: that's not for me to answer, but it's fine :D [14:53:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [14:56:24] (03PS1) 10Ssingh: wikimediafoundation.org: verify search console ownership [dns] - 10https://gerrit.wikimedia.org/r/1191071 (https://phabricator.wikimedia.org/T404974) [14:56:35] (Off-topic: database switchover watch party is not a concept I ever thought I'd see...) [14:57:40] perryprog: listen party is at http://listen.hatnote.com/#ru,fr,de,pl,bg,pt,hu,sr,wikidata,en [14:58:04] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from eqiad to codfw [14:58:08] (03CR) 10Slyngshede: [C:03+1] "Looks good." 
[dns] - 10https://gerrit.wikimedia.org/r/1191071 (https://phabricator.wikimedia.org/T404974) (owner: 10Ssingh) [14:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:35] (03CR) 10CI reject: [V:04-1] wikimediafoundation.org: verify search console ownership [dns] - 10https://gerrit.wikimedia.org/r/1191071 (https://phabricator.wikimedia.org/T404974) (owner: 10Ssingh) [14:58:50] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1006.eqiad.wmnet with OS bookworm [14:59:09] Proceeding with 01-stop-maintenance, Go/NO go? [14:59:14] Go [14:59:14] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf group for LMorgantini - https://phabricator.wikimedia.org/T405405#11210374 (10LMorgantini-WMF) Thank you! And previously I had followed the instructions here: https://www.mediawiki.org/wiki/Product_Analytics/Superset_Access [14:59:17] lesgoo o> [14:59:20] Go [14:59:28] go! [14:59:33] o> [14:59:38] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from eqiad to codfw [14:59:38] 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2025/26-Q1): Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478 (10fgiunchedi) 03NEW [14:59:42] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:01:30] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw [15:01:39] !log cmooney@dns2005 START - running authdns-update [15:01:59] Proceeding with read only, final go/no go? 
:) [15:02:21] gogo [15:02:22] go [15:02:27] * swfrench-wmf thumbs up [15:02:29] 🚀 [15:02:35] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from eqiad to codfw [15:02:35] !log jasmine@cumin1003 MediaWiki read-only period starts at: 2025-09-24 15:02:35.395589 [15:02:36] letsa ga [15:02:41] 🍿 [15:02:57] And we're silent [15:03:01] almost [15:03:04] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from eqiad to codfw [15:03:07] no more music [15:03:07] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from eqiad to codfw [15:03:10] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:03:14] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:03:20] that log failure is a classic :D [15:03:36] Some recent change y'all did has caused vandalism rates to drop off a cliff. Good work; hope it stays that way. [15:03:51] "the" logs [15:03:57] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from eqiad to codfw [15:03:59] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:04:01] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from eqiad to codfw [15:04:03] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:04:33] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [15:04:35] cmooney@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:04:40] mutante: you don't know about /var/log/the.log ? 
[15:04:41] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from eqiad to codfw [15:04:44] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from eqiad to codfw [15:04:46] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:04:48] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:04:49] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw [15:04:50] perryprog: yeah we noticed the community wasn’t up to the task *ducks* [15:04:51] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:04:56] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from eqiad to codfw [15:04:58] jasmine@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:05:02] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1191071 (https://phabricator.wikimedia.org/T404974) (owner: 10Ssingh) [15:05:05] Lucas_WMDE: bold move cotton let's see if it pays off [15:05:07] We're back [15:05:08] music! 
[15:05:14] \o/ [15:05:15] perryprog: lol is that a read-only joke [15:05:15] 🎉🎉🎉 [15:05:16] !log jasmine@cumin1003 MediaWiki read-only period ends at: 2025-09-24 15:05:16.845948 [15:05:17] edit went through on s7 [15:05:19] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw [15:05:19] Yeah cdanis :) [15:05:28] Finally a resolution to https://bash.toolforge.org/quip/AU8FCPz66snAnmqnLHDj — WAIT NAURR [15:05:57] 💀 [15:06:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:06:25] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from eqiad to codfw [15:06:25] !log root@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: sync [15:06:30] 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2025/26-Q1): cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11210428 (10Andrew) [15:06:53] 1:41 minutes read-only [15:06:53] !log root@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: sync [15:06:55] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from eqiad to codfw [15:06:56] I think that's a new record [15:07:02] oh that's nice [15:07:03] wow [15:07:05] small errors on s6 dbs [15:07:12] Sorry I can't read [15:07:16] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from eqiad to codfw [15:07:17] !log root@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:07:19] !log 
cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:20] 2:41 [15:07:25] Still very good [15:07:29] ah, that's more what I am used to [15:07:32] yup, nicely done [15:07:38] nice job team! [15:07:40] not to a worrying level, but keeping an eye [15:07:48] !log root@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:08:21] !log root@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [15:08:32] now same errors on all sections [15:09:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:11] 17.6KQPS on the most loaded db [15:09:13] !log root@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [15:09:18] !log cmooney@dns2005 START - running authdns-update [15:09:25] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw [15:09:52] it's on s7 hosts which we already have a ticket for: T404964 [15:09:52] waiting for metrics to refresh to see if it is only a spike [15:09:53] T404964: The load on s7 is too high - https://phabricator.wikimedia.org/T404964 [15:09:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11210433 (10phaultfinder) [15:10:17] errors going down, I think [15:10:24] I'll move a host from another section later. 
I don't think it's anything to be worried about [15:10:25] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:43] I agree, just monitoring and updating what I see [15:11:00] !log cmooney@dns2005 START - running authdns-update [15:11:12] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from eqiad to codfw [15:11:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:11:26] I wonder if the jobs rushing to catch up causes a bit of an overload [15:11:46] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from eqiad to codfw [15:11:50] 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2025/26-Q1): cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11210449 (10Andrew) 1050 and 1051 won't be pooled immediately, they're being reserved for T405478 [15:11:54] QPS going down [15:12:12] !log cmooney@dns2005 END - running authdns-update [15:12:16] errors almost back to 0 [15:12:34] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:12:38] The jobrunner errors were all just them saying DBs were RO, so that was fine [15:12:38] the small peak was at around ~:08 [15:13:14] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:13:16] claime: you didn't understand; there was a bit of high load, and I mentioned that it could be the jobs catching up on all the things that were not done during the transition [15:13:22] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:13:37] !log !log logs again - no more need to check logs of the log bot log command. optionally relog the missed log lines [15:13:38] so not worrying about the read-only transition, but the load afterwards [15:13:54] jynus: I didn't misunderstand anything, I wasn't answering what you were saying, I was mentioning that the jobrunner alerts were transient. [15:13:57] in any case, this is for tuning; I have nothing actionable at the moment [15:13:58] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1191071 (https://phabricator.wikimedia.org/T404974) (owner: 10Ssingh) [15:14:01] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:14:09] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:14:20] claime: apologies, then I misunderstood [15:14:26] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:14:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11210460 (10phaultfinder) [15:14:56] All good. 
[15:15:00] (03CR) 10Jasmine: [C:03+2] wmnet: update CNAME records for DB masters to codfw [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [15:15:33] !log jasmine@dns1004 START - running authdns-update [15:16:00] (03PS2) 10Ssingh: wikimediafoundation.org: verify search console ownership [dns] - 10https://gerrit.wikimedia.org/r/1191071 (https://phabricator.wikimedia.org/T404974) [15:16:42] !log Phase 9: Update DNS records for new database masters [15:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:23] !log jasmine@dns1004 END - running authdns-update [15:17:44] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from eqiad to codfw [15:18:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [15:18:26] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [15:18:42] (03CR) 10Ssingh: [C:03+2] wikimediafoundation.org: verify search console ownership [dns] - 10https://gerrit.wikimedia.org/r/1191071 (https://phabricator.wikimedia.org/T404974) (owner: 10Ssingh) [15:18:49] !log sukhe@dns1004 START - running authdns-update [15:20:05] !log sukhe@dns1004 END - running authdns-update [15:20:06] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device 
ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11210497 (10phaultfinder) [15:23:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [15:23:26] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [15:26:34] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210509 (10ssingh) 05Open→03Resolved a:03ssingh @JKelsoteel-WMF `wikimediafoundation.org` is now verified. If you are unable to set the permissi... 
[15:29:42] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from eqiad to codfw [15:32:51] (03CR) 10Jasmine: [C:03+2] geo-maps: update map default to list codfw first [dns] - 10https://gerrit.wikimedia.org/r/1189598 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [15:33:20] !log jasmine@dns1004 START - running authdns-update [15:34:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:40] !log jasmine@dns1004 END - running authdns-update [15:36:03] (03CR) 10Jasmine: [C:03+2] debug.json: order codfw (primary) DC backends first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190293 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [15:36:50] (03Merged) 10jenkins-bot: debug.json: order codfw (primary) DC backends first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190293 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [15:37:19] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11210576 (10cmooney) [15:39:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11210593 (10phaultfinder) [15:40:22] !log jasmine@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891 (duration: 64m 41s) [15:40:28] 
T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891 [15:40:43] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11210598 (10cmooney) As discussed in today's meeting I believe all the cloudcephosd hosts have jumbo frames enabled on all their physical interfaces. So there should be n... [15:41:19] !log jasmine@deploy1003 Started scap sync-world: Backport for [[gerrit:1190293|debug.json: order codfw (primary) DC backends first (T399891)]] [15:41:45] 06SRE, 06Traffic: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210610 (10JKelsoteel-WMF) 05Resolved→03Open Hi @ssingh , I just tested this with our service account (I am part of ITS), and I am still seeing the window prompting me to v... [15:42:57] 06SRE, 06Traffic: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210616 (10ssingh) >>! In T404974#11210610, @JKelsoteel-WMF wrote: > Hi @ssingh , I just tested this with our service account (I am part of ITS), and I am still seeing the wind... [15:43:55] 06SRE, 06Traffic: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210617 (10JKelsoteel-WMF) No problem. 
Here it is: google-site-verification=m7jEgoI4DOUy0u6cebxtp7oJT7s3nnNyPWgmPQmNEjc [15:44:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191067 (owner: 10Dzahn) [15:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:44:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11210620 (10phaultfinder) [15:45:39] 06SRE, 06Traffic: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210622 (10ssingh) >>! In T404974#11210617, @JKelsoteel-WMF wrote: > No problem. Here it is: google-site-verification=m7jEgoI4DOUy0u6cebxtp7oJT7s3nnNyPWgmPQmNEjc Thanks, and... [15:45:44] (03CR) 10Dzahn: [C:03+2] wmflib: add trixie support to debian_php_version function [puppet] - 10https://gerrit.wikimedia.org/r/1191067 (owner: 10Dzahn) [15:45:50] (03PS2) 10Dzahn: wmflib: add trixie support to debian_php_version function [puppet] - 10https://gerrit.wikimedia.org/r/1191067 [15:47:51] !log jasmine@deploy1003 jasmine: Backport for [[gerrit:1190293|debug.json: order codfw (primary) DC backends first (T399891)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:47:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405403#11210625 (10Dzahn) p:05Triage→03High [15:47:57] T399891: 🚀 Southward Datacenter Switchover (Sept. 
2025) - https://phabricator.wikimedia.org/T399891 [15:48:15] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11210627 (10Dzahn) p:05Triage→03High [15:48:28] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11210628 (10Dzahn) p:05Triage→03High [15:48:32] (03PS1) 10Ssingh: wikimediafoundation.org: verify ITS account for search console [dns] - 10https://gerrit.wikimedia.org/r/1191081 (https://phabricator.wikimedia.org/T404974) [15:49:03] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210631 (10JKelsoteel-WMF) Yes, I am using the ITS service account designated to grant others access to our various GSC properties. The account is oktaser... [15:49:09] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11210632 (10Dzahn) p:05Triage→03High [15:49:22] 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11210634 (10matmarex) I'm no... 
[15:49:48] (03PS1) 10Muehlenhoff: Record LDAP access for lmorgantini [puppet] - 10https://gerrit.wikimedia.org/r/1191082 [15:49:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11210636 (10phaultfinder) [15:50:30] !log jasmine@deploy1003 jasmine: Continuing with sync [15:51:05] (03CR) 10Ssingh: [C:03+2] wikimediafoundation.org: verify ITS account for search console [dns] - 10https://gerrit.wikimedia.org/r/1191081 (https://phabricator.wikimedia.org/T404974) (owner: 10Ssingh) [15:51:14] !log sukhe@dns1004 START - running authdns-update [15:51:32] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for lmorgantini [puppet] - 10https://gerrit.wikimedia.org/r/1191082 (owner: 10Muehlenhoff) [15:52:31] !log sukhe@dns1004 END - running authdns-update [15:53:55] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210655 (10ssingh) >>! In T404974#11210631, @JKelsoteel-WMF wrote: > Yes, I am using the ITS service account designated to grant others access to our vari... [15:55:09] !log jasmine@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190293|debug.json: order codfw (primary) DC backends first (T399891)]] (duration: 13m 49s) [15:55:15] T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891 [15:56:32] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210662 (10JKelsoteel-WMF) Understood, thank you! Going forward, as part of the process for handling these requests, once the verification is completed fo... [15:57:46] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210667 (10ssingh) >>! 
In T404974#11210662, @JKelsoteel-WMF wrote: > Understood, thank you! Going forward, as part of the process for handling these reque... [15:57:53] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210668 (10JKelsoteel-WMF) Confirming it is working for our service account now. 👍 Thank you! [15:58:17] Just got "An error has occurred while searching: Search is currently too busy. Please try again later." on WMC [15:58:43] am consistently getting it actually [15:58:59] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210672 (10ssingh) OK great, resolving this for now but yeah, let's add this to the docs and see how it serves us next time. Thanks and sorry for the... [15:59:05] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210674 (10ssingh) 05Open→03Resolved [15:59:51] 06SRE, 06Traffic, 13Patch-For-Review: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974#11210677 (10JKelsoteel-WMF) Thank you all! And no worries, appreciate the help! [15:59:54] okay now I'm not. 🤔 [16:00:14] I don't have any request ID unfortunately [16:00:56] 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11210686 (10kostajh) >>! In... 
[16:01:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Requesting Kerberos access for sd - https://phabricator.wikimedia.org/T405219#11210690 (10Ottomata) [16:02:13] 06SRE, 10Wikimedia-Mailing-lists: Reports of unsubscribe from wikitech-ambassadors failing to work - https://phabricator.wikimedia.org/T405153#11210694 (10Dzahn) A general comment. If you have issues with a specific list, please try mailing -owner@lists.wikimedia.org to reach them directly. [16:03:51] perryprog: might be T405394 / T405396, not sure [16:03:52] T405394: Point cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394 [16:03:52] T405396: Re-enable performance governor on Cirrussearch hosts - https://phabricator.wikimedia.org/T405396 [16:03:59] word [16:04:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:04:50] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:59] 06SRE, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Dadar Gulung (September 8 - September 26)), 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11210701 (10matmarex) It see... 
[16:06:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:08:23] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11210719 (10Dzahn) Hi, the group you are requesting, `analytics-privatedata-users` can be used in [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analyti... [16:08:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11210724 (10Dzahn) [16:10:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for BTracy-WMF - https://phabricator.wikimedia.org/T405366#11210745 (10Dzahn) p:05Triage→03Medium [16:11:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:15:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:20:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles 
latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:21:26] (03PS1) 10Andrew Bogott: Cloudcephosd1050: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) [16:23:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.725s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:23:48] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:24:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:25:02] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [16:28:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.148s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:28:24] (03PS1) 10Dzahn: admin: add Kerberos to existing shell user sd [puppet] - 10https://gerrit.wikimedia.org/r/1191088 (https://phabricator.wikimedia.org/T405219) [16:28:40] (03CR) 10Dzahn: [C:03+2] wmflib: add trixie support to debian_php_version function [puppet] - 10https://gerrit.wikimedia.org/r/1191067 (owner: 10Dzahn) [16:29:33] (03CR) 10Dzahn: [C:03+2] admin: add Kerberos to existing shell user sd [puppet] - 10https://gerrit.wikimedia.org/r/1191088 (https://phabricator.wikimedia.org/T405219) (owner: 10Dzahn) [16:31:54] (03CR) 10Andrew Bogott: "totally untested" [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [16:32:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Requesting Kerberos access for sd - https://phabricator.wikimedia.org/T405219#11210906 (10Dzahn) Hey @SD0001 please check your email. 
You should have received one with further instr... [16:34:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Requesting Kerberos access for sd - https://phabricator.wikimedia.org/T405219#11210925 (10Dzahn) 05Open→03In progress [16:35:16] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11210933 (10Dzahn) [16:35:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:36:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:39] (03PS1) 10Zabe: maintain-views: Hide rev_sha1 from wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) [16:41:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11210977 (10Dzahn) Hi Data Engineering, the requesting user appears to already have access to superset but with the exception of some dashboards. Tag... [16:41:06] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:41:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:41:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11210980 (10Dzahn) p:05Triage→03Medium [16:42:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:59] (03PS2) 10Zabe: maintain-views: Hide rev_sha1 and ar_sha1 from wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) [16:44:26] (03CR) 10Ottomata: ReaderExperiments' ImageBrowsing stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [16:44:58] 10ops-codfw, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495 (10phaultfinder) 03NEW [16:45:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11210994 (10phaultfinder) [16:45:42] (03CR) 10Scott French: [C:03+1] envoyproxy, services_proxy: Update configuration for Envoy 1.29 [puppet] - 10https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [16:47:13] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T405380#11211038 (10Dzahn) p:05Triage→03High [16:47:20] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11211039 (10Dzahn) p:05Triage→03High [16:47:24] (03Abandoned) 10Scott French: shellbox-syntaxhighlight: revert eqiad to PHP 7.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111309 (owner: 10Scott French) [16:47:30] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11211040 (10Dzahn) p:05Triage→03High [16:48:20] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11211052 (10Dzahn) [16:50:14] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405376#11211057 (10phaultfinder) [16:51:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499 (10cmooney) 03NEW p:05Triage→03Medium [16:52:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:52:29] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11211098 (10Dzahn) tagging Data Engineering for visibility per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analytics-privatedata-users Data Engineering, can... 
[16:52:32] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:57] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11211101 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [16:53:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11211105 (10Dzahn) [16:54:15] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211114 (10Dzahn) Hey @thcipriani there is a pending request for deployment access over here. Cheers [16:54:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:55:00] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211119 (10Dzahn) note: this is a contractor email address..so we have to add an expiration date [16:56:48] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211122 (10Dzahn) @AMarkossyan-WMF Hello, it seems like ebomani is a contractor (based on the email address format used to sign our L3 agreement). This means we need to add an expiry date an... 
[16:57:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:57:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211124 (10Dzahn) [16:58:24] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211137 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [16:58:53] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211143 (10EBomani) Hi @Dzahn , I joined as a contractor but converted to full-time about a year ago. I am not sure why it still indicates that and not sure what to do to fix it. Please inf... [16:59:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:59:30] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681#11211146 (10Dzahn) Fixing the tags since this is not an LDAP group request. [17:01:14] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:03:31] (03PS1) 10Mstyles: OATHAuth: Increase 2FA opt-in to 20% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191100 (https://phabricator.wikimedia.org/T399664) [17:04:56] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211158 (10Dzahn) Hello @EBomani Gotcha! (and welcome to staff) I see you signed L3 a few days ago on September 19. But that still shows your contractor email address is being used. Did... [17:05:04] 10ops-codfw, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11211159 (10phaultfinder) [17:05:11] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11211160 (10phaultfinder) [17:05:48] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681#11211165 (10Dzahn) [17:06:35] (03CR) 10Dzahn: [C:03+1] "Niklas has approved on the ticket and this looks ready to go. (speaking as the clinic duty of this week)" [puppet] - 10https://gerrit.wikimedia.org/r/1189257 (https://phabricator.wikimedia.org/T404681) (owner: 10Cwhite) [17:09:35] (03PS2) 10Krinkle: deployment-prep: update shadowed "default_php_version" overrides [puppet] - 10https://gerrit.wikimedia.org/r/1155793 [17:13:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11211183 (10cmooney) For reference these are the vlans / IPs currently connected: ` lvs1018 - enp94s0f0np0 - vlan1031 - 10.64.130.18/24 - private1-e1-... 
[17:14:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:14:26] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:15:31] (03CR) 10Dzahn: [C:03+1] "confirmed manager in Dayforce. Also seems like manager approval was not needed anymore since it's wmf staff and this specific group and th" [puppet] - 10https://gerrit.wikimedia.org/r/1189257 (https://phabricator.wikimedia.org/T404681) (owner: 10Cwhite) [17:15:45] (03CR) 10Dzahn: [C:03+2] admin: add huei tan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1189257 (https://phabricator.wikimedia.org/T404681) (owner: 10Cwhite) [17:17:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681#11211187 (10Dzahn) 05Open→03Resolved p:05Triage→03Medium a:03Dzahn Hello @hueitan you have been added to the requested group. 
Cheers [17:18:00] FIRING: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681#11211200 (10Dzahn) P.S. If you followed specific docs to create this and it told you to tag this as an LDAP-access-request please change that to "SRE-Acces... [17:19:12] (03CR) 10Dzahn: [C:03+2] wikistats: use wmflib::debian_php_version() to set PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1191066 (owner: 10Dzahn) [17:20:38] (03PS2) 10Dzahn: wikistats: use wmflib::debian_php_version() to set PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1191066 [17:21:40] (03PS1) 10Cathal Mooney: lvs1018: remove L2 sub-interface config for row E/F vlans [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) [17:22:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11211218 (10cmooney) [17:22:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Tidy up lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11211217 (10cmooney) [17:24:14] 06SRE, 06collaboration-services, 10envoy, 06serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11211243 (10Dzahn) p:05Triage→03High This seems like it's treated as High priority. That being said, I am not sure if clinic duty is still supposed to make decisions on priorit... 
[17:24:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:26:06] (03CR) 10Dzahn: [C:03+2] wikistats: use wmflib::debian_php_version() to set PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1191066 (owner: 10Dzahn) [17:27:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:31:59] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211253 (10thcipriani) >>! In T405124#11211113, @Dzahn wrote: > Hey @thcipriani > > there is a pending request for deployment access over here. > > Cheers Approved! @EBomani let's meet u... [17:34:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11211276 (10phaultfinder) [17:35:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:35:55] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11211278 (10Aklapper) >>! In T405124#11211158, @Dzahn wrote: > I see you signed L3 a few days ago on September 19. But that still shows your contractor email address is being used. 
Indeed, `... [17:36:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:44:50] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [17:46:50] 06SRE, 06collaboration-services, 10envoy, 06serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11211325 (10MoritzMuehlenhoff) >>! In T403663#11211243, @Dzahn wrote: > That being said, I am not sure if clinic duty is still supposed to make decisions on priority at all. It's n... [17:47:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [17:48:18] (03CR) 10Dzahn: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1190690 (https://phabricator.wikimedia.org/T390948) (owner: 10Jelto) [17:51:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:53:00] RESOLVED: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:43] (03CR) 10BCornwall: [C:03+2] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [17:55:11] (03PS1) 10Ladsgroup: ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1191130 [17:56:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [17:57:38] (03PS1) 10Ladsgroup: wikimedia.org: Add list.wikimedia.org pointing to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1191131 [17:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[18:00:05] brennen and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T1800). [18:00:46] o/ [18:03:50] choo choo [18:07:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:07:57] !log 1.45.0-wmf.20 train status (T396381): logs ok, no current blockers. rolling to group1. [18:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:03] T396381: 1.45.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T396381 [18:08:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:08:39] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191133 (https://phabricator.wikimedia.org/T396381) [18:08:41] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191133 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot) [18:09:27] (03PS1) 10Krinkle: varnish: Enable unified mobile routing on Wikivoyage and Wikiversity [puppet] - 10https://gerrit.wikimedia.org/r/1191134 (https://phabricator.wikimedia.org/T403510) [18:09:27] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on Wikivoyage and Wikiversity 
(group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191135 (https://phabricator.wikimedia.org/T403510) [18:09:34] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191133 (https://phabricator.wikimedia.org/T396381) (owner: 10TrainBranchBot) [18:16:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:19:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:19:54] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:20:08] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11211449 (10phaultfinder) [18:20:51] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.20 refs T396381 [18:20:57] T396381: 1.45.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T396381 [18:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - 
TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:25:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11211462 (10phaultfinder) [18:25:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405495#11211461 (10phaultfinder) [18:33:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:35:19] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11211475 (10Ottomata) Perhaps https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels helps? I think what is needed is just "Shell account added to analytics-pri... 
[18:37:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:38:04] (03PS1) 10Marco Fossati: Fix typo in ReaderExperiments' ImageBrowsing stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191144 (https://phabricator.wikimedia.org/T403259) [18:39:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11211482 (10phaultfinder) [18:41:42] (03PS1) 10JHathaway: CHANGELOG: add changelogs for release v11.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191145 [18:43:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:45:42] (03CR) 10Dr0ptp4kt: [C:03+1] Fix typo in ReaderExperiments' ImageBrowsing stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191144 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [18:45:50] (03CR) 10BCornwall: [C:03+1] ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [18:46:36] (03CR) 10Marco Fossati: ReaderExperiments' ImageBrowsing stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [18:48:27] RESOLVED: [2x] ProbeDown: Service 
wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11211520 (10phaultfinder) [18:53:37] (03CR) 10BCornwall: [C:03+1] wikimedia.org: Add list.wikimedia.org pointing to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1191131 (owner: 10Ladsgroup) [18:57:38] (03CR) 10JHathaway: [C:03+2] CHANGELOG: add changelogs for release v11.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1191145 (owner: 10JHathaway) [18:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:01:26] (03CR) 10Ssingh: [C:03+1] "No strong opinions but since changing to spaces requires more work, I think we can just keep it as it is." 
[puppet] - 10https://gerrit.wikimedia.org/r/1180255 (owner: 10BCornwall) [19:02:19] (03CR) 10Cathal Mooney: "overall lgtm, one error I noticed in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [19:03:02] (03CR) 10BCornwall: [V:03+2 C:03+2] ncredir: update vim modeline options for dat file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180255 (owner: 10BCornwall) [19:03:48] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11211565 (10Novem_Linguae) [19:04:58] (03CR) 10Cathal Mooney: Cloudcephosd1050: Configure ceph with a single nic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [19:05:50] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11211572 (10Dzahn) Thanks @Ottomata I appreciate the recommendation and will go with the minimum level above the current access. Though there is a bit more c... [19:06:14] 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2025/26-Q1), 13Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11211573 (10cmooney) I can confirm the switch is already set to accept tagged traffic for t... [19:07:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11211580 (10Dzahn) We still have to move the user account from "ldap_only" to "that other section" (which is actually shell access for some but not for others... 
[19:08:17] (03PS2) 10BCornwall: ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [19:08:26] (03CR) 10BCornwall: ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [19:08:51] (03CR) 10BCornwall: [C:03+1] ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [19:09:01] (03PS2) 10Andrew Bogott: Cloudcephosd1050: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) [19:09:04] (03CR) 10Andrew Bogott: Cloudcephosd1050: Configure ceph with a single nic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [19:09:18] (03CR) 10Ladsgroup: ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [19:13:20] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11211610 (10Ottomata) Hm, SSH access != posix shell account (perhaps this is a confusing term?). In order to have a uid to add to a group in data.yaml, they m... 
[19:14:15] (03PS3) 10Andrew Bogott: Cloudcephosd1050: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) [19:14:44] (03PS3) 10BCornwall: ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [19:14:49] (03CR) 10Ladsgroup: [V:03+2 C:03+2] ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [19:14:49] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on Wikivoyage and Wikiversity [puppet] - 10https://gerrit.wikimedia.org/r/1191134 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [19:14:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11211615 (10phaultfinder) [19:15:12] (03CR) 10Ladsgroup: [C:03+2] wikimedia.org: Add list.wikimedia.org pointing to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1191131 (owner: 10Ladsgroup) [19:15:22] !log ladsgroup@dns1004 START - running authdns-update [19:15:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [19:16:36] (03CR) 10Andrew Bogott: Cloudcephosd1050: Configure ceph with a single nic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [19:16:44] !log ladsgroup@dns1004 END - running authdns-update [19:16:58] (03PS1) 10Dzahn: admin: upgrade ericmill from ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) [19:17:11] (03CR) 10BCornwall: ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191130 
(owner: 10Ladsgroup) [19:17:13] (03CR) 10CI reject: [V:04-1] admin: upgrade ericmill from ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) (owner: 10Dzahn) [19:23:43] (03CR) 10Ladsgroup: [V:03+2 C:03+2] ncredir: Add list.wikimedia.org redirecting to lists.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191130 (owner: 10Ladsgroup) [19:25:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11211670 (10phaultfinder) [19:26:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191144 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [19:26:39] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) (owner: 10Dzahn) [19:32:48] (03PS2) 10Dzahn: admin: upgrade ericmill from ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) [19:33:23] (03CR) 10CI reject: [V:04-1] admin: upgrade ericmill from ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) (owner: 10Dzahn) [19:37:00] (03PS3) 10Dzahn: admin: upgrade ericmill from ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) [19:37:44] (03CR) 10CI reject: [V:04-1] admin: upgrade ericmill from ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) (owner: 10Dzahn) [19:39:38] (03PS4) 10Dzahn: admin: upgrade ericmill from 
ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) [19:41:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:41:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:42:03] (03CR) 10Dzahn: "though the only reason these files are in their own universe is because we never changed them when every other file was changed and then s" [puppet] - 10https://gerrit.wikimedia.org/r/1180255 (owner: 10BCornwall) [19:43:15] (03CR) 10Dzahn: [C:03+2] admin: upgrade ericmill from ldap_only to a-privatedata with no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191147 (https://phabricator.wikimedia.org/T404903) (owner: 10Dzahn) [19:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:44:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11211728 (10phaultfinder) [19:51:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:51:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:27] I can also self-deploy [20:01:27] FIRING: BFDdown: BFD session down between cr2-eqord and 208.80.154.208 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:01:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:01:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190994 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [20:01:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1190995 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [20:02:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11211749 (10Dzahn) All that being said, @EMill-WMF , you have been upgraded from "level 1" to "level 2" in this: https://wikitech.wikim... 
[20:02:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:03:56] (03Merged) 10jenkins-bot: DbFactory: Use primary DB when running maintenance scripts [extensions/Flow] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190994 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [20:04:10] (03Merged) 10jenkins-bot: DbFactory: Use primary DB when running maintenance scripts [extensions/Flow] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1190995 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [20:04:53] !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1190994|DbFactory: Use primary DB when running maintenance scripts (T405080)]], [[gerrit:1190995|DbFactory: Use primary DB when running maintenance scripts (T405080)]] [20:04:53] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [20:04:53] FIRING: [11x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11211759 (10phaultfinder) [20:05:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11211760 (10Dzahn) 05In progress→03Open a:03EMill-WMF Could you confirm if things you expected to work are working now? Thanks! 
[20:06:10] RESOLVED: BFDdown: BFD session down between cr2-eqord and 208.80.154.208 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:07:09] !log Remove kibana.discovery.wmnet from Puppet CA - T364622 [20:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:15] T364622: Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622 [20:08:33] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622#11211768 (10andrea.denisse) >>! In T364622#11210082, @MoritzMuehlenhoff wrote: >>>! In T... [20:09:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405380#11211782 (10phaultfinder) [20:10:44] !log esanders@deploy1003 esanders: Backport for [[gerrit:1190994|DbFactory: Use primary DB when running maintenance scripts (T405080)]], [[gerrit:1190995|DbFactory: Use primary DB when running maintenance scripts (T405080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:10:50] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [20:11:30] !log esanders@deploy1003 esanders: Continuing with sync [20:12:43] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [20:13:12] 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2025/26-Q1), 13Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11211785 (10Andrew) @fgiunchedi, 1050 and 1051 should already be fully puppetized with Ceph... [20:15:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for ericmill - https://phabricator.wikimedia.org/T404903#11211802 (10Novem_Linguae) FYI, I've filed {T405517}, which might be a good spot to continue that part of the discussion. [20:16:07] (03PS1) 10JHathaway: Upstream release v11.8.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1191160 [20:16:28] !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1190994|DbFactory: Use primary DB when running maintenance scripts (T405080)]], [[gerrit:1190995|DbFactory: Use primary DB when running maintenance scripts (T405080)]] (duration: 11m 45s) [20:17:07] mfossati: all done here [20:17:15] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for sretest2009 - cmooney@cumin1003" [20:17:29] cool cool, I've just pushed the button! 
:-) [20:17:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191144 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [20:17:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for sretest2009 - cmooney@cumin1003" [20:17:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:58:40] FIRING: SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250924T2200) [22:05:05] 10ops-codfw, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405530 (10phaultfinder) 03NEW [22:09:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212268 (10phaultfinder) [22:10:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11212269 (10phaultfinder) [22:15:58] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:16:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:19:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:19:50] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:19:54] FIRING: PuppetFailure: Puppet has failed on wdqs2016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:20:58] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:24:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212309 (10phaultfinder) [22:25:58] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming 
Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:30:58] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:35:33] (03CR) 10Zabe: "I guess this needs to be communicated prior to deploying?" [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe) [22:38:36] (03CR) 10RLazarus: [V:03+1 C:03+2] envoyproxy, services_proxy: Update configuration for Envoy 1.29 [puppet] - 10https://gerrit.wikimedia.org/r/1190791 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [22:39:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191135 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [22:40:58] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:40:59] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on Wikivoyage and Wikiversity (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191135 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [22:41:29] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1191135|Disable wmgUseMdotRouting on 
Wikivoyage and Wikiversity (group1) (T403510)]] [22:41:36] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [22:42:17] FIRING: [22x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:44:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405378#11212337 (10phaultfinder) [22:45:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:46:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:46:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:48:03] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1191135|Disable wmgUseMdotRouting on Wikivoyage and Wikiversity (group1) (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:48:09] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [22:48:35] !log krinkle@deploy1003 krinkle: Continuing with sync [22:53:23] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1191135|Disable wmgUseMdotRouting on Wikivoyage and Wikiversity (group1) (T403510)]] (duration: 11m 53s) [22:53:30] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [22:56:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:56:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405402#11212379 (10phaultfinder) [23:00:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212380 (10phaultfinder) [23:00:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:00:58] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:03:25] FIRING: [2x] SystemdUnitFailed: dnsmasq.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:03:26] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [23:08:05] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns an-worker - jclark@cumin1002" [23:08:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns an-worker - jclark@cumin1002" [23:08:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:10:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1209.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:11:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1210.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:11:28] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1211.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:11:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1210.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:11:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1210.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED 
[23:12:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1210.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:13:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1212.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:13:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1213.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:13:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:13:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1213.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:13:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:14:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1210.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:14:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11212443 (10phaultfinder) [23:15:08] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:15:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:15:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1213.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:15:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:15:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:16:07] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:16:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:18:44] (03PS1) 10Stevemunene: idp: Add dummy data for airflow-wikidata [labs/private] - 10https://gerrit.wikimedia.org/r/1191190 (https://phabricator.wikimedia.org/T404073) [23:19:40] (03CR) 10Stevemunene: "Done on If86b860a8fbe90a1f76484a8ade81164f120d273" [puppet] - 10https://gerrit.wikimedia.org/r/1190979 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [23:20:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:21:08] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1215 [23:24:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1215 [23:24:22] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1216 [23:25:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1216 [23:25:49] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1217 [23:27:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1217 [23:27:16] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1218 [23:28:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1218 [23:28:42] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1219 [23:29:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1219 [23:30:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405377#11212492 (10phaultfinder) [23:30:01] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1220 [23:31:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1220 [23:31:13] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1221 [23:32:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1221 [23:32:35] !log jclark@cumin1002 
START - Cookbook sre.network.configure-switch-interfaces for host an-worker1222 [23:33:27] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:33:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1222 [23:33:43] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1223 [23:34:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1223 [23:34:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1224 [23:36:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1224 [23:36:05] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1225 [23:37:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1225 [23:37:33] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1226 [23:38:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1191192 [23:38:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1191192 (owner: 10TrainBranchBot) [23:38:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1226 [23:38:42] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1227 [23:39:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
an-worker1227 [23:39:54] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1228 [23:40:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1210.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:40:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1212.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:41:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1228 [23:41:29] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1229 [23:41:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1213.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1209.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1211.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:42:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1229 [23:42:50] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1230 [23:43:27] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs2016:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:43:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
an-worker1230 [23:44:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1215.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:44:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1214.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:45:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1216.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:45:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1217.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:45:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1217.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:46:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1220.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:46:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1217.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:47:01] jclark@cumin1002 provision (PID 1786319) is awaiting input [23:47:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1218.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:47:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
an-worker1218.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:47:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1219.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:48:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1218.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:49:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1231 [23:50:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1216.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:51:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1231 [23:51:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1216.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:51:33] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1232 [23:52:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1191192 (owner: 10TrainBranchBot) [23:52:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1216.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:52:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1232 [23:53:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1221.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:53:14] !log jclark@cumin1002 START - Cookbook 
sre.hosts.provision for host an-worker1222.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:55:41] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11212528 (10Jclark-ctr)