[00:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1141420 [00:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1141420 (owner: 10TrainBranchBot) [00:28:19] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1141420 (owner: 10TrainBranchBot) [00:31:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:37:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:51:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/ec029b5d31610711a938cfb308c230ffb304ec615ad4f6ddaf6b0516d9d13344/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:11:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:22:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:24:27] (03CR) 10Hamish: [C:03+1] Add checkuserwiki favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141089 (https://phabricator.wikimedia.org/T393246) (owner: 10Bunnypranav) [01:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:24:36] (03CR) 10TChin: [C:03+1] Stream config for edge uniques on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140784 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:22:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:57:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:05:40] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141425 [06:07:50] 10ops-eqiad, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296 (10Marostegui) 03NEW [06:08:19] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10789349 (10Marostegui) 05Open→03Resolved Let's keep a task per crash, so I'm closing this and opened T393296 [06:10:53] 10ops-eqiad, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10789353 (10Marostegui) We cannot keep working and impacting production users like this anymore. We need a proper solution from Dell [06:28:38] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140711 (owner: 10PipelineBot) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250504T0700) [07:12:18] (03CR) 10Dreamy Jazz: Graph: Use new placeholder i18n from WikimediaMessages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [07:18:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:23] (03PS1) 10Novem Linguae: core-Permissions: add move-subpages to enwiki templateeditor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167) [07:44:54] (03CR) 10Novem Linguae: "There was one objection to my RFC close, but it appears to have been an outlier. After some discussion at https://en.wikipedia.org/wiki/Wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141034 (https://phabricator.wikimedia.org/T393167) (owner: 10Novem Linguae) [08:27:24] 06SRE, 07SEO: Index pl.wikinews in Google Publisher Center - https://phabricator.wikimedia.org/T393288#10789559 (10Reedy) [08:45:13] 06SRE, 07SRE-Unowned, 07SEO: Index pl.wikinews in Google Publisher Center - https://phabricator.wikimedia.org/T393288#10789580 (10Aklapper) cf T254437 [09:22:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:50:34] PROBLEM - Disk space on contint1002 is CRITICAL: DISK CRITICAL - free space: /srv 35593 MB (5% inode=85%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=contint1002&var-datasource=eqiad+prometheus/ops [09:52:05] (03PS1) 10Krinkle: actions: Fix handling of redirects to known (non-existing) pages [core] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141460 [09:52:19] (03CR) 10Krinkle: [C:03+2] actions: Fix handling of redirects to known (non-existing) pages [core] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141460 (owner: 10Krinkle) [09:52:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141460 (owner: 10Krinkle) [09:56:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:59:08] ^ > "DBUnexpectedError: Database servers in extension1 are overloaded." cc Amir1 [09:59:28] 30,000 in last 15min, mostly jobrunner and api-int [09:59:29] (03CR) 10Jforrester: "Should I0b412ef2aa7c2aac35747f3a4724848a3fee1df6 land as part of this?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli) [10:01:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:05:49] (03Merged) 10jenkins-bot: actions: Fix handling of redirects to known (non-existing) pages [core] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1141460 (owner: 10Krinkle) [10:06:13] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1141460|actions: Fix handling of redirects to known (non-existing) pages]] [10:13:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:18:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:44] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1141460|actions: Fix handling of redirects to known (non-existing) pages]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:26:50] !log krinkle@deploy1003 krinkle: Continuing with sync [10:28:43] FIRING: [7x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:32:42] FIRING: [7x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:36:35] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141460|actions: Fix handling of redirects to known (non-existing) pages]] (duration: 30m 22s) [11:00:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:17:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:59:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:30] PROBLEM - Hadoop NodeManager on an-worker1200 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:04:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:12] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:19:30] RECOVERY - Hadoop NodeManager on an-worker1200 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:28:30] PROBLEM - Hadoop NodeManager on an-worker1203 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:32:30] RECOVERY - Hadoop NodeManager on an-worker1203 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:33:12] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:34:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:39:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:47:32] PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:49:46] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:58:51] (03PS1) 10Krinkle: noc: Fix "Class MWMultiVersion not found" in wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141516 [12:58:51] (03PS1) 10Krinkle: tests: Remove one-off test-only getDblistsUsedInSettings() and isWikiFamily() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141517 [12:58:51] (03PS1) 10Krinkle: multiversion: Update readDbListFile() calls from alias to WmfConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141518 [13:00:24] (03PS2) 10Krinkle: multiversion: Update readDbListFile() calls from alias to WmfConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141518 [13:00:32] RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:10:34] RECOVERY - Disk space on contint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=contint1002&var-datasource=eqiad+prometheus/ops [13:23:06] (03PS1) 10Krinkle: tests: Replace array_keys(wikiversions.json) with all.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141521 [13:23:06] (03PS1) 10Krinkle: multiversion: Move remaining dblist helper to WmfConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 [13:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:14] (03PS2) 10Krinkle: multiversion: Move remaining dblist helper to WmfConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 [13:45:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:24:04] (03PS6) 10Umherirrender: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) [15:24:22] (03CR) 10Umherirrender: "Rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [15:37:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140703 (https://phabricator.wikimedia.org/T393144) (owner: 10Abijeet Patro) [16:18:56] PROBLEM - Hadoop NodeManager on an-worker1192 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:19:56] RECOVERY - Hadoop NodeManager on an-worker1192 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:00] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [16:51:46] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:00] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:56:00] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:01:00] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:04:46] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:16:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:16:31] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:16:38] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1064-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [18:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 19.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:19:36] RESOLVED: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [18:20:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:33:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:44:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:45:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:46:39] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1064-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [18:51:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [18:51:26] RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [18:57:50] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [18:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 17.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:59:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:00:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:02:50] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:03:39] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1057-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:03:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1064* for host appears to have hot shards - bking@cumin2002 [19:04:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1064* for host appears to have hot shards - bking@cumin2002 [19:04:20] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:04:50] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1057* for host appears to have hot shards - bking@cumin2002 [19:04:54] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1057* for host appears to have hot shards - bking@cumin2002 [19:05:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:10:30] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1057*,elastic1058* for hosts appear to have hot shards - bking@cumin2002 [19:10:36] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1057*,elastic1058* for hosts appear to have hot shards - bking@cumin2002 [19:12:50] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:13:39] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1057-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:17:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:17:50] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:23:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:28:39] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1062-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:35:45] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1062* for hosts appear to have hot shards - bking@cumin2002 [19:35:49] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1062* for hosts appear to have hot shards - bking@cumin2002 [19:43:29] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1063* for host appears to have hot shards - bking@cumin2002 [19:43:32] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1063* for host appears to have hot shards - bking@cumin2002 [19:43:38] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1062-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:45:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:08] FIRING: [2x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1062-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:54:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:55:09] RESOLVED: [2x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1062-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:55:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:58:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:00:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:06:57] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1056*,elastic1063* for host appears to have hot shards - bking@cumin2002 [20:07:04] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1056*,elastic1063* for host appears to have hot shards - bking@cumin2002 [20:08:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [20:12:09] FIRING: [2x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1056-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [20:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:08] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1056-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [20:24:20] RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:25:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:29:50] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [20:29:53] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [21:28:43] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:22:55] (03PS1) 10Andrew Bogott: ceph_disks.rb: don't try to strip an int [puppet] - 10https://gerrit.wikimedia.org/r/1141548 [22:33:43] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:52:07] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1001.eqiad.wmnet [22:57:29] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [23:01:50] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [23:02:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [23:02:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:02:22] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1001.eqiad.wmnet [23:02:41] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1002.eqiad.wmnet [23:08:00] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [23:14:24] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [23:15:20] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [23:15:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:15:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1002.eqiad.wmnet [23:16:36] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1003.eqiad.wmnet [23:17:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:20:29] (03PS1) 10Andrew Bogott: Remove puppet refs to cloudcephosd100[123] [puppet] - 10https://gerrit.wikimedia.org/r/1141551 (https://phabricator.wikimedia.org/T393196) [23:22:02] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [23:26:58] (03PS2) 10Andrew Bogott: Remove puppet refs to cloudcephosd100[123] [puppet] - 10https://gerrit.wikimedia.org/r/1141551 (https://phabricator.wikimedia.org/T393196) [23:27:03] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [23:27:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcephosd1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [23:27:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:27:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1003.eqiad.wmnet [23:27:31] (03CR) 10Andrew Bogott: [C:03+2] Remove puppet refs to cloudcephosd100[123] [puppet] - 10https://gerrit.wikimedia.org/r/1141551 (https://phabricator.wikimedia.org/T393196) (owner: 10Andrew Bogott) [23:29:17] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: decommission cloudcephosd100[1-3] - https://phabricator.wikimedia.org/T393196#10790607 (10Andrew) a:05Andrew→03None [23:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1141552 [23:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1141552 (owner: 10TrainBranchBot) [23:51:18] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1141552 (owner: 10TrainBranchBot)