[00:10:50] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:38:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119780
[00:38:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119780 (owner: 10TrainBranchBot)
[00:49:21] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119780 (owner: 10TrainBranchBot)
[00:56:50] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:57:30] PROBLEM - Host cr2-magru is DOWN: PING CRITICAL - Packet loss = 100%
[00:57:51] RECOVERY - Host cr2-magru is UP: PING OK - Packet loss = 0%, RTA = 144.81 ms
[01:09:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119782
[01:09:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119782 (owner: 10TrainBranchBot)
[01:15:43] (03CR) 10Gergő Tisza: [C:03+1] docroot: Add experimental assetlinks.json from and to various domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle)
[01:28:51] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119782 (owner: 10TrainBranchBot)
[01:35:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:45:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 361MiB (2% inode=33%): /tmp 361MiB (2% inode=33%): /var/tmp 361MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[02:05:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[02:08:48] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:30:38] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:46:02] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
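(Editor's note: the slave_sql_lag alert on db2141 above, which recovered at 02:08:48, points at the depooling runbook. As a rough illustration of what that lag number measures, here is a minimal Python sketch that reads lag directly from a replica; the host, user, and password are placeholders, it assumes the third-party pymysql library, and it is not the production check, which computes slave_sql_lag differently.)

# Hypothetical sketch: read replication lag from a MariaDB replica.
import pymysql

def replica_lag_seconds(host="db-replica.example", user="monitor", password="CHANGE_ME"):
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            # Newer MariaDB versions also accept SHOW REPLICA STATUS.
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return None if row is None else row.get("Seconds_Behind_Master")
    finally:
        conn.close()

if __name__ == "__main__":
    # The alert above fired at ~640 s of lag and cleared at ~0.3 s.
    print("replica lag:", replica_lag_seconds(), "seconds")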
[03:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:39:30] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002
[05:34:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10554488 (10phaultfinder)
[05:35:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10554490 (10phaultfinder)
[06:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10554494 (10phaultfinder)
[06:45:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 352MiB (2% inode=33%): /tmp 352MiB (2% inode=33%): /var/tmp 352MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[06:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:46:06] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:25:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[09:35:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 347MiB (2% inode=33%): /tmp 347MiB (2% inode=33%): /var/tmp 347MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
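(Editor's note: grafana2001's root filesystem flaps around 2% free space all day in this log. As an illustration of the threshold logic behind these DISK CRITICAL lines, here is a minimal Python sketch using only the standard library; the warning/critical percentages are illustrative assumptions, not the values configured for grafana2001, and the real check is the Icinga disk plugin, not this script.)

import shutil

def check_free(path="/", warn_pct=6.0, crit_pct=3.0):
    # Report free space on a mount point, mimicking the PROBLEM/RECOVERY wording above.
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    free_mib = usage.free // (1024 * 1024)
    if free_pct < crit_pct:
        state = "CRITICAL"
    elif free_pct < warn_pct:
        state = "WARNING"
    else:
        state = "OK"
    return f"DISK {state} - free space: {path} {free_mib}MiB ({free_pct:.0f}%)"

if __name__ == "__main__":
    print(check_free("/"))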
[09:35:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:55:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[10:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:46:02] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:59:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:02:20] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:57:28] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Provide a base image for Rust, based on Bookworm using 'rustc-web' now at 1.78 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester)
[13:57:44] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] "Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester)
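(Editor's note: the recurring SystemdUnitFailed alerts above name the host's node exporter endpoint on port 9100. As a sketch of where that signal comes from, this Python snippet scrapes a node exporter /metrics page and lists units it reports as failed via the node_systemd_unit_state metric; the hostname is a placeholder, it assumes the systemd collector is enabled, and it is not the alert's actual Prometheus rule.)

import re
import urllib.request

def failed_units(host="node.example", port=9100):
    # Parse the text exposition format and keep node_systemd_unit_state
    # samples that have state="failed" and a value of 1.
    url = f"http://{host}:{port}/metrics"
    text = urllib.request.urlopen(url, timeout=10).read().decode()
    failed = []
    sample = re.compile(r'node_systemd_unit_state\{([^}]*)\}\s+1(?:\.0)?$')
    for line in text.splitlines():
        m = sample.match(line)
        if m and 'state="failed"' in m.group(1):
            name = re.search(r'name="([^"]+)"', m.group(1))
            if name:
                failed.append(name.group(1))
    return failed

if __name__ == "__main__":
    # e.g. ['nginx.service'] for the relforge1004 alert above
    print(failed_units())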
[13:59:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:59:31] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:46:02] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:55:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 397MiB (2% inode=32%): /tmp 397MiB (2% inode=32%): /var/tmp 397MiB (2% inode=32%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:15:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[16:02:42] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:02:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:03:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 393MiB (2% inode=32%): /tmp 393MiB (2% inode=32%): /var/tmp 393MiB (2% inode=32%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
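(Editor's note: the JobUnavailable alert for the sidekiq job at 14:36:42, resolved at 15:06:42, is about Prometheus losing scrape targets. A minimal sketch of asking a Prometheus server how many targets for a job are currently up, via the standard /api/v1/query HTTP API; the server URL is a placeholder and this is not the alert's actual rule expression.)

import json
import urllib.parse
import urllib.request

def targets_up(prometheus="http://prometheus.example:9090", job="sidekiq"):
    # 'up' is 1 for every target Prometheus scraped successfully on its last attempt.
    query = f'sum(up{{job="{job}"}})'
    url = prometheus + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    print("targets up for job sidekiq:", targets_up())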
[16:10:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 5.785 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:10:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 5.836 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:10:52] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:25:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[17:35:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:50:22] PROBLEM - ElasticSearch unassigned shard check - 9200 on relforge1003 is CRITICAL: CRITICAL - itwiki_general_1680969161[6](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[7](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[3](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[0](2025-02-12T15:32:08.556Z), frwiki_general_1680852097[1](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[7](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[3](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[5](2025-02-12T15:32:08.552Z), frwiki_content_1680845169[3](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[5](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[1](2025-02-12T15:32:08.555Z), enwikiversity_content_1693187210[0](2025-02-12T15:32:08.549Z), itwiki_content_1680965136[2](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[6](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[5](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[0](2025-02-12T15:32:08.551Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10555009 (10phaultfinder)
[18:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:46:02] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
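(Editor's note: the HelmReleaseBadStatus alert that keeps refiring reports the eventgate-analytics/canary release stuck in pending-upgrade on k8s-staging, and links the emergency-rollback runbook. As a rough illustration only, this Python sketch reads a release's status from helm's JSON output; it assumes helm is on PATH and the active kubeconfig points at the right cluster, and the WMF runbook's helmfile-based procedure is what actually applies here.)

import json
import subprocess

def release_status(release="canary", namespace="eventgate-analytics"):
    # 'helm status -o json' includes info.status, e.g. "deployed" or "pending-upgrade".
    out = subprocess.run(
        ["helm", "status", release, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["info"]["status"]

if __name__ == "__main__":
    print("release status:", release_status())
    # A release stuck in pending-upgrade is typically rolled back to its previous
    # revision ("helm rollback <release> -n <namespace>"), but follow the linked
    # runbook rather than running that directly against a production cluster.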
[19:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:33:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[19:34:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[19:38:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:38:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[19:43:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[19:43:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[19:44:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[19:45:17] 06SRE, 10Wikimedia-Mailing-lists: Please enable anti-spam measures for Wikitech-l - https://phabricator.wikimedia.org/T386559#10555038 (10Ladsgroup) This can usually be done by the list admins; I have access, but I'd rather leave turning the spamassassin and other knobs to the admins themselves.
[20:31:45] 06SRE, 10Wikimedia-Mailing-lists: Please enable anti-spam measures for Wikitech-l - https://phabricator.wikimedia.org/T386559#10555056 (10A_smart_kitten) Pinging @Aklapper & @Platonides FYI, as list admins mentioned on https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
[21:35:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:55:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 376MiB (2% inode=32%): /tmp 376MiB (2% inode=32%): /var/tmp 376MiB (2% inode=32%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[22:12:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[22:45:38] (03PS1) 10Alexandros Kosiaris: Fix name of abstract wiki rest web image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1119831 (https://phabricator.wikimedia.org/T380807)
[22:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:46:02] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:47:27] (03PS2) 10Alexandros Kosiaris: Fix name of abstract wiki rust web image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1119831 (https://phabricator.wikimedia.org/T380807)
[22:48:23] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Fix name of abstract wiki rust web image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1119831 (https://phabricator.wikimedia.org/T380807) (owner: 10Alexandros Kosiaris)
[23:03:02] PROBLEM - ElasticSearch unassigned shard check - 9200 on relforge1005 is CRITICAL: CRITICAL - frwiki_content_1680845169[1](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[3](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[5](2025-02-12T15:32:08.555Z), itwiki_content_1680965136[2](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[6](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[5](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[0](2025-02-12T15:32:08.551Z), itwiki_general_1680969161[6](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[7](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[3](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[0](2025-02-12T15:32:08.556Z), frwiki_general_1680852097[5](2025-02-12T15:32:08.552Z), frwiki_general_1680852097[1](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[3](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[7](2025-02-12T15:32:08.551Z), enwikiversity_content_1693187210[0](2025-02-12T15:32:08.549Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
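(Editor's note: the unassigned-shard alerts on the relforge hosts list the affected indices and shard numbers. A minimal Python sketch of pulling the same information straight from the cluster via the _cat/shards API; the host and port are placeholders and this is not the Icinga check itself. The usual follow-up for the "why" is the _cluster/allocation/explain API.)

import json
import urllib.request

def unassigned_shards(host="elastic.example", port=9200):
    # _cat/shards returns one entry per shard copy; keep those in state UNASSIGNED.
    url = f"http://{host}:{port}/_cat/shards?format=json&h=index,shard,prirep,state"
    with urllib.request.urlopen(url, timeout=10) as resp:
        shards = json.load(resp)
    return [(s["index"], s["shard"]) for s in shards if s["state"] == "UNASSIGNED"]

if __name__ == "__main__":
    for index, shard in unassigned_shards():
        print(f"{index}[{shard}] is unassigned")
    # For the reason a given shard is unassigned, POST /_cluster/allocation/explain
    # with {"index": ..., "shard": ..., "primary": false} and read the explanation.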
[23:03:02] PROBLEM - ElasticSearch unassigned shard check - 9200 on relforge1007 is CRITICAL: CRITICAL - itwiki_general_1680969161[3](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[7](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[6](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[0](2025-02-12T15:32:08.556Z), frwiki_general_1680852097[1](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[7](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[3](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[5](2025-02-12T15:32:08.552Z), frwiki_content_1680845169[3](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[5](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[1](2025-02-12T15:32:08.555Z), itwiki_content_1680965136[2](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[6](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[5](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[0](2025-02-12T15:32:08.551Z), enwikiversity_content_1693187210[0](2025-02-12T15:32:08.549Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:11:21] 06SRE, 10Wikimedia-Mailing-lists: Please enable anti-spam measures for Wikitech-l - https://phabricator.wikimedia.org/T386559#10555125 (10Aklapper) @A_smart_kitten: https://meta.wikimedia.org/wiki/Mailing_lists/Administration#Spam_filters mentions an `X-Spam-Score` header. I don't see any such header in the re...
[23:11:38] PROBLEM - ElasticSearch unassigned shard check - 9200 on relforge1006 is CRITICAL: CRITICAL - itwiki_content_1680965136[2](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[5](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[6](2025-02-12T15:32:08.551Z), itwiki_content_1680965136[0](2025-02-12T15:32:08.551Z), frwiki_content_1680845169[1](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[3](2025-02-12T15:32:08.555Z), frwiki_content_1680845169[5](2025-02-12T15:32:08.555Z), enwikiversity_content_1693187210[0](2025-02-12T15:32:08.549Z), frwiki_general_1680852097[3](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[5](2025-02-12T15:32:08.552Z), frwiki_general_1680852097[7](2025-02-12T15:32:08.551Z), frwiki_general_1680852097[1](2025-02-12T15:32:08.551Z), itwiki_general_1680969161[3](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[6](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[7](2025-02-12T15:32:08.556Z), itwiki_general_1680969161[0](2025-02-12T15:32:08.556Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:12:07] 06SRE, 10Wikimedia-Mailing-lists: Please enable anti-spam measures for Wikitech-l - https://phabricator.wikimedia.org/T386559#10555126 (10Aklapper) > Perhaps only allow submissions from subscribers That's already the case, isn't it? > and require all subscriptions to be approved? I don't have time for that.
[23:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:19:54] (03PS1) 10Aklapper: AVA: Remove another unused variable [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1119834
[23:28:36] (03PS1) 10Aklapper: Move some code much closer to where it's used; fix a comment [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1119837
[23:56:14] (03PS1) 10Aklapper: Use consistent names and prefix for config variables [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1119846