[00:15:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [00:40:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926548 [00:40:11] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926548 (owner: 10TrainBranchBot) [00:42:13] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926548 (owner: 10TrainBranchBot) [01:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:27:45] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:40:14] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [04:53:16] (03CR) 10RLazarus: [C: 03+2] opentelemetry-collector: New chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/925015 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [04:53:53] (03Merged) 10jenkins-bot: opentelemetry-collector: New chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/925015 (https://phabricator.wikimedia.org/T324117) (owner: 10RLazarus) [07:10:39] 10SRE, 10Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (10Nemo_bis) https://switch2osm.org/providers/ has a list of commercial providers you can rely on. [07:11:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:16:53] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [07:18:19] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:41:05] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:44:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [09:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:00:59] PROBLEM - SSH on cloudbackup2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:39:43] RECOVERY - SSH on cloudbackup2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:18:45] 10SRE, 10Maps: Allow Wikimedia Maps usage on Mobile Application written with Qt - https://phabricator.wikimedia.org/T338083 (10Mohammad.sherafat.41) [11:55:21] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [11:56:49] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:28:45] 10SRE, 10serviceops, 10Performance-Team (Radar): k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10Krinkle) [12:29:35] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wik [12:31:03] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:44:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [12:57:09] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 1726 MB (3% inode=97%): /tmp 1726 MB (3% inode=97%): /var/tmp 1726 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [13:28:42] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs2012.codfw.wmnet [13:28:42] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs2012.codfw.wmnet [13:29:59] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:25] PROBLEM - puppet last run on wdqs2012 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:41:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on an-test-worker1001.eqiad.wmnet with reason: Host under testing/upgrade [13:41:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on an-test-worker1001.eqiad.wmnet with reason: Host under testing/upgrade [13:52:03] RECOVERY - puppet last run on wdqs2012 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:16] (03PS1) 10Krinkle: Profiler: Enable excimer-ui in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926670 [16:24:23] (03PS2) 10Krinkle: Profiler: Enable excimer-ui in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926670 [16:26:43] (03CR) 10Krinkle: [C: 03+2] Profiler: Enable excimer-ui in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926670 (owner: 10Krinkle) [16:27:37] (03Merged) 10jenkins-bot: Profiler: Enable excimer-ui in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926670 (owner: 10Krinkle) [16:44:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [17:05:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:56] 10SRE, 10Anti-Harassment, 10Data-Engineering, 10Traffic, and 2 others: Include User-Agent Client Hints in WebRequest logs - https://phabricator.wikimedia.org/T337947 (10kostajh) [20:44:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [20:52:17] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [20:53:41] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3754 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/Docker