[00:39:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/912400 [00:39:17] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/912400 (owner: 10TrainBranchBot) [00:56:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/912400 (owner: 10TrainBranchBot) [00:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [01:06:49] 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10jcrespo) I've left the full error at /home/jynus/corrupt_index on both hosts (although it should be also on the error log): ` db1106# mariadb -t -e "SHOW REPLICA STATUS\G" > /home/jynus/corrupt_index... [02:01:33] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:09:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [02:40:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:45:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:00:03] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.28% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [03:13:07] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:05] 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10Marostegui) Don't do anything. I want to report this upstream [04:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [05:39:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:49:46] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:54:46] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:57:35] PROBLEM - Host ores1002 is DOWN: PING CRITICAL - Packet loss = 100% [06:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230430T0700) [07:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.237% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:42:03] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:19] !log powercycle ores1002 (mgmt console tty not usable, host frozen) [08:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:47] RECOVERY - Host ores1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [08:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [09:56:25] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [09:58:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [11:07:54] (03PS1) 10Majavah: openstack: envscript: do not set a default for clouds_file [puppet] - 10https://gerrit.wikimedia.org/r/913660 (https://phabricator.wikimedia.org/T330759) [11:07:56] (03PS1) 10Majavah: openstack: envscript: update default port [puppet] - 10https://gerrit.wikimedia.org/r/913661 [11:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.143% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:09:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40966/console" [puppet] - 10https://gerrit.wikimedia.org/r/913661 (owner: 10Majavah) [11:21:10] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): Set cookie in Varnish to start a probe - https://phabricator.wikimedia.org/T335637 (10JameelKaisar) [11:41:54] (03PS1) 10Ladsgroup: db1132: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/913662 (https://phabricator.wikimedia.org/T335632) [11:42:36] 10SRE, 10DBA, 10Patch-For-Review: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10Ladsgroup) a:05Ladsgroup→03Marostegui >>! In T335632#8815044, @Marostegui wrote: > Don't do anything. I want to report this upstream Awesome. Less work for me :P [12:06:04] (03PS1) 10Majavah: P:toolforge::prometheus: enable mod_rewrite [puppet] - 10https://gerrit.wikimedia.org/r/913664 [12:08:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40967/console" [puppet] - 10https://gerrit.wikimedia.org/r/913664 (owner: 10Majavah) [12:16:15] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:17:47] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:26:55] PROBLEM - Host db2184 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:53] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:41:03] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:13:03] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Marostegui) @Papaul can we order a DIMM? [14:16:14] (03PS1) 10Marostegui: db2184: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/913672 (https://phabricator.wikimedia.org/T335640) [14:17:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2184.codfw.wmnet with reason: Host down T335640 [14:17:30] T335640: db2184 down - https://phabricator.wikimedia.org/T335640 [14:17:35] (03CR) 10Marostegui: [C: 03+2] db2184: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/913672 (https://phabricator.wikimedia.org/T335640) (owner: 10Marostegui) [14:17:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2184.codfw.wmnet with reason: Host down T335640 [14:19:00] ACKNOWLEDGEMENT - SSH on db2184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui https://phabricator.wikimedia.org/T335640 https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:19:00] ACKNOWLEDGEMENT - Host db2184 is DOWN: PING CRITICAL - Packet loss = 100% Marostegui https://phabricator.wikimedia.org/T335640 [14:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [15:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.05% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [17:46:13] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Papaul) a:03Jhancock.wm [17:46:59] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Papaul) @Jhancock.wm can you order a DIMM using your Dell Tech direct account? Thanks [18:16:20] (03PS1) 10Raymond Ndibe: webservice: prefix wmcs-project where neccessary [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/913681 (https://phabricator.wikimedia.org/T334657) [18:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [19:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.957% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:14:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:24:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:25:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:25:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:26:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:25:29] PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:31:01] (03PS1) 10Andrew Bogott: mwopenstackclients3.py: add the ability to load auth creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913684 [21:38:31] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [22:53:35] if you just got paged it's yesterday's incident, I really thought I resolved it but apparently not [22:53:39] nothing new needs doing [23:08:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.863% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:32:57] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 54436 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [23:35:13] (SystemdUnitFailed) firing: puppet-agent-timer.service Failed on apifeatureusage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:03] PROBLEM - Check systemd state on apifeatureusage2001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:07] RECOVERY - Check systemd state on apifeatureusage2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state