[00:01:22] PROBLEM - Disk space on ores1006 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [00:09:48] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 87.88 ms [00:09:48] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:09:56] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING WARNING - Packet loss = 90%, RTA = 341.82 ms [00:51:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:55:42] PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:30] PROBLEM - Disk space on ores2003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [01:26:32] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.35% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [01:34:19] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154) (owner: 104nn1l2) [02:36:40] PROBLEM - MariaDB Replica SQL: s1 on db1139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1141, Errmsg: Error There is no such grant defined for user wikiadmin on host 10.% on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:09:32] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:09:44] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:44] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:32] PROBLEM - Host analytics1071 is DOWN: PING CRITICAL - Packet loss = 100% [04:16:22] PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:32:00] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:30] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:25:22] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:25:54] RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:26:34] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:08] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:58] PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:18:20] RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:51:50] !log move ores coredump files from /var/cache/tmp to /srv/coredumps on ores100[6,7,8] and ores2003 to free space on the root partition [09:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:17] I am going to open a task to follow up, need to go now [09:54:24] RECOVERY - Disk space on ores1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [09:56:51] !log powercycle analytics1071, soft lockup stacktraces in the tty [09:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:14] RECOVERY - Host analytics1071 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [10:02:08] RECOVERY - Disk space on ores1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1007&var-datasource=eqiad+prometheus/ops [10:05:36] RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [10:13:02] RECOVERY - Disk space on ores2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [10:47:07] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:52:40] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.64 ms [10:58:41] 10SRE: Persistent "429, Too Many Requests" for Commons AV1 thumbnail - https://phabricator.wikimedia.org/T296562 (10Peachey88) [10:58:56] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:52] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:11] (03PS3) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 [11:17:02] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:23:12] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [11:45:16] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:11] !log drop ores coredumps from ores1008 [11:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:28] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:54] (03PS1) 10Elukey: ores::worker: disable coredumps [puppet] - 10https://gerrit.wikimedia.org/r/742193 (https://phabricator.wikimedia.org/T296563) [11:52:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32688/console" [puppet] - 10https://gerrit.wikimedia.org/r/742193 (https://phabricator.wikimedia.org/T296563) (owner: 10Elukey) [11:52:17] (03CR) 10Elukey: [V: 03+1 C: 03+2] ores::worker: disable coredumps [puppet] - 10https://gerrit.wikimedia.org/r/742193 (https://phabricator.wikimedia.org/T296563) (owner: 10Elukey) [11:55:16] !log disable coredumps for ORES celery units (will cause a roll restart of all celeries) - T296563 [11:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:25] 10SRE, 10SRE-swift-storage: Persistent "429, Too Many Requests" for Commons AV1 thumbnail - https://phabricator.wikimedia.org/T296562 (10RhinosF1) [12:07:36] PROBLEM - Disk space on ores1009 is CRITICAL: DISK CRITICAL - free space: / 978 MB (2% inode=96%): /tmp 978 MB (2% inode=96%): /var/tmp 978 MB (2% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [12:10:37] !log drop /var/tmp/core files from ores1009, root partition full [12:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:26] !log drop /var/tmp/core files from ores100[2,4] root partition full [12:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:42] RECOVERY - Disk space on ores1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [12:37:30] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:10:26] 10SRE, 10SRE-swift-storage: Persistent "429, Too Many Requests" for Commons AV1 thumbnail - https://phabricator.wikimedia.org/T296562 (10ToBeFree) I'll probably revert the file to the previous revision sometime in the next week, as the filesize is a bit excessive and the codec currently lacks hardware decoding... [13:29:16] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:47:58] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [14:02:46] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:08:58] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.58 ms [14:10:40] can anyone downtime that? I think there's a task already [14:12:50] task: T296540 [14:12:51] T296540: logstash2028.mgmt flapping - https://phabricator.wikimedia.org/T296540 [14:39:42] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:14:46] (03PS6) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [15:15:02] (03CR) 10Majavah: openstack: refactor puppetmaster access (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [15:38:55] 10SRE, 10LDAP: Change LDAP username? - https://phabricator.wikimedia.org/T296429 (10Aklapper) [15:47:07] (03PS1) 10Majavah: hieradata: Remove beta hosts from cache_hosts [puppet] - 10https://gerrit.wikimedia.org/r/742210 [15:47:22] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah) [15:50:33] (03PS2) 10Majavah: hieradata: Remove beta hosts from cache_hosts [puppet] - 10https://gerrit.wikimedia.org/r/742210 [15:50:41] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah) [15:52:02] (03PS1) 10Majavah: hieradata: remove old project-proxies [puppet] - 10https://gerrit.wikimedia.org/r/742211 [16:26:18] PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:29:08] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:49] (03PS1) 10Majavah: set up tls termination on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829) [16:32:07] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah) [16:35:42] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:52] (03PS1) 10Majavah: P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214 [16:36:40] (03PS2) 10Majavah: P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829) [16:59:26] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:50] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:20] RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:29:20] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:42] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:58] those are new hosts, bug known ^ (I'll silence) [18:58:56] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:46] !log andrew@deploy1002 Started deploy [horizon/deploy@6115b3b]: network UI tests in codfw1dev [19:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:48] !log andrew@deploy1002 Finished deploy [horizon/deploy@6115b3b]: network UI tests in codfw1dev (duration: 02m 01s) [19:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:06] !log andrew@deploy1002 Started deploy [horizon/deploy@6115b3b]: network UI updates for T296548 [19:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:10] T296548: [openstack.horizon] Going to the interfaces tab of puppet-diffs/pcc-db1001 throws 500 - https://phabricator.wikimedia.org/T296548 [19:55:20] !log andrew@deploy1002 Finished deploy [horizon/deploy@6115b3b]: network UI updates for T296548 (duration: 04m 14s) [19:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:23] 10SRE, 10Release-Engineering-Team, 10cloud-services-team (Kanban): Consider removing labnet-users group - https://phabricator.wikimedia.org/T296574 (10Majavah) [22:35:23] 10SRE, 10Internet-Archive, 10InternetArchiveBot: Determine appropriate API request limits for InternetArchiveBot - https://phabricator.wikimedia.org/T296577 (10Harej) [23:35:00] 10SRE, 10Internet-Archive, 10InternetArchiveBot: Determine appropriate API request limits for InternetArchiveBot - https://phabricator.wikimedia.org/T296577 (10AntiCompositeNumber) Some context: https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20211122.txt