[00:01:22] <icinga-wm>	 PROBLEM - Disk space on ores1006 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops
[00:09:48] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 87.88 ms
[00:09:48] <icinga-wm>	 RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:09:56] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING WARNING - Packet loss = 90%, RTA = 341.82 ms
[00:51:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:55:42] <icinga-wm>	 PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:30] <icinga-wm>	 PROBLEM - Disk space on ores2003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops
[01:26:32] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.35% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[01:34:19] <wikibugs>	 (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154) (owner: 104nn1l2)
[02:36:40] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s1 on db1139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1141, Errmsg: Error There is no such grant defined for user wikiadmin on host 10.% on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:09:32] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:09:44] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:44] <icinga-wm>	 RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:32] <icinga-wm>	 PROBLEM - Host analytics1071 is DOWN: PING CRITICAL - Packet loss = 100%
[04:16:22] <icinga-wm>	 PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:00] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:30] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:25:22] <icinga-wm>	 RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:25:54] <icinga-wm>	 RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:34] <icinga-wm>	 RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:08] <icinga-wm>	 RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:58] <icinga-wm>	 PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:18:20] <icinga-wm>	 RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:51:50] <elukey>	 !log move ores coredump files from /var/cache/tmp to /srv/coredumps on ores100[6,7,8] and ores2003 to free space on the root partition
[09:51:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:17] <elukey>	 I am going to open a task to follow up, need to go now
[09:54:24] <icinga-wm>	 RECOVERY - Disk space on ores1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops
[09:56:51] <elukey>	 !log powercycle analytics1071, soft lockup stacktraces in the tty
[09:56:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:14] <icinga-wm>	 RECOVERY - Host analytics1071 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[10:02:08] <icinga-wm>	 RECOVERY - Disk space on ores1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1007&var-datasource=eqiad+prometheus/ops
[10:05:36] <icinga-wm>	 RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops
[10:13:02] <icinga-wm>	 RECOVERY - Disk space on ores2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops
[10:47:07] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:52:40] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.64 ms
[10:58:41] <wikibugs>	 10SRE: Persistent "429, Too Many Requests" for Commons AV1 thumbnail - https://phabricator.wikimedia.org/T296562 (10Peachey88)
[10:58:56] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:52] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:11] <wikibugs>	 (03PS3) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305
[11:17:02] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:23:12] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms
[11:45:16] <icinga-wm>	 PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:11] <elukey>	 !log drop ores coredumps from ores1008
[11:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:28] <icinga-wm>	 RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:54] <wikibugs>	 (03PS1) 10Elukey: ores::worker: disable coredumps [puppet] - 10https://gerrit.wikimedia.org/r/742193 (https://phabricator.wikimedia.org/T296563)
[11:52:07] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32688/console" [puppet] - 10https://gerrit.wikimedia.org/r/742193 (https://phabricator.wikimedia.org/T296563) (owner: 10Elukey)
[11:52:17] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] ores::worker: disable coredumps [puppet] - 10https://gerrit.wikimedia.org/r/742193 (https://phabricator.wikimedia.org/T296563) (owner: 10Elukey)
[11:55:16] <elukey>	 !log disable coredumps for ORES celery units (will cause a roll restart of all celeries) - T296563
[11:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:25] <wikibugs>	 10SRE, 10SRE-swift-storage: Persistent "429, Too Many Requests" for Commons AV1 thumbnail - https://phabricator.wikimedia.org/T296562 (10RhinosF1)
[12:07:36] <icinga-wm>	 PROBLEM - Disk space on ores1009 is CRITICAL: DISK CRITICAL - free space: / 978 MB (2% inode=96%): /tmp 978 MB (2% inode=96%): /var/tmp 978 MB (2% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops
[12:10:37] <elukey>	 !log drop /var/tmp/core files from ores1009, root partition full
[12:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:26] <elukey>	 !log drop /var/tmp/core files from ores100[2,4] root partition full
[12:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:42] <icinga-wm>	 RECOVERY - Disk space on ores1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops
[12:37:30] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:10:26] <wikibugs>	 10SRE, 10SRE-swift-storage: Persistent "429, Too Many Requests" for Commons AV1 thumbnail - https://phabricator.wikimedia.org/T296562 (10ToBeFree) I'll probably revert the file to the previous revision sometime in the next week, as the filesize is a bit excessive and the codec currently lacks hardware decoding...
[13:29:16] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:47:58] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[14:02:46] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:08:58] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.58 ms
[14:10:40] <majavah>	 can anyone downtime that? I think there's a task already
[14:12:50] <zabe>	 task: T296540
[14:12:51] <stashbot>	 T296540: logstash2028.mgmt flapping - https://phabricator.wikimedia.org/T296540
[14:39:42] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:14:46] <wikibugs>	 (03PS6) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247)
[15:15:02] <wikibugs>	 (03CR) 10Majavah: openstack: refactor puppetmaster access (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah)
[15:38:55] <wikibugs>	 10SRE, 10LDAP: Change LDAP username? - https://phabricator.wikimedia.org/T296429 (10Aklapper)
[15:47:07] <wikibugs>	 (03PS1) 10Majavah: hieradata: Remove beta hosts from cache_hosts [puppet] - 10https://gerrit.wikimedia.org/r/742210
[15:47:22] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah)
[15:50:33] <wikibugs>	 (03PS2) 10Majavah: hieradata: Remove beta hosts from cache_hosts [puppet] - 10https://gerrit.wikimedia.org/r/742210
[15:50:41] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah)
[15:52:02] <wikibugs>	 (03PS1) 10Majavah: hieradata: remove old project-proxies [puppet] - 10https://gerrit.wikimedia.org/r/742211
[16:26:18] <icinga-wm>	 PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:29:08] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:49] <wikibugs>	 (03PS1) 10Majavah: set up tls termination on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829)
[16:32:07] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah)
[16:35:42] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:52] <wikibugs>	 (03PS1) 10Majavah: P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214
[16:36:40] <wikibugs>	 (03PS2) 10Majavah: P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829)
[16:59:26] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:50] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:27:20] <icinga-wm>	 RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:29:20] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:42] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:58] <godog>	 those are new hosts, bug known ^ (I'll silence)
[18:58:56] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:46] <logmsgbot>	 !log andrew@deploy1002 Started deploy [horizon/deploy@6115b3b]: network UI tests in codfw1dev
[19:45:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:48] <logmsgbot>	 !log andrew@deploy1002 Finished deploy [horizon/deploy@6115b3b]: network UI tests in codfw1dev (duration: 02m 01s)
[19:47:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:06] <logmsgbot>	 !log andrew@deploy1002 Started deploy [horizon/deploy@6115b3b]: network UI updates for T296548
[19:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:10] <stashbot>	 T296548: [openstack.horizon] Going to the interfaces tab of puppet-diffs/pcc-db1001 throws 500 - https://phabricator.wikimedia.org/T296548
[19:55:20] <logmsgbot>	 !log andrew@deploy1002 Finished deploy [horizon/deploy@6115b3b]: network UI updates for T296548 (duration: 04m 14s)
[19:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:23] <wikibugs>	 10SRE, 10Release-Engineering-Team, 10cloud-services-team (Kanban): Consider removing labnet-users group - https://phabricator.wikimedia.org/T296574 (10Majavah)
[22:35:23] <wikibugs>	 10SRE, 10Internet-Archive, 10InternetArchiveBot: Determine appropriate API request limits for InternetArchiveBot - https://phabricator.wikimedia.org/T296577 (10Harej)
[23:35:00] <wikibugs>	 10SRE, 10Internet-Archive, 10InternetArchiveBot: Determine appropriate API request limits for InternetArchiveBot - https://phabricator.wikimedia.org/T296577 (10AntiCompositeNumber) Some context: https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20211122.txt