[03:58:20] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Nuria) >IIRC we have some basic alarms on thresholds in Kafka topics, but not in refined event tables.  Did the partition checker not alarm when p...
[08:47:47] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) The biggest tables are:  ` 36G _log_Popups_16364296_main_48b5_2_1d_B_0.tokudb 50G _log_MediaViewer_10867062_main_5026_2_1d_B_0.tokudb 6...
[14:33:01] <wikibugs>	 Analytics, Analytics-Kanban: Add TLS encryption support to Kafkatee and enable it where possible - https://phabricator.wikimedia.org/T250147 (elukey) p:Triage→Medium
[14:33:10] <wikibugs>	 Analytics, Analytics-Kanban: Add TLS encryption support to Kafkatee and enable it where possible - https://phabricator.wikimedia.org/T250147 (elukey)
[16:15:54] <wikibugs>	 Analytics: jmads requesting Kerberos password - https://phabricator.wikimedia.org/T250560 (jmads)
[20:17:46] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1082 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:26:24] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1090 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:49:09] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1088 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:50:05] <elukey>	 whattt
[20:52:44] <elukey>	 on 1082 all disks saturated
[20:54:44] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1088 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:57:45] <elukey>	 I think there is a job that creates a lot of temp files or similar
[20:57:46] <elukey>	 https://grafana.wikimedia.org/d/000000377/host-overview?panelId=6&fullscreen&orgId=1&refresh=5m&var-server=an-worker1082&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-12h&to=now
[21:01:12] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:01:40] <elukey>	 ah wait this is bad
[21:01:41] <elukey>	 https://grafana.wikimedia.org/d/000000377/host-overview?panelId=12&fullscreen&orgId=1&refresh=5m&var-server=an-worker1082&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-2d&to=now
[21:01:58] <elukey>	 we are already at a dangerous zone, 90% in the past two days
[21:02:52] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 15 GB (0% inode=99%): /var/lib/hadoop/data/d 31 GB (0% inode=99%): /var/lib/hadoop/data/l 24 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:08:28] <elukey>	 breakdown is
[21:08:28] <elukey>	 213.4 G  641.4 G  /tmp
[21:08:29] <elukey>	 106.6 T  319.5 T  /user
[21:08:29] <elukey>	 15.9 T   47.7 T   /var
[21:08:29] <elukey>	 555.3 T  1.6 P    /wmf
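(Editor's note: the breakdown above has the shape of `hdfs dfs -du -h /` output, where the first column is the logical size and the second is the space consumed across all replicas. The sketch below is a minimal, hypothetical helper, not something run in the channel; it parses those four lines and prints the ratio between the two columns. The function names and the assumption that the units are binary multiples are mine.)

```python
# Minimal sketch: compare logical size vs. replicated size per HDFS path.
# Assumption: units are binary multiples (K/M/G/T/P = powers of 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# The four lines pasted in the channel, copied verbatim.
BREAKDOWN = """\
213.4 G  641.4 G  /tmp
106.6 T  319.5 T  /user
15.9 T   47.7 T   /var
555.3 T  1.6 P    /wmf
"""


def to_bytes(value: str, unit: str) -> float:
    """Convert a human-readable size such as ('213.4', 'G') to bytes."""
    return float(value) * UNITS[unit]


def parse_du_line(line: str):
    """Split 'SIZE UNIT  SIZE UNIT  PATH' into (path, logical, replicated)."""
    v1, u1, v2, u2, path = line.split()
    return path, to_bytes(v1, u1), to_bytes(v2, u2)


def replication_ratios(text: str) -> dict:
    """Return {path: replicated_bytes / logical_bytes} for each line."""
    ratios = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        path, logical, replicated = parse_du_line(line)
        ratios[path] = replicated / logical
    return ratios


if __name__ == "__main__":
    for path, ratio in replication_ratios(BREAKDOWN).items():
        print(f"{path}: replicated/logical = {ratio:.2f}")
```

(Each ratio comes out close to 3, which is consistent with a replication factor of 3, so every logical terabyte dropped frees roughly three terabytes of raw disk on the workers.)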
[21:10:24] <elukey>	 joal: are you around by any chance?
[21:12:35] <elukey>	 ok going to apply some band aid
[21:12:53] <elukey>	 !log drop /var/log/hadoop-yarn/apps/analytics from hdfs to free space (15.1T replicated)
[21:12:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:21:22] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1094 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:21:45] <elukey>	 !log drop /user/{analytics|hdfs}/.Trash/* from hdfs to free space (~100T used)
[21:21:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:22:29] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1082 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:25:15] <elukey>	 !log drop /var/log/hadoop-yarn/apps/analytics-search/* from hdfs to free space (~8T replicated used)
[21:25:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:25:32] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1090 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:28:52] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:32:14] <elukey>	 !log drop /user/analytics-privatedata/.Trash/* from hdfs to free some space (~100G used)
[21:32:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:40:58] <elukey>	 !log force hdfs-balancer as attempt to redistribute hdfs blocks more evenly to worker nodes (hoping to free the busiest ones)
[21:40:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:41:34] <elukey>	 ok things seem a little bit more stable, will check tomorrow again