[03:58:20] Analytics, Analytics-Kanban, Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Nuria) >IIRC we have some basic alarms on thresholds in Kafka topics, but not in refined event tables. Did the partition checker not alarm when p...
[08:47:47] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) The biggest tables are: ` 36G _log_Popups_16364296_main_48b5_2_1d_B_0.tokudb 50G _log_MediaViewer_10867062_main_5026_2_1d_B_0.tokudb 6...
[14:33:01] Analytics, Analytics-Kanban: Add TLS encryption support to Kafkatee and enable it where possible - https://phabricator.wikimedia.org/T250147 (elukey) p:Triage→Medium
[14:33:10] Analytics, Analytics-Kanban: Add TLS encryption support to Kafkatee and enable it where possible - https://phabricator.wikimedia.org/T250147 (elukey)
[16:15:54] Analytics: jmads requesting Kerberos password - https://phabricator.wikimedia.org/T250560 (jmads)
[20:17:46] PROBLEM - Disk space on Hadoop worker on an-worker1082 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:26:24] PROBLEM - Disk space on Hadoop worker on an-worker1090 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:49:09] PROBLEM - Disk space on Hadoop worker on an-worker1088 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:50:05] whattt
[20:52:44] on 1082 all disks saturated
[20:54:44] RECOVERY - Disk space on Hadoop worker on an-worker1088 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:57:45] I think there is a job that creates a lot of temp files or similar
[20:57:46] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=6&fullscreen&orgId=1&refresh=5m&var-server=an-worker1082&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-12h&to=now
[21:01:12] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:01:40] ah wait this is bad
[21:01:41] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=12&fullscreen&orgId=1&refresh=5m&var-server=an-worker1082&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-2d&to=now
[21:01:58] we are already at a dangerous zone, 90% in the past two days
[21:02:52] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 15 GB (0% inode=99%): /var/lib/hadoop/data/d 31 GB (0% inode=99%): /var/lib/hadoop/data/l 24 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:08:28] breakdown is
[21:08:28] 213.4 G  641.4 G  /tmp
[21:08:29] 106.6 T  319.5 T  /user
[21:08:29] 15.9 T   47.7 T   /var
[21:08:29] 555.3 T  1.6 P    /wmf
[21:10:24] joal: are you around by any chance?
[21:12:35] ok going to apply some band aid
[21:12:53] !log drop /var/log/hadoop-yarn/apps/analytics from hdfs to free space (15.1T replicated)
[21:12:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:21:22] RECOVERY - Disk space on Hadoop worker on an-worker1094 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:21:45] !log drop /user/{analytics|hdfs}/.Trash/* from hdfs to free space (~100T used)
[21:21:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:22:29] RECOVERY - Disk space on Hadoop worker on an-worker1082 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:25:15] !log drop /var/log/hadoop-yarn/apps/analytics-search/* from hdfs to free space (~8T replicated used)
[21:25:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:25:32] RECOVERY - Disk space on Hadoop worker on an-worker1090 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:28:52] RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:32:14] !log drop /user/analytics-privatedata/.Trash/* from hdfs to free some space (~100G used)
[21:32:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:40:58] !log force hdfs-balancer as attempt to redistribute hdfs blocks more evenly to worker nodes (hoping to free the busiest ones)
[21:40:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:41:34] ok things seem a little bit more stable, will check tomorrow again
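
Note on the 21:08 breakdown: the log does not show which command produced it. The two-size layout resembles the output of `hdfs dfs -du -h` (logical size, then space consumed including replicas, then path), but that is an assumption. Under that reading, a rough Python sketch like the one below could check the replica ratio per directory and each directory's share of the total. The function names (`parse_du_line`, `summarize`) are hypothetical, and the units are assumed to be binary (1 T = 1024 G).

```python
import re

# Binary unit multipliers; whether the tool uses 1024 or 1000 is an assumption.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# Matches lines shaped like "213.4 G  641.4 G  /tmp".
LINE = re.compile(
    r"^\s*([\d.]+)\s*([KMGTP])\s+([\d.]+)\s*([KMGTP])\s+(\S+)\s*$"
)

# The four rows quoted in the log at 21:08:28-21:08:29.
BREAKDOWN = """\
213.4 G  641.4 G  /tmp
106.6 T  319.5 T  /user
15.9 T   47.7 T   /var
555.3 T  1.6 P    /wmf
"""


def parse_du_line(line):
    """Return (path, logical_bytes, replicated_bytes) for one breakdown row."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized breakdown line: {line!r}")
    logical = float(match.group(1)) * UNITS[match.group(2)]
    replicated = float(match.group(3)) * UNITS[match.group(4)]
    return match.group(5), logical, replicated


def summarize(text):
    """Return (path, replica_ratio, share_of_total) for each non-empty row."""
    rows = [parse_du_line(line) for line in text.splitlines() if line.strip()]
    total = sum(replicated for _, _, replicated in rows)
    return [
        (path, replicated / logical, replicated / total)
        for path, logical, replicated in rows
    ]


if __name__ == "__main__":
    for path, ratio, share in summarize(BREAKDOWN):
        print(f"{path:8s} replica ratio {ratio:.2f}  share of total {share:.1%}")
```

Each ratio comes out close to 3, which is consistent with HDFS's default replication factor of 3, and /wmf accounts for most of the replicated space. The /wmf ratio is slightly off because 1.6 P is heavily rounded.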