[16:54:57] we have three worker nodes with disks almost filled up
[16:55:06] two of them report unhealthy yarn worker nodes..
[16:55:55] we may need to force the balancer a bit to push more blocks to the new datanodes (they are ~30/35% filled up on avg, the rest of the cluster is 90/95)
[17:08:51] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:20:25] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:28:52] PROBLEM - Disk space on Hadoop worker on an-worker1107 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 21 GB (0% inode=99%): /var/lib/hadoop/data/e 16 GB (0% inode=99%): /var/lib/hadoop/data/m 26 GB (0% inode=99%): /var/lib/hadoop/data/g 24 GB (0% inode=99%): /var/lib/hadoop/data/c 24 GB (0% inode=99%): /var/lib/hadoop/data/b 24 GB (0% inode=99%): /var/lib/hadoop/data/i 26 GB (0% inode=99%): /var/lib/hadoop/data/l
[20:28:52] 99%): /var/lib/hadoop/data/d 26 GB (0% inode=99%): /var/lib/hadoop/data/f 26 GB (0% inode=99%): /var/lib/hadoop/data/k 26 GB (0% inode=99%): /var/lib/hadoop/data/j 24 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:44:04] PROBLEM - Disk space on Hadoop worker on an-worker1115 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 25 GB (0% inode=99%): /var/lib/hadoop/data/c 25 GB (0% inode=99%): /var/lib/hadoop/data/b 26 GB (0% inode=99%): /var/lib/hadoop/data/j 23 GB (0% inode=99%): /var/lib/hadoop/data/m 24 GB (0% inode=99%): /var/lib/hadoop/data/g 25 GB (0% inode=99%): /var/lib/hadoop/data/k 24 GB (0% inode=99%): /var/lib/hadoop/data/i
[20:44:04] 99%): /var/lib/hadoop/data/f 25 GB (0% inode=99%): /var/lib/hadoop/data/l 25 GB (0% inode=99%): /var/lib/hadoop/data/e 25 GB (0% inode=99%): /var/lib/hadoop/data/d 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:45:44] PROBLEM - Yarn Nodemanagers in unhealthy status on an-master1001 is CRITICAL: 3 ge 3 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Unhealthy_Yarn_Nodemanagers https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=46&fullscreen
[20:46:13] !log Force a run of mediawiki-history-drop-snapshot.service to clean up some data
[20:46:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:48:20] PROBLEM - Disk space on Hadoop worker on an-worker1115 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 25 GB (0% inode=99%): /var/lib/hadoop/data/c 25 GB (0% inode=99%): /var/lib/hadoop/data/b 27 GB (0% inode=99%): /var/lib/hadoop/data/j 24 GB (0% inode=99%): /var/lib/hadoop/data/m 25 GB (0% inode=99%): /var/lib/hadoop/data/g 25 GB (0% inode=99%): /var/lib/hadoop/data/k 24 GB (0% inode=99%): /var/lib/hadoop/data/i
[20:48:20] 99%): /var/lib/hadoop/data/f 24 GB (0% inode=99%): /var/lib/hadoop/data/l 26 GB (0% inode=99%): /var/lib/hadoop/data/e 26 GB (0% inode=99%): /var/lib/hadoop/data/d 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:49:43] !log Manually clean some data (mediawiki-history-drop-snapshot.service does not seem to be working)
[20:49:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:50:40] RECOVERY - Disk space on Hadoop worker on an-worker1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:51:05] Ok we should be fine for now - let's reconvene on this tomorrow
[20:52:48] RECOVERY - Disk space on Hadoop worker on an-worker1107 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:54:44] RECOVERY - Yarn Nodemanagers in unhealthy status on an-master1001 is OK: (C)3 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Unhealthy_Yarn_Nodemanagers https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=46&fullscreen