[00:46:14] 10Analytics, 10Product-Analytics: Direct link generator to reports in Superset has the incorrect hostname - https://phabricator.wikimedia.org/T238461 (10kzimmerman)
[04:11:39] 10Analytics, 10Pageviews-Anomaly: Manipulation of pageview statistics - https://phabricator.wikimedia.org/T232992 (10Nuria) Please see T238357 for the upcoming task to label bot spikes as automated traffic; this will address part of this issue and result in more sensible top lists
[07:49:53] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 20 GB (0% inode=99%): /var/lib/hadoop/data/d 16 GB (0% inode=99%): /var/lib/hadoop/data/e 17 GB (0% inode=99%): /var/lib/hadoop/data/f 23 GB (0% inode=99%): /var/lib/hadoop/data/c 19 GB (0% inode=99%): /var/lib/hadoop/data/l 23 GB (0% inode=99%): /var/lib/hadoop/data/b 25 GB (0% inode=99%): /var/lib/hadoop/data/k
[07:49:53] 99%): /var/lib/hadoop/data/i 25 GB (0% inode=99%): /var/lib/hadoop/data/h 21 GB (0% inode=99%): /var/lib/hadoop/data/m 22 GB (0% inode=99%): /var/lib/hadoop/data/j 17 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:25:27] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 25 GB (0% inode=99%): /var/lib/hadoop/data/b 24 GB (0% inode=99%): /var/lib/hadoop/data/f 26 GB (0% inode=99%): /var/lib/hadoop/data/k 27 GB (0% inode=99%): /var/lib/hadoop/data/g 26 GB (0% inode=99%): /var/lib/hadoop/data/m 19 GB (0% inode=99%): /var/lib/hadoop/data/c 27 GB (0% inode=99%): /var/lib/hadoop/data/d
[08:25:27] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 26 GB (0% inode=99%): /var/lib/hadoop/data/i 26 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:28:19] 10Analytics, 10Pageviews-Anomaly: Manipulation of pageview statistics - https://phabricator.wikimedia.org/T232992 (10Der_Keks) @MusikAnimal, that's what we have been seeing for months: {F31092769}
[08:44:15] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 24 GB (0% inode=99%): /var/lib/hadoop/data/b 23 GB (0% inode=99%): /var/lib/hadoop/data/f 27 GB (0% inode=99%): /var/lib/hadoop/data/k 27 GB (0% inode=99%): /var/lib/hadoop/data/g 25 GB (0% inode=99%): /var/lib/hadoop/data/m 18 GB (0% inode=99%): /var/lib/hadoop/data/c 26 GB (0% inode=99%): /var/lib/hadoop/data/d
[08:44:15] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 24 GB (0% inode=99%): /var/lib/hadoop/data/i 24 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:52:49] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/b 22 GB (0% inode=99%): /var/lib/hadoop/data/f 26 GB (0% inode=99%): /var/lib/hadoop/data/k 26 GB (0% inode=99%): /var/lib/hadoop/data/g 30 GB (0% inode=99%): /var/lib/hadoop/data/m 21 GB (0% inode=99%): /var/lib/hadoop/data/c 26 GB (0% inode=99%): /var/lib/hadoop/data/d
[08:52:49] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 23 GB (0% inode=99%): /var/lib/hadoop/data/i 23 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
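[Editor's note: the alerts above report free space per datanode partition at 0% on an-worker1094/1095. For context, here is a minimal Python sketch of this kind of per-partition free-space check. The mount glob matches the paths in the alerts, but the 30 GB threshold and the script itself are illustrative assumptions, not the actual Icinga plugin or its production configuration.]

```python
#!/usr/bin/env python3
"""Sketch of a per-partition free-space check for Hadoop datanode
mounts, in the spirit of the Icinga alerts above. Illustrative only."""
import glob
import shutil
import sys

DATA_GLOB = "/var/lib/hadoop/data/*"  # layout assumed from the alerts
CRIT_FREE_GB = 30                     # hypothetical critical threshold

def main() -> int:
    critical = []
    for mount in sorted(glob.glob(DATA_GLOB)):
        usage = shutil.disk_usage(mount)
        free_gb = usage.free / 1024**3
        pct_free = 100 * usage.free / usage.total
        if free_gb < CRIT_FREE_GB:
            critical.append(f"{mount} {free_gb:.0f} GB ({pct_free:.0f}%)")
    if critical:
        print("DISK CRITICAL - free space: " + ": ".join(critical))
        return 2  # Nagios/Icinga CRITICAL exit code
    print("DISK OK")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```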
[09:03:03] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 20 GB (0% inode=99%): /var/lib/hadoop/data/b 21 GB (0% inode=99%): /var/lib/hadoop/data/f 23 GB (0% inode=99%): /var/lib/hadoop/data/k 25 GB (0% inode=99%): /var/lib/hadoop/data/g 24 GB (0% inode=99%): /var/lib/hadoop/data/m 21 GB (0% inode=99%): /var/lib/hadoop/data/c 24 GB (0% inode=99%): /var/lib/hadoop/data/d
[09:03:03] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 21 GB (0% inode=99%): /var/lib/hadoop/data/i 20 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:44:22] !log systemctl restart hadoop-* on analytics1077 after OOM killer
[09:44:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:44:29] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:44:29] RECOVERY - Hadoop DataNode on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:46:03] so an-worker109[45] both have all their datanode disks full
[09:50:49] and analytics1077 showed the OOM issue again
[09:50:53] RECOVERY - Disk space on Hadoop worker on an-worker1094 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:51:17] RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:52:14] wow
[09:53:16] Rack: /eqiad/D/7
[09:53:17] 10.64.53.33:50010 (analytics1077.eqiad.wmnet)
[09:53:17] 10.64.53.36:50010 (an-worker1094.eqiad.wmnet)
[09:53:17] 10.64.53.37:50010 (an-worker1095.eqiad.wmnet)
[09:54:00] ok so my theory is that analytics1077 being down for hours caused all of its blocks to be re-replicated onto 94/95
[09:54:44] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&panelId=41&fullscreen&from=now-24h&to=now
[09:59:32] I added, under Global usage, a breakdown of HDFS blocks written/read/removed by datanode
[09:59:35] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-6h&to=now
[09:59:58] take a look at 94/95 now that 77 is back up
[09:59:59] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-6h&to=now&panelId=96&fullscreen
[10:00:03] joal: ---^
[10:02:18] ok all good, need to run some errands but will check later
[10:30:34] wow - Thanks elukey for the fast answer!!!
[13:24:21] 10Analytics, 10Pageviews-Anomaly: Manipulation of pageview statistics - https://phabricator.wikimedia.org/T232992 (10Superbass) @MusikAnimal The complaints are about the mobile app with its trending list. I thought it would use the same database as topviews. So, who can remove articles from the mobile app's...
[22:26:55] 10Analytics, 10Pageviews-Anomaly: Manipulation of pageview statistics - https://phabricator.wikimedia.org/T232992 (10MusikAnimal) >>! In T232992#5668846, @Superbass wrote: > @MusikAnimal The complaints are about the mobile app with its trending list. I thought it would use the same database as topviews. > >...
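[Editor's note: on the morning's HDFS incident at 09:54 above — when a DataNode stays unreachable long enough (roughly ten minutes by default), the NameNode marks it dead and schedules its blocks for re-replication onto other nodes, which is the proposed explanation for the same-rack nodes an-worker1094/1095 filling up while analytics1077 was down. A quick way to spot datanodes getting squeezed this way is to scan `hdfs dfsadmin -report`. The sketch below assumes the `hdfs` CLI is on PATH and the usual "Name:" / "DFS Remaining%:" report fields; the 5% cutoff is an illustrative assumption.]

```python
#!/usr/bin/env python3
"""Sketch: flag datanodes that are nearly full, e.g. after block
re-replication from a dead node. Illustrative, not a production check."""
import re
import subprocess

THRESHOLD_PCT = 5.0  # hypothetical "nearly full" cutoff

# Parse per-datanode sections of the dfsadmin report; lines before the
# first "Name:" (the cluster summary) are skipped because node is None.
report = subprocess.run(
    ["hdfs", "dfsadmin", "-report"],
    capture_output=True, text=True, check=True,
).stdout

node = None
for line in report.splitlines():
    if line.startswith("Name:"):
        node = line.split("Name:", 1)[1].strip()
    elif line.startswith("DFS Remaining%:"):
        m = re.search(r"([\d.]+)%", line)
        if node and m and float(m.group(1)) < THRESHOLD_PCT:
            print(f"nearly full: {node} ({m.group(1)}% remaining)")
```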
[22:32:38] 10Analytics, 10Pageviews-Anomaly: Manipulation of pageview statistics - https://phabricator.wikimedia.org/T232992 (10Der_Keks)