[03:04:50] PROBLEM - Check the last execution of refinery-import-page-history-dumps on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:38:38] PROBLEM - Disk space on Hadoop worker on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:38:42] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:40:02] PROBLEM - Hadoop DataNode on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:36:59] !log powercycle an-worker1091 - bug soft lock up CPU showed in mgmt console
[07:37:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:39:40] 1093 seems overloaded https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-worker1093&var-datasource=thanos&var-cluster=analytics
[07:41:58] !log powercycle an-worker1093 - bug soft lock up CPU showed in mgmt console
[07:41:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:46:02] RECOVERY - Hadoop DataNode on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:26] RECOVERY - Disk space on Hadoop worker on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:30] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:50:57] gooood
[07:54:04] PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 12 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[07:54:24] lovely
[07:55:00] ah snap of course hdfs was replicating the blocks elsewhere, when the workers were down
[07:55:25] ok will check in ~30 mins to see if things are stabilized
[08:02:27] wow
[08:04:07] summarizing my understanding: an-worker1093 had an issue (disk-space?), then failed, and hadoop started replicating blocks on other workers
[08:06:52] From that chart https://grafana.wikimedia.org/d/000000585/hadoop?panelId=25&fullscreen&orgId=1&from=now-6h&to=now, it seems the problem started with a bunch of data being written to HDFS
[08:20:10] I think I found the culprit: page-history dump conversion kicked off at 7:34 today (https://yarn.wikimedia.org/proxy/application_1592377297555_11779/)
[08:20:30] Ok I feel better now that we know - going in weekend mode again, will check later
[08:37:34] joal: two workers had issues - 1091/93 - the first was a complete lock-up, the second a little bit more graceful. Seems related to the kernel stalling too much when executing, starving tasks
[08:37:50] may be due to some kind of load, every now and then we see this kind of thing
[08:38:02] hdfs was replicating the blocks elsewhere of course :(
[08:38:27] now the under replicated blocks are zero
[08:38:39] and the hdfs corrupt blocks are ~3
[08:39:54] the issue started at around 2:30 UTC (first worker down) then 3:20 UTC (second down) https://grafana.wikimedia.org/d/000000585/hadoop?panelId=41&fullscreen&var-hadoop_cluster=analytics-hadoop&orgId=1&from=1592617700995&to=1592625335981
[08:40:11] so I don't think it is page history joal
[08:40:50] it seems more like a weird kernel issue due to some I/O load patterns
[08:59:50] really strange, anyway just sent an email, it is confusing that the alarms fired with some delay
[09:00:08] on the #ops chan I see some alerts happening even before
[09:00:18] well weekend time, we'll see on monday :)
[09:00:22] looks good now
[13:13:28] RECOVERY - HDFS corrupt blocks on an-master1001 is OK: (C)5 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[19:25:56] thanks for the summary elukey :)
[19:26:31] side note: ottomata was spot-right on June 18th (2 days ago) - Spark 3.0 got released that day :)
[21:55:51] Analytics-EventLogging, Analytics-Radar, NewcomerTasks 1.2, Product-Analytics, and 2 others: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 (Tgr) >>! In T255597#6232016, @Ottomata wrote: > Ya valid JSONSchema but not valid f...