[03:04:50] <icinga-wm_>	 PROBLEM - Check the last execution of refinery-import-page-history-dumps on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:38:38] <icinga-wm_>	 PROBLEM - Disk space on Hadoop worker on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:38:42] <icinga-wm_>	 PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:40:02] <icinga-wm_>	 PROBLEM - Hadoop DataNode on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:36:59] <elukey>	 !log powercycle an-worker1091 - bug soft lock up CPU showed in mgmt console
[07:37:01] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:39:40] <elukey>	 1093 seems overloaded https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-worker1093&var-datasource=thanos&var-cluster=analytics
[07:41:58] <elukey>	 !log powercycle an-worker1093 - bug soft lock up CPU showed in mgmt console
[07:41:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:46:02] <icinga-wm_>	 RECOVERY - Hadoop DataNode on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:26] <icinga-wm_>	 RECOVERY - Disk space on Hadoop worker on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:30] <icinga-wm_>	 RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:50:57] <elukey>	 gooood
[07:54:04] <icinga-wm_>	 PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 12 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[07:54:24] <elukey>	 lovely
[07:55:00] <elukey>	 ah snap of course hdfs was replicating the blocks elsewhere, when the workers were down
[07:55:25] <elukey>	 ok will check in ~30 mins to see if things are stabilized
[08:02:27] <joal>	 wow
[08:04:07] <joal>	 summarizing my understanding: an-worker1093 had an issue (disk-space?), then failed, and hadoop started replicating blocks on other workers
[08:06:52] <joal>	 From that chart https://grafana.wikimedia.org/d/000000585/hadoop?panelId=25&fullscreen&orgId=1&from=now-6h&to=now, it seems the problem started with a bunch of data being written to HDFS
[08:20:10] <joal>	 I think I found the culprit: page-history dump conversion kicked off at 7:34 today (https://yarn.wikimedia.org/proxy/application_1592377297555_11779/)
[08:20:30] <joal>	 Ok I feel better now that we know - going in weekend mode again, will check later
[08:37:34] <elukey>	 joal: two workers had issues - 1091/93 - the first was a complete lock-up, the second a little bit more graceful. Seems related to the kernel stalling too much when executing, starving tasks 
[08:37:50] <elukey>	 may be due to some kind of load, every now and then we see this kind of thing
[08:38:02] <elukey>	 hdfs was replicating the blocks elsewhere of course :(
[08:38:27] <elukey>	 now the under replicated blocks are zero
[08:38:39] <elukey>	 and the hdfs corrupt blocks are ~3
[08:39:54] <elukey>	 the issue started at around 2:30 UTC (first worker down) then 3:20 UTC (second down) https://grafana.wikimedia.org/d/000000585/hadoop?panelId=41&fullscreen&var-hadoop_cluster=analytics-hadoop&orgId=1&from=1592617700995&to=1592625335981
[08:40:11] <elukey>	 so I don't think it is page history joal 
[08:40:50] <elukey>	 it seems more a weird kernel issue due to some I/O load patterns
[08:59:50] <elukey>	 really strange, anyway just sent an email, it is confusing that the alarms fired with some delay 
[09:00:08] <elukey>	 on the #ops chan I see some alerts happening even before 
[09:00:18] <elukey>	 well weekend time, we'll see on monday :)
[09:00:22] <elukey>	 looks good now
[13:13:28] <icinga-wm_>	 RECOVERY - HDFS corrupt blocks on an-master1001 is OK: (C)5 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[19:25:56] <joal>	 thanks for the summary elukey :)
[19:26:31] <joal>	 side note: ottomata was spot-right on June 18th (2 days ago) - Spark 3.0 got released that day :)
[21:55:51] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Radar, 10NewcomerTasks 1.2, 10Product-Analytics, and 2 others: NewcomerTask EventLogging schema has invalid array items type specification - https://phabricator.wikimedia.org/T255597 (10Tgr) >>! In T255597#6232016, @Ottomata wrote: > Ya valid JSONSchema but not valid f...