[05:06:59] Analytics, Pageviews-API: Add wikimania.wikimedia.org to pageview definition - https://phabricator.wikimedia.org/T216525 (Billinghurst) Did this get discussed at any point?
[12:04:20] PROBLEM - Webrequests Varnishkafka log producer on cp3053 is CRITICAL: connect to address 10.20.0.53 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:18:20] RECOVERY - Webrequests Varnishkafka log producer on cp3053 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[14:04:40] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:09:08] excuse-me, I think I hammered too strong with spark :( --^
[17:56:46] usercache/joal/appcache/application_1583418280867_12912 , Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[17:56:49] :P
[17:57:38] very weird though that the systemd unit still sees yarn as up
[17:57:44] meanwhile no process is around
[17:58:12] !log restart hadoop-yarn-nodemanger on an-worker1087
[17:58:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:58:38] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:01:22] ah ok I just checked the systemd unit
[18:01:31] Restart=no
[18:01:50] ok that explains it
[18:02:08] maybe the thought is to avoid auto-restarting and force people to check
[18:02:17] but it doesn't make a lot of sense to me
[18:02:44] same thing in bigtop
[18:02:49] anyway, afk again :)
[20:25:12] Quarry, DBA, Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) I'm now getting the normal 'killed' message for going over 30 minutes, rather than the MySQL error. So perhaps things are now back to normal?