[00:15:58] namenodes are back and recovering. HDFS is up and working. Fix is not yet puppetized, and I need to run some tests before doing so.
[00:16:16] (So please still wait a bit before starting new jobs)
[00:16:31] (But we can bring the system back :-) )
[00:20:54] qchris: you the man
[00:21:04] any idea of root cause? (i.e. jobs that we should not run?)
[00:29:23] Analytics-Cluster: Hadoop namenodes are again and again dying with error "Java heap space" - https://phabricator.wikimedia.org/T88871#1022069 (QChris) NEW a:QChris
[00:58:20] (CR) Nuria: Add funnel-gathering sql and prototype html (1 comment) [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/188601 (owner: Milimetric)
[01:17:20] milimetric: you are probably no longer there (good) but if you are let me know if you want help with the cron on stat1003
[01:17:40] namenode changes are puppetized, merged, and deployed. HDFS is running stable again.
[01:17:55] Heap has been doubled for the time being (until ottomata comes back)
[01:18:01] Feel free to start your jobs again.
[01:18:11] I'll do some write-up for the list.
[01:28:46] Analytics-Cluster: Hadoop namenodes are again and again dying with error "Java heap space" - https://phabricator.wikimedia.org/T88871#1022249 (QChris) Open>Resolved HDFS is up and working again. Thanks Andrew Boggott!
[01:32:37] !log name nodes died with error "Java heap space" and did not come back up. Bumping heap allowed to resurrect them (See {{PhabT|88871}}).
[02:10:53] !log Ran kafka leader re-election as analytics1021 dropped out of its partition leader role.
[02:11:24] Nice day for the cluster :-)
[02:14:17] Analytics-General-or-Unknown: Kafka broker analytics1021 not receiving messages every now and then - https://phabricator.wikimedia.org/T71667#1022330 (QChris) Happened again around 2015-02-07T01:06. Ran leader-election and analytics1021 is back in the game.
[04:51:37] Analytics-Kanban, Analytics-Cluster: Estimate roughly of how many users might not have javascript capable/enable browsers - https://phabricator.wikimedia.org/T88560#1022418 (Nuria) Got all data that I needed. Processing it now.
[17:32:46] (PS1) QChris: Deduplicate legacy-tsvs webrequest_source dependency information [analytics/refinery] - https://gerrit.wikimedia.org/r/189218
[17:32:48] (PS1) QChris: Add per webrequest_source 5xx tsvs to legacy_tsvs [analytics/refinery] - https://gerrit.wikimedia.org/r/189219