[00:30:13] jdlrobson: not necessarily, although it certainly would be useful [00:30:22] see what I wrote in T203814#4604373 [00:30:24] T203814: Turn on MinervaErrorLogSamplingRate (Schema:WebClientError) - https://phabricator.wikimedia.org/T203814 [02:03:09] tgr|away: +1 to deploying raven cc jdlrobson [02:20:32] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Time dimension carried on url for top metrics - https://phabricator.wikimedia.org/T206479 (10Nuria) 05Open>03Resolved [02:20:47] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (10Nuria) 05Open>03Resolved [02:21:02] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Routing code allows invalid routes - https://phabricator.wikimedia.org/T188792 (10Nuria) 05Open>03Resolved [02:21:26] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Cleanup refinery artifact folder from old jars - https://phabricator.wikimedia.org/T206687 (10Nuria) 05Open>03Resolved [02:21:50] 10Analytics, 10Analytics-Kanban, 10DBA, 10Growth-Team, and 2 others: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (10Nuria) 05Open>03Resolved [02:22:10] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Correct data-removal jobs for mediawiki tables (public and private) - https://phabricator.wikimedia.org/T198600 (10Nuria) 05Open>03Resolved [02:23:21] 10Analytics, 10Analytics-Kanban: eventlogging_db_sanitization script failed - https://phabricator.wikimedia.org/T207165 (10Nuria) 05Open>03Resolved [02:23:33] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: EventLogging sanitization - https://phabricator.wikimedia.org/T199898 (10Nuria) [02:23:37] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: [EL sanitization] Store the old salt for 2 extra weeks - https://phabricator.wikimedia.org/T199900 (10Nuria) 05Open>03Resolved [02:23:57] 
10Analytics, 10New-Readers: Instrument the landing page - https://phabricator.wikimedia.org/T202592 (10Prtksxna) [02:24:07] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.946e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [02:29:54] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats 2.0: "aa.wikipedia.org" exists and has data available, but marked "Invalid" - https://phabricator.wikimedia.org/T187414 (10Nuria) za.wiktionary.org , usability.wikimedia.org and aa.wikipedia.org are now selectable options... [02:31:13] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats 2.0: "aa.wikipedia.org" exists and has data available, but marked "Invalid" - https://phabricator.wikimedia.org/T187414 (10Nuria) 05Open>03Resolved [02:31:57] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.946e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [02:39:47] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.946e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [02:41:26] (03CR) 10Nuria: [C: 031] "If tests pass and we have tried to build aqs with success with these changes let's merge and deploy." 
[analytics/aqs] - 10https://gerrit.wikimedia.org/r/467733 (https://phabricator.wikimedia.org/T206474) (owner: 10Fdans) [02:48:24] 10Analytics: events_sanitized could drop columns like recvfrom and sequenceId - https://phabricator.wikimedia.org/T207431 (10Nuria) [02:48:48] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: EventLogging sanitization - https://phabricator.wikimedia.org/T199898 (10Nuria) [02:48:52] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: [EL sanitization] Retroactively sanitize (including hash and salt appInstallId fields) data in the events database - https://phabricator.wikimedia.org/T199902 (10Nuria) 05Open>03Resolved [02:49:11] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Table view of timely results in wikistats 2 should be ordered in time descending - https://phabricator.wikimedia.org/T199693 (10Nuria) [02:49:34] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Performance tweaks for state management in wikistats - https://phabricator.wikimedia.org/T207352 (10Nuria) [02:52:08] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: "Anonymous Editor" is a broken link - https://phabricator.wikimedia.org/T206968 (10Nuria) To fix bug: - need to change and deploy aqs glue code that is substituting IPs by anonymous editors - need to update wikistats UI to just print a st... 
[02:52:44] 10Analytics, 10Analytics-Wikistats, 10Patch-For-Review: Improvements to Wikistats2 chart popups - https://phabricator.wikimedia.org/T192416 (10Nuria) [02:53:07] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.946e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [02:54:21] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Contributors-Analysis, 10Product-Analytics: Attempting to select all columns of mediawiki_history sometimes fails with a cryptic error message - https://phabricator.wikimedia.org/T205367 (10Nuria) 05Open>03Resolved [03:05:08] 10Analytics, 10Analytics-Wikistats: Active Editors metric per project family - https://phabricator.wikimedia.org/T188265 (10Nuria) a:03JAllemandou [03:05:27] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.946e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [03:09:57] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.946e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [03:11:23] 10Analytics, 10Analytics-Wikistats: Active Editors metric per project family - https://phabricator.wikimedia.org/T188265 (10Nuria) We aim to have this metric deployed on the API by the end of this quarter (December 2018) [03:12:41] 10Analytics, 10Analytics-Wikistats: Active Editors metric per project family - https://phabricator.wikimedia.org/T188265 (10Nuria) [03:12:48] 10Analytics, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats 2.0: allow to view stats for all language versions (a.k.a. 
Project families) - https://phabricator.wikimedia.org/T188550 (10Nuria) [03:17:48] 10Analytics, 10Analytics-Kanban, 10Contributors-Analysis, 10Product-Analytics, 10Patch-For-Review: Decommission edit analysis dashboard - https://phabricator.wikimedia.org/T199340 (10Nuria) The placeholders were a good idea, closing ticket. [03:17:58] 10Analytics, 10Analytics-Kanban, 10Contributors-Analysis, 10Product-Analytics, 10Patch-For-Review: Decommission edit analysis dashboard - https://phabricator.wikimedia.org/T199340 (10Nuria) 05Open>03Resolved [03:21:54] 10Analytics, 10Community-Tech, 10Grant-Metrics: Review category queries - https://phabricator.wikimedia.org/T206783 (10Nuria) [03:23:26] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.98e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [03:42:17] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.979e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [05:33:51] good morning yarn [05:34:00] woah a lot of alarms :( [05:37:15] ahhh snap the threshold for the alarm is the old one [05:37:16] sigh [05:43:35] fixing it [05:44:31] 10Analytics, 10Community-Tech, 10Grant-Metrics: Review category queries - https://phabricator.wikimedia.org/T206783 (10Marostegui) If it is not strictly necessary I would rather not create a new index on labs to avoid it drifting too much from production. So if it is possible to split the query into smaller... 
[05:47:10] but I have an idea about those alarms [05:47:17] for example, if I simply say [05:47:27] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: 1.946e+09 ge 1.946e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [05:47:33] heap usage / heap max > 0.90 -> critical [05:47:35] otherwise no [05:47:53] I get an alarm that takes two metrics and doesn't need thresholds [05:47:59] like in this case (2G -> 4G) [05:50:04] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics, 10iOS-app-v6.1-Narwhal-On-A-Bumper-Car: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10elukey) p:05Triage>03High [05:50:43] lovely, from 22 UTC there are 120 errors/s for --^ [05:53:13] can we apply manually what Chelsy is suggesting in https://phabricator.wikimedia.org/T207424#4679134 ? [05:59:06] chelsyx: --^ (99% you are not online but worth a try :) [06:00:53] RECOVERY - YARN active ResourceManager JVM Heap usage on an-master1001 is OK: (C)3.891e+09 ge (W)3.686e+09 ge 1.928e+09 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [06:51:35] fixed https://grafana.wikimedia.org/dashboard/db/hadoop to host all the new metrics with "hadoop_cluster" labels [06:51:53] of course we don't have history in this graph [06:54:32] I added a banner to https://grafana.wikimedia.org/dashboard/db/analytics-hadoop [06:57:29] I took a look at the HDFS Namenode's GC metrics, they are not really super good [06:57:38] Old gen collections are very slow [06:57:39] mmmm [07:03:18] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) a:03ayounsi [07:04:58] Hi elukey [07:05:09] good alarm morning [07:06:40] Bonjour :) [07:06:46] I am changing the jvm alarms 
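[editor's note] elukey's threshold-free alarm idea above (alert on the used/max heap ratio instead of an absolute byte threshold) can be sketched as a plain ratio check. The numbers below are illustrative, echoing the 2G to 4G heap bump discussed in the log; the real alert would be a Prometheus query over two metrics, not a shell script.

```shell
# Sketch of the ratio-based check: CRITICAL when used/max > 0.90,
# regardless of how big the configured heap is.
check_heap() {
  awk -v u="$1" -v m="$2" \
    'BEGIN { r = u / m; printf "ratio=%.2f %s\n", r, ((r > 0.90) ? "CRITICAL" : "OK") }'
}

check_heap 1946000000 2147483648   # ~1.946e+09 used of a 2G heap -> ratio=0.91 CRITICAL
check_heap 1946000000 4294967296   # same usage after bumping the heap to 4G -> ratio=0.45 OK
```

With a fixed byte threshold, the second case would keep alarming after the heap was raised; the ratio form needs no retuning when heap sizes change.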
[07:06:58] found a way to calculate the avg of the ratio between used/max [07:07:31] the current way (my bad) is too error prone [07:07:38] (fixed thresholds) [07:09:58] ack elukey [07:10:29] elukey: EL errors are still coming from ReadingDepthSchema.enable [07:10:50] I'm going to add that little command I devised yesterday to the on-call doc [07:12:23] joal: there are also errors for MobileWikiAppiOSUserHistory [07:12:38] I didn't check what is the biggest one since I fixed the Yarn alarms [07:12:39] elukey (from any stat machine): kafkacat -b kafka-jumbo1001 -t eventlogging_EventError -o -10000 -e | sed -n 's/^.*"schema": "\([^"]*\)"}.*$/\1/p' | sort | uniq -c [07:12:50] adding that to on-call :) [07:12:50] ah yes very nice [07:16:02] ah snap I am reading the task now, the fix needs a deployment... [07:16:19] elukey: for bash masters that command is not even needed to be written anywhere, but I'm always fighting whenever I need to awk/sed [07:16:44] Docs updated [07:16:52] thanks! [07:21:52] I'd have used | egrep -o "\"schema\": \"(\w*)\"" but it of course prints "schema: etc.." [07:21:56] interesting [07:22:09] I always forget the tricks about these tools :( [07:22:58] ah no even this one is not enough, there are multiple schema: [07:22:59] lol [07:23:08] :) [07:23:30] elukey: the sed one works because of event-fields order being consistent [07:23:58] elukey: without field order consistency, we could match the core schema (EventError) in which we are not interested [07:24:39] yeah I know but my command doesn't always print both, now I am curious [07:39:16] really weird, with grep -o I can't find a way to say "print only the first match" [07:41:40] /dev/mapper/eventlog1002--vg-data 870G 780G 46G 95% /srv [07:41:44] sigh [07:41:52] :/ [07:42:23] elukey: should we copy files on HDFS to temporarilly free some space? 
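[editor's note] Returning to joal's schema-counting one-liner above: a quick way to see why its sed expression works is to run the sed/sort/uniq tail of the pipeline on fabricated input. The sample events below are made up and much shorter than real EventError payloads; only the field-ordering property joal points out is preserved.

```shell
# Three fake EventError events. The nested event "schema" is the one
# immediately followed by "}", which the greedy sed pattern anchors on,
# so the outer EventError wrapper schema is skipped.
counts=$(printf '%s\n' \
  '{"schema": "EventError", "event": {"schema": "ReadingDepth"}}' \
  '{"schema": "EventError", "event": {"schema": "MobileWikiAppiOSUserHistory"}}' \
  '{"schema": "EventError", "event": {"schema": "ReadingDepth"}}' \
  | sed -n 's/^.*"schema": "\([^"]*\)"}.*$/\1/p' \
  | sort | uniq -c | awk '{print $1, $2}')   # awk squeezes uniq's count padding
echo "$counts"
# 1 MobileWikiAppiOSUserHistory
# 2 ReadingDepth
```

This also explains the grep -o dead end in the log: grep -o prints every match on a line, including the outer "schema": "EventError" one, and grep has no per-line first-match-only switch, so the greedy sed pattern with its closing-brace anchor is the simpler disambiguator here.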
[07:44:00] joal: in my mind there is no point in keeping more than say 7 days on eventlog1002 [07:44:07] since on stat1005 we rsync for 90 days [07:44:14] ah right [07:44:16] hm [07:44:45] we currently keep 20d [07:47:07] for example, I just manually removed 3d [07:47:08] /dev/mapper/eventlog1002--vg-data 870G 555G 272G 68% /srv [07:47:22] pff [07:47:35] ok so lemme lower down the retention to 15d now [07:47:43] just to be sure for the weekend [07:47:44] +1 elukey [07:47:54] need to run errand for ~1h30 - will be back [07:48:50] ack! [08:34:37] PROBLEM - YARN NodeManager JVM Heap usage on analytics1060 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:08] PROBLEM - YARN NodeManager JVM Heap usage on analytics1049 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:08] PROBLEM - YARN NodeManager JVM Heap usage on analytics1051 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:08] PROBLEM - YARN NodeManager JVM Heap usage on analytics1044 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:08] PROBLEM - HDFS DataNode JVM Heap usage on analytics1071 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:09] PROBLEM - YARN NodeManager JVM Heap usage on analytics1056 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:12] 
this is me [08:35:18] PROBLEM - HDFS DataNode JVM Heap usage on analytics1029 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:18] PROBLEM - HDFS DataNode JVM Heap usage on analytics1044 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:18] PROBLEM - HDFS DataNode JVM Heap usage on analytics1032 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:18] PROBLEM - HDFS DataNode JVM Heap usage on analytics1048 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:27] PROBLEM - HDFS DataNode JVM Heap usage on analytics1065 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:37] PROBLEM - HDFS DataNode JVM Heap usage on analytics1073 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:38] PROBLEM - HDFS DataNode JVM Heap usage on analytics1035 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:38] PROBLEM - HDFS DataNode JVM Heap usage on analytics1063 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:38] PROBLEM - YARN NodeManager JVM Heap usage on analytics1046 is CRITICAL: bad_data: parse error at char 121: missing unit character in 
duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:39] PROBLEM - YARN NodeManager JVM Heap usage on analytics1057 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:47] fixing uff... [08:35:47] PROBLEM - YARN NodeManager JVM Heap usage on analytics1062 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:48] PROBLEM - Zookeeper node JVM Heap usage on conf2003 is CRITICAL: bad_data: parse error at char 168: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:35:48] PROBLEM - YARN NodeManager JVM Heap usage on analytics1032 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:48] PROBLEM - HDFS DataNode JVM Heap usage on analytics1046 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:48] PROBLEM - HDFS DataNode JVM Heap usage on analytics1070 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:49] PROBLEM - HDFS DataNode JVM Heap usage on analytics1028 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:49] PROBLEM - YARN NodeManager JVM Heap usage on analytics1061 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration 
https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:49] PROBLEM - YARN NodeManager JVM Heap usage on analytics1035 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:49] PROBLEM - YARN NodeManager JVM Heap usage on analytics1073 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:57] PROBLEM - Zookeeper node JVM Heap usage on conf1006 is CRITICAL: bad_data: parse error at char 168: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:35:58] PROBLEM - HDFS DataNode JVM Heap usage on analytics1049 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:58] PROBLEM - YARN NodeManager JVM Heap usage on analytics1065 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:58] PROBLEM - HDFS DataNode JVM Heap usage on analytics1060 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:58] PROBLEM - HDFS DataNode JVM Heap usage on analytics1056 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:35:59] PROBLEM - YARN NodeManager JVM Heap usage on analytics1048 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:59] 
PROBLEM - YARN NodeManager JVM Heap usage on analytics1063 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:35:59] PROBLEM - YARN NodeManager JVM Heap usage on analytics1070 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:36:07] PROBLEM - Zookeeper node JVM Heap usage on druid1002 is CRITICAL: bad_data: parse error at char 170: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:36:08] PROBLEM - YARN NodeManager JVM Heap usage on analytics1071 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [08:36:08] PROBLEM - HDFS DataNode JVM Heap usage on analytics1051 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:36:08] PROBLEM - HDFS DataNode JVM Heap usage on analytics1061 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:36:46] nothing is happening, only a bad prometheus query [08:37:22] unclosed left parenthesis -> /me cries [08:40:38] PROBLEM - HDFS active Namenode JVM Heap usage on an-master1001 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1 [08:44:17] PROBLEM - YARN active ResourceManager JVM Heap usage on an-master1001 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis 
https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [08:48:42] RECOVERY - HDFS DataNode JVM Heap usage on analytics1028 is OK: (C)0.95 ge (W)0.9 ge 0.3086 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:48:43] RECOVERY - HDFS DataNode JVM Heap usage on analytics1049 is OK: (C)0.95 ge (W)0.9 ge 0.3427 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:48:44] RECOVERY - HDFS DataNode JVM Heap usage on analytics1056 is OK: (C)0.95 ge (W)0.9 ge 0.4827 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:48:44] RECOVERY - HDFS DataNode JVM Heap usage on analytics1060 is OK: (C)0.95 ge (W)0.9 ge 0.3248 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:02] RECOVERY - HDFS DataNode JVM Heap usage on analytics1051 is OK: (C)0.95 ge (W)0.9 ge 0.3205 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:03] RECOVERY - HDFS DataNode JVM Heap usage on analytics1061 is OK: (C)0.95 ge (W)0.9 ge 0.4937 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:04] RECOVERY - HDFS DataNode JVM Heap usage on analytics1071 is OK: (C)0.95 ge (W)0.9 ge 0.3528 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:13] RECOVERY - HDFS DataNode JVM Heap usage on analytics1029 is OK: (C)0.95 ge (W)0.9 ge 0.4981 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:13] RECOVERY - HDFS DataNode JVM Heap usage on analytics1032 is OK: (C)0.95 ge (W)0.9 ge 0.6365 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:13] RECOVERY - HDFS DataNode JVM Heap usage on analytics1044 is OK: (C)0.95 ge (W)0.9 ge 0.3213 
https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:14] RECOVERY - HDFS DataNode JVM Heap usage on analytics1048 is OK: (C)0.95 ge (W)0.9 ge 0.7637 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:22] RECOVERY - HDFS DataNode JVM Heap usage on analytics1065 is OK: (C)0.95 ge (W)0.9 ge 0.6302 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:49:33] RECOVERY - HDFS DataNode JVM Heap usage on analytics1073 is OK: (C)0.95 ge (W)0.9 ge 0.5265 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:52:13] a-team: all these alarms were my fault, bad prometheus query (missing parenthesis.. sigh) [08:52:16] nothing bad happened [08:52:43] PROBLEM - Zookeeper node JVM Heap usage on druid1001 is CRITICAL: bad_data: parse error at char 170: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:52:52] PROBLEM - Zookeeper node JVM Heap usage on conf2002 is CRITICAL: bad_data: parse error at char 168: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:53:02] PROBLEM - Zookeeper node JVM Heap usage on druid1005 is CRITICAL: bad_data: parse error at char 170: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:53:03] PROBLEM - Zookeeper node JVM Heap usage on conf1005 is CRITICAL: bad_data: parse error at char 168: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:53:03] PROBLEM - Zookeeper node JVM Heap usage on druid1004 is CRITICAL: bad_data: parse error at char 170: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:53:12] PROBLEM - YARN active 
ResourceManager JVM Heap usage on an-master1002 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [08:53:17] sigh [08:53:32] PROBLEM - Zookeeper node JVM Heap usage on druid1003 is CRITICAL: bad_data: parse error at char 170: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:53:53] RECOVERY - HDFS DataNode JVM Heap usage on analytics1070 is OK: (C)0.95 ge (W)0.9 ge 0.7405 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:56:14] PROBLEM - Zookeeper node JVM Heap usage on druid1006 is CRITICAL: bad_data: parse error at char 170: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:58:03] PROBLEM - HDFS active Namenode JVM Heap usage on an-master1002 is CRITICAL: bad_data: parse error at char 246: unclosed left parenthesis https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1 [08:58:37] RECOVERY - HDFS active Namenode JVM Heap usage on an-master1001 is OK: (C)0.95 ge (W)0.9 ge 0.8098 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1 [08:58:47] RECOVERY - Zookeeper node JVM Heap usage on druid1003 is OK: (C)0.95 ge (W)0.9 ge 0.3874 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:58:48] RECOVERY - Zookeeper node JVM Heap usage on conf2003 is OK: (C)0.95 ge (W)0.9 ge 0.06884 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:58:58] RECOVERY - Zookeeper node JVM Heap usage on conf1006 is OK: (C)0.95 ge (W)0.9 ge 0.7651 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:58:58] RECOVERY - Zookeeper node JVM Heap usage on druid1001 is OK: (C)0.95 ge (W)0.9 ge 
0.3156 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:59:07] RECOVERY - Zookeeper node JVM Heap usage on conf2002 is OK: (C)0.95 ge (W)0.9 ge 0.1782 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:59:08] RECOVERY - HDFS active Namenode JVM Heap usage on an-master1002 is OK: (C)0.95 ge (W)0.9 ge 0.8026 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=4&fullscreen&orgId=1 [08:59:17] RECOVERY - Zookeeper node JVM Heap usage on druid1005 is OK: (C)0.95 ge (W)0.9 ge 0.2603 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:59:17] RECOVERY - Zookeeper node JVM Heap usage on druid1002 is OK: (C)0.95 ge (W)0.9 ge 0.4235 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:59:18] RECOVERY - Zookeeper node JVM Heap usage on conf1005 is OK: (C)0.95 ge (W)0.9 ge 0.6223 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:59:19] RECOVERY - Zookeeper node JVM Heap usage on druid1004 is OK: (C)0.95 ge (W)0.9 ge 0.3248 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [08:59:28] RECOVERY - HDFS DataNode JVM Heap usage on analytics1035 is OK: (C)0.95 ge (W)0.9 ge 0.3369 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [08:59:35] alleluia [08:59:41] ok we should be good now :D [08:59:44] sorry for the noise [09:01:17] RECOVERY - HDFS DataNode JVM Heap usage on analytics1046 is OK: (C)0.95 ge (W)0.9 ge 0.3134 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [09:01:58] RECOVERY - YARN active ResourceManager JVM Heap usage on an-master1002 is OK: (C)0.9 ge (W)0.7 ge 0.2532 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [09:06:27] PROBLEM - YARN NodeManager JVM Heap 
usage on analytics1029 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:27] PROBLEM - YARN NodeManager JVM Heap usage on analytics1053 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:27] PROBLEM - YARN NodeManager JVM Heap usage on analytics1045 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:28] PROBLEM - YARN NodeManager JVM Heap usage on analytics1058 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:28] PROBLEM - YARN NodeManager JVM Heap usage on analytics1042 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:29] PROBLEM - YARN NodeManager JVM Heap usage on analytics1052 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:38] PROBLEM - YARN NodeManager JVM Heap usage on analytics1030 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:38] PROBLEM - YARN NodeManager JVM Heap usage on analytics1054 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:38] PROBLEM - YARN NodeManager JVM Heap usage on analytics1031 is CRITICAL: bad_data: parse 
error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:39] PROBLEM - YARN NodeManager JVM Heap usage on analytics1043 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:39] PROBLEM - YARN NodeManager JVM Heap usage on analytics1067 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:39] PROBLEM - YARN NodeManager JVM Heap usage on analytics1055 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:47] RECOVERY - YARN active ResourceManager JVM Heap usage on an-master1001 is OK: (C)0.9 ge (W)0.7 ge 0.4605 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1 [09:06:47] RECOVERY - HDFS DataNode JVM Heap usage on analytics1063 is OK: (C)0.95 ge (W)0.9 ge 0.5751 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen&orgId=1 [09:06:48] PROBLEM - YARN NodeManager JVM Heap usage on analytics1077 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:57] PROBLEM - YARN NodeManager JVM Heap usage on analytics1036 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:06:57] PROBLEM - YARN NodeManager JVM Heap usage on analytics1050 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen 
[09:06:58] PROBLEM - YARN NodeManager JVM Heap usage on analytics1047 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:07] PROBLEM - YARN NodeManager JVM Heap usage on analytics1040 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:07] PROBLEM - YARN NodeManager JVM Heap usage on analytics1039 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:09] PROBLEM - YARN NodeManager JVM Heap usage on analytics1064 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:17] PROBLEM - YARN NodeManager JVM Heap usage on analytics1033 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:17] PROBLEM - YARN NodeManager JVM Heap usage on analytics1066 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:17] PROBLEM - YARN NodeManager JVM Heap usage on analytics1076 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:19] PROBLEM - YARN NodeManager JVM Heap usage on analytics1034 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:19] PROBLEM - YARN NodeManager JVM Heap usage on 
analytics1075 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:27] PROBLEM - YARN NodeManager JVM Heap usage on analytics1059 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:28] PROBLEM - YARN NodeManager JVM Heap usage on analytics1072 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:07:28] PROBLEM - YARN NodeManager JVM Heap usage on analytics1074 is CRITICAL: bad_data: parse error at char 121: missing unit character in duration https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:09:10] this is a nightmare [09:13:08] RECOVERY - Zookeeper node JVM Heap usage on druid1006 is OK: (C)0.95 ge (W)0.9 ge 0.1606 https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=40&fullscreen [09:22:52] RECOVERY - YARN NodeManager JVM Heap usage on analytics1029 is OK: (C)0.95 ge (W)0.9 ge 0.5684 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:52] RECOVERY - YARN NodeManager JVM Heap usage on analytics1053 is OK: (C)0.95 ge (W)0.9 ge 0.7334 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:52] RECOVERY - YARN NodeManager JVM Heap usage on analytics1045 is OK: (C)0.95 ge (W)0.9 ge 0.5852 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:53] RECOVERY - YARN NodeManager JVM Heap usage on analytics1058 is OK: (C)0.95 ge (W)0.9 ge 0.7026 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:53] RECOVERY - YARN NodeManager JVM Heap usage on analytics1071 is 
OK: (C)0.95 ge (W)0.9 ge 0.627 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:54] RECOVERY - YARN NodeManager JVM Heap usage on analytics1042 is OK: (C)0.95 ge (W)0.9 ge 0.4416 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:54] RECOVERY - YARN NodeManager JVM Heap usage on analytics1052 is OK: (C)0.95 ge (W)0.9 ge 0.4639 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:55] RECOVERY - YARN NodeManager JVM Heap usage on analytics1051 is OK: (C)0.95 ge (W)0.9 ge 0.7832 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:22:55] RECOVERY - YARN NodeManager JVM Heap usage on analytics1049 is OK: (C)0.95 ge (W)0.9 ge 0.5344 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:02] RECOVERY - YARN NodeManager JVM Heap usage on analytics1032 is OK: (C)0.95 ge (W)0.9 ge 0.7666 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:02] RECOVERY - YARN NodeManager JVM Heap usage on analytics1030 is OK: (C)0.95 ge (W)0.9 ge 0.7621 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:03] RECOVERY - YARN NodeManager JVM Heap usage on analytics1054 is OK: (C)0.95 ge (W)0.9 ge 0.4623 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:03] RECOVERY - YARN NodeManager JVM Heap usage on analytics1031 is OK: (C)0.95 ge (W)0.9 ge 0.4111 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:12] RECOVERY - YARN NodeManager JVM Heap usage on analytics1067 is OK: (C)0.95 ge (W)0.9 ge 0.6829 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:12] RECOVERY - YARN NodeManager JVM Heap usage on analytics1043 is OK: (C)0.95 ge (W)0.9 ge 0.7148 
https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:12] RECOVERY - YARN NodeManager JVM Heap usage on analytics1055 is OK: (C)0.95 ge (W)0.9 ge 0.5133 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:12] RECOVERY - YARN NodeManager JVM Heap usage on analytics1044 is OK: (C)0.95 ge (W)0.9 ge 0.6735 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:14] RECOVERY - YARN NodeManager JVM Heap usage on analytics1077 is OK: (C)0.95 ge (W)0.9 ge 0.4783 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:22] RECOVERY - YARN NodeManager JVM Heap usage on analytics1036 is OK: (C)0.95 ge (W)0.9 ge 0.4453 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:22] RECOVERY - YARN NodeManager JVM Heap usage on analytics1050 is OK: (C)0.95 ge (W)0.9 ge 0.694 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:23] RECOVERY - YARN NodeManager JVM Heap usage on analytics1047 is OK: (C)0.95 ge (W)0.9 ge 0.6143 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:24] RECOVERY - YARN NodeManager JVM Heap usage on analytics1046 is OK: (C)0.95 ge (W)0.9 ge 0.5052 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:24] RECOVERY - YARN NodeManager JVM Heap usage on analytics1057 is OK: (C)0.95 ge (W)0.9 ge 0.538 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:32] RECOVERY - YARN NodeManager JVM Heap usage on analytics1040 is OK: (C)0.95 ge (W)0.9 ge 0.5339 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:33] RECOVERY - YARN NodeManager JVM Heap usage on analytics1035 is OK: (C)0.95 ge (W)0.9 ge 0.5824 
https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:33] RECOVERY - YARN NodeManager JVM Heap usage on analytics1039 is OK: (C)0.95 ge (W)0.9 ge 0.4456 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:33] RECOVERY - YARN NodeManager JVM Heap usage on analytics1060 is OK: (C)0.95 ge (W)0.9 ge 0.7672 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:33] RECOVERY - YARN NodeManager JVM Heap usage on analytics1073 is OK: (C)0.95 ge (W)0.9 ge 0.4606 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:34] RECOVERY - YARN NodeManager JVM Heap usage on analytics1061 is OK: (C)0.95 ge (W)0.9 ge 0.436 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:43] RECOVERY - YARN NodeManager JVM Heap usage on analytics1064 is OK: (C)0.95 ge (W)0.9 ge 0.4487 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:43] RECOVERY - YARN NodeManager JVM Heap usage on analytics1033 is OK: (C)0.95 ge (W)0.9 ge 0.7849 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:43] RECOVERY - YARN NodeManager JVM Heap usage on analytics1065 is OK: (C)0.95 ge (W)0.9 ge 0.5688 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:44] RECOVERY - YARN NodeManager JVM Heap usage on analytics1076 is OK: (C)0.95 ge (W)0.9 ge 0.6862 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:23:44] RECOVERY - YARN NodeManager JVM Heap usage on analytics1066 is OK: (C)0.95 ge (W)0.9 ge 0.7965 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:24:32] RECOVERY - YARN NodeManager JVM Heap usage on analytics1056 is OK: (C)0.95 ge (W)0.9 ge 0.6464 
https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:25:32] \o/! seems we are ok :) [09:25:50] sorry for the noise, not having a linter for these changes causes this damages :( [09:26:13] RECOVERY - YARN NodeManager JVM Heap usage on analytics1034 is OK: (C)0.95 ge (W)0.9 ge 0.6666 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:26:14] RECOVERY - YARN NodeManager JVM Heap usage on analytics1059 is OK: (C)0.95 ge (W)0.9 ge 0.4821 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:26:21] the idea is to have an average over the past hour of the heap usage / heap max [09:26:22] RECOVERY - YARN NodeManager JVM Heap usage on analytics1075 is OK: (C)0.95 ge (W)0.9 ge 0.5402 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:26:32] since it is very bumpy [09:26:46] the first version was leading to false positives [09:26:52] so I added a new improved version [09:26:56] without a ) and a m [09:27:00] sigh [09:27:31] this is the problem with bots, they don't know how to do stuff in their head [09:29:53] RECOVERY - YARN NodeManager JVM Heap usage on analytics1070 is OK: (C)0.95 ge (W)0.9 ge 0.6099 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:33:33] RECOVERY - YARN NodeManager JVM Heap usage on analytics1062 is OK: (C)0.95 ge (W)0.9 ge 0.5244 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:35:42] 10Analytics, 10User-Banyek: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 (10Banyek) Actually I was thinking on closing the task as we have 1,4 T free space now. Maybe before that just dropping the commonswiki_test_T177772 database with the recentchanges table which would give us a... 
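The `bad_data: parse error at char 121: missing unit character in duration` storm above comes from a Prometheus range selector written without a time unit: elukey notes the new averaged-over-an-hour expression was missing a `)` and an `m`. The rule can be sketched as follows (an assumed regex illustration, not Prometheus's actual parser, which also accepts compound durations in later versions):

```python
import re

# A PromQL range-selector duration must be a number followed by a unit
# (ms, s, m, h, d, w, y). A bare number like [60] produces
# "missing unit character in duration".
DURATION_RE = re.compile(r'^\d+(ms|[smhdwy])$')

def invalid_durations(query):
    """Return the durations inside [...] selectors that lack a unit."""
    durations = re.findall(r'\[([^\]]+)\]', query)
    return [d for d in durations if not DURATION_RE.match(d)]

# Metric names here are illustrative; the intent (per the chat) is an
# average over the past hour of heap used / heap max:
good = 'avg_over_time(jvm_heap_used[1h]) / avg_over_time(jvm_heap_max[1h])'
bad = 'avg_over_time(jvm_heap_used[1])'  # unit character missing

assert invalid_durations(good) == []
assert invalid_durations(bad) == ['1']
```

A linter doing no more than this check would have caught the bad expression before it paged on every NodeManager, which is exactly the "not having a linter" complaint above.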
[09:36:49] 10Analytics, 10User-Banyek: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 (10Marostegui) +1 to close. It is actually not a bad idea to leave that big DB as a safety net, so we have stuff to drop if this host complains again about disk space :-) [09:38:22] 10Analytics, 10User-Banyek: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 (10Banyek) 05Open>03Resolved Yes, that makes sense. I close the task now, we can reopen it when needed. [09:39:12] RECOVERY - YARN NodeManager JVM Heap usage on analytics1074 is OK: (C)0.95 ge (W)0.9 ge 0.6247 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:42:43] RECOVERY - YARN NodeManager JVM Heap usage on analytics1048 is OK: (C)0.95 ge (W)0.9 ge 0.6103 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:42:43] RECOVERY - YARN NodeManager JVM Heap usage on analytics1063 is OK: (C)0.95 ge (W)0.9 ge 0.5742 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [09:44:33] RECOVERY - YARN NodeManager JVM Heap usage on analytics1072 is OK: (C)0.95 ge (W)0.9 ge 0.5455 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen [10:14:29] TIL: du seems not showing . files by default [10:16:13] ok notebook1003 back in shape [10:16:31] (/srv dir not filled up anymore thanks to Diego!) [10:20:17] joal: is https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/468381/ enough for refinery source? 
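On the `du` surprise above: `du` itself does count dotfiles when pointed at a directory; it is the unquoted shell glob in `du -sh *` that skips hidden entries. A quick Python illustration of summing sizes with dotfiles included (a sketch, not a `du` replacement):

```python
import os
import tempfile

def tree_size(path):
    """Total size in bytes of regular files under path, dotfiles included.
    os.walk lists hidden entries, unlike a shell glob such as `du *`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

# Dotfiles contribute to the total:
d = tempfile.mkdtemp()
with open(os.path.join(d, '.hidden'), 'w') as f:
    f.write('x' * 100)
assert tree_size(d) == 100
```

The shell-side equivalent of "see everything" is pointing `du` at the directory (`du -sh /srv`) rather than at a glob of its contents.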
[10:20:27] (not planning any deployment, just wondering) [10:21:19] elukey: double checking [10:23:03] * elukey waits for the -1 [10:24:55] actually, nope - everything fine elukey :) [10:25:04] \o/ [10:25:16] I wondered if the global property was used correctly, and it seems so :) [10:29:52] (03CR) 10Joal: [C: 031] "LGTM :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468381 (owner: 10Elukey) [10:29:57] * elukey dances [10:29:59] heya teaammm [10:30:02] o/ [10:30:18] why you dancing elukey? :] [10:31:24] mforns: I got a +1 from Joseph at first try [10:31:28] achievement unlocked [10:31:34] (for refinery source) [10:32:29] yea [10:35:07] * joal looks for some music for elukey - https://www.youtube.com/watch?v=H_CenvaDGm0 [10:37:37] joal: do you have an example of webrequest indexation for druid that I can use on the fly to test how the druid nodes are doing? [10:37:47] (labs I mean) [10:38:23] elukey: batch indexation right? [10:38:57] yep yep [10:39:02] (nice music btw :) [10:40:04] elukey: I can find one yes :) [10:40:33] super thanks :) [10:42:26] elukey: I assume data size will be minimal? 
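For the batch indexation being discussed here, a Druid Hadoop batch task (type `index_hadoop`) is submitted to the overlord. The skeleton below shows the general shape of such a spec; the datasource name, paths, and intervals are illustrative (joal's actual gist contents are not reproduced), though the HDFS path matches the webrequest partition mentioned later in the chat:

```python
# Skeleton of a Druid "index_hadoop" batch ingestion task. Field names
# follow Druid's Hadoop indexing spec; concrete values are illustrative.
task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "test_webrequest",
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
                "intervals": ["2018-05-18/2018-05-19"],
            },
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                "type": "static",
                "paths": ("/wmf/data/wmf/webrequest/webrequest_source=text"
                          "/year=2018/month=5/day=18"),
            },
        },
    },
}

assert task["type"] == "index_hadoop"
```

Note how the chat later confirms a pitfall with this kind of spec: the `dataSource` that actually got indexed was `webrequest`, not `test_webrequest`, because the value in the task file wins.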
[10:43:23] I think so yes [10:43:36] camus works, just checked, but not sure about how much data it gathers [10:43:47] k [10:44:02] elukey: I think it depends how many fake web-calls are made [10:44:05] Let's check [10:45:37] the brokers (k4-1.analytics.eqiad.wmflabs and k4-2.analytics.eqiad.wmflabs) are up [10:46:13] ahahah it still contains all my fake topics [10:46:13] snap [10:46:17] I need to clean them up [10:48:57] I am deleting those in the meantime [10:49:11] sure [10:49:18] checking data presence as well [10:51:25] elukey: data exists for past: /wmf/data/wmf/webrequest/webrequest_source=text/year=2018/month=5/day=18/ [10:51:28] for instance [10:52:53] so it is just a matter of sending events to kafka [10:53:09] or just index those [10:53:16] (the ones already there) [10:53:56] For batch we can index the ones already there [10:56:25] elukey: https://gist.github.com/jobar/4851717ba74b1540bce217c3505a1f9c [10:56:33] elukey: not tested, but shouldn't be far from ok [10:57:08] <3 [11:09:53] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10mobrovac) >>! In T207329#4679089, @Pchelolo wrote: > The above patch should mitigate the problem, however, we need to also ac... [11:22:25] (03PS1) 10Mforns: Fix bug in EventLoggingToDruid, add time measures as dimensions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) [11:22:47] (03CR) 10Fdans: [V: 031] "Nuria: all the dependencies that were updated affect only the testing part of the project, and tests run fine. Also this fixes the build o" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/467733 (https://phabricator.wikimedia.org/T206474) (owner: 10Fdans) [11:25:37] (03CR) 10Mforns: "I tested this with real data (navigationtiming) and it works (adds new time measure dimensions)." 
(032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [11:49:36] (03PS2) 10Joal: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) [11:49:57] (03CR) 10Joal: [V: 031] "Tested on cluster" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [11:50:46] looking at EL error throughput makes me so sad :( [11:51:30] (03CR) 10Elukey: [C: 032] Upgrade camus-wmf dependency to camus-wmf9 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468381 (owner: 10Elukey) [11:52:12] I am trying to deploy turnilo in labs, and the labs deployment server is broken.. [12:24:02] PROBLEM - Throughput of EventLogging EventError events on einsteinium is CRITICAL: 410.7 ge 30 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1 [12:27:25] :( [12:29:26] 407??? [12:29:41] the main issue is that we need to wait for a deployment.. [12:29:42] sigh [12:29:46] elukey: it keeps rising [12:30:25] elukey: Im assuming there have been a deploy yesterday, right? [12:31:06] joal: I am not sure if the thing is part of the mediawiki train or not [12:31:34] elukey: it has started yesterday night - Must have through either a deploy or a config change that can be reverted, no? 
[12:32:38] joal: I was under the impression that https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/468486/ was the fix [12:33:56] seems so elukey - Must have been a deploy yesterday [12:35:19] elukey: we could devise a patch for EL to actually not send that error to error schema for the time being, but it's a hack :( [12:35:49] there was https://tools.wmflabs.org/sal/production?d=2018-10-18 (last deploy at ~21:49 UTC) [12:37:08] It correspond to the alert-time elukey - Thanks for the triple check [12:47:09] ok so I may have a fix for the deployment-server in labs but I need to wait for reviews [12:47:13] will try to index now [12:57:11] (03CR) 10Matthias Mullie: [C: 032] Removing error messages from whitelist for schema UploadWizardExceptionFlowEvent [analytics/refinery] - 10https://gerrit.wikimedia.org/r/467440 (https://phabricator.wikimedia.org/T136851) (owner: 10Nuria) [12:57:15] elukey: let me know if I can help [13:01:14] it seems succeeding (the datasource was 'webrequest' not 'test_webrequest') [13:01:26] but I get this from the middle manager [13:01:27] 2018-10-19T12:59:42,026 INFO org.apache.hadoop.mapreduce.Job: Running job: job_1536235072238_10961 [13:01:31] 2018-10-19T12:59:51,244 INFO org.apache.hadoop.mapreduce.Job: Job job_1536235072238_10961 running in uber mode : false [13:01:34] 2018-10-19T12:59:51,246 INFO org.apache.hadoop.mapreduce.Job: map 0% reduce 0% [13:01:37] 2018-10-19T13:00:00,703 INFO org.apache.hadoop.mapreduce.Job: Task Id : attempt_1536235072238_10961_m_000000_0, Status : FAILED [13:01:40] 2018-10-19T13:00:13,825 INFO org.apache.hadoop.mapreduce.Job: map 100% reduce 0% [13:01:43] 2018-10-19T13:00:22,890 INFO org.apache.hadoop.mapreduce.Job: map 100% reduce 100% [13:01:46] 2018-10-19T13:00:22,901 INFO org.apache.hadoop.mapreduce.Job: Job job_1536235072238_10961 completed successfully [13:01:49] 2018-10-19T13:00:23,035 INFO org.apache.hadoop.mapreduce.Job: Counters: 54 [13:04:22] so one mapper failed with Error: 
NULL_VALUE [13:04:58] 2018-10-19T13:00:00,395 ERROR [main] org.apache.hadoop.mapred.YarnChild - Error running child : java.lang.NoSuchFieldError: NULL_VALUE at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:245) [13:12:01] elukey: the hadoop job failed but indexation succeeded? I'm surprised :) [13:21:33] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10Ottomata) Verified microseconds is fine with python jsonschema. I also checked Camus, which uses `[[ http://joda-time.source... [13:23:54] (03PS4) 10Joal: Add oozie job partitioning webrequest subset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/357814 (https://phabricator.wikimedia.org/T164020) [13:26:21] (03CR) 10Ottomata: [C: 031] Fix bug in EventLoggingToDruid, add time measures as dimensions (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [13:28:57] (03CR) 10Ottomata: Add oozie job partitioning webrequest subset (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/357814 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [13:30:55] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics, 10iOS-app-v6.1-Narwhal-On-A-Bumper-Car: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10NHarateh_WMF) @chelsyx this should be fixed when htt... 
[13:32:58] (03PS3) 10Joal: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) [13:42:08] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10kostajh) @Pchelolo @Ottomata and @mobrovac thank you for tracking this down and working on it. Should our team plan to verify... [13:47:46] 10Analytics, 10Operations, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns) a:05elukey>03mforns [14:04:10] joal: here I am, yes it is kinda weird, trying to figure out why.. could it be related to weird data on hdfs? [14:43:44] (03CR) 10Mforns: Fix bug in EventLoggingToDruid, add time measures as dimensions (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [14:46:21] 10Analytics, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10AndyRussG) @Nuria, @JAllemandou thanks so much taking the time to check this out, much appreciated!!! We can di... [14:56:43] 10Analytics, 10Operations, 10ops-eqiad: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) @elukey Finally, they agreed to replace the mother board. This should happen Monday or Tues next week. [14:59:54] 10Analytics, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) @AndyRussG Can you answer these questions: * are all browsers receiving banners? * are only js-enabled... 
[15:03:56] I'm getting an error running "mvn test" that makes me feel I have some bad versions of something, but I tried mvn clean and it doesn't work [15:03:57] java.lang.NoClassDefFoundError: com/holdenkarau/spark/testing/SharedSparkContext [15:04:15] this is definitely not me then :) [15:04:18] there was some other more brutal way to clean that joal told me at some point... can't remember, but this time I'll put it in the readme [15:09:54] joal: found the problem, I had messed up in the deb package with parquet libs -.- [15:10:04] removed the rouge ones on the host, index fine :) [15:10:11] I am re-building the package no [15:10:14] *now [15:16:03] milimetric: you can rm -rf ~/.mvn local cache [15:17:28] doing that now, but needing to do it means some versioning is messed up in our poms [15:17:30] (03CR) 10Nuria: Memoizing results of state functions (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [15:19:11] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns) [15:22:24] (03CR) 10Fdans: Set the active filter correctly on breakdowns mount (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) (owner: 10Fdans) [15:23:05] (03PS4) 10Fdans: Set the active filter correctly on breakdowns mount [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) [15:23:17] (03CR) 10Milimetric: Memoizing results of state functions (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [15:24:05] (03CR) 10jerkins-bot: [V: 04-1] Set the active filter correctly on breakdowns mount [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 
(https://phabricator.wikimedia.org/T206822) (owner: 10Fdans) [15:24:55] what now [15:26:04] milimetric: also make sure you have 1.8 as java vs [15:26:26] milimetric: that is what sets your local cache versions [15:26:52] milimetric: issue could be with java vs , not poms per se [15:27:00] yeah, javac 1.8.0_181 but I still get the same error after rm -rf ~/.m2 [15:27:10] ottomata: re: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467648/ - we have 90d on stat1005 of logs no? And now 15d on eventlog1002, plus the camus importing data.. I thought it was fine, am I missing something? [15:27:17] (plus the srv partition was filled up again :( [15:27:19] (03PS5) 10Fdans: Set the active filter correctly on breakdowns mount [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) [15:28:26] milimetric: trying to repro your issue [15:28:26] people hate on npm, but you pick up a node project 4 years later and it builds. Python's ok for like 1 year, Java's ok for like 1 month, and Ruby's ok for like 2 days. [15:35:33] weird, I ran mvn test, it downloaded a bunch of stuff, failed. Now I'm running mvn verify, it's downloading more stuff [15:36:10] got a better error at least: [15:36:11] Could not resolve dependencies for project org.wikimedia.analytics.refinery.spark:refinery-spark:jar:0.0.79-SNAPSHOT: Could not find artifact org.wikimedia.analytics.refinery.core:refinery-core:jar:0.0.79-SNAPSHOT in wmf-mirrored (https://archiva.wikimedia.org/repository/mirrored/) [15:37:52] mvn compile downloads even more... 
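The `java.lang.NoSuchFieldError: NULL_VALUE` in `AvroSchemaConverter` that elukey traced to stray parquet libs in the deb package is the classic symptom of two versions of the same artifact on the classpath: the class is loaded from an older jar that predates the field. A small sketch (hypothetical helper, not an actual tool from the chat) that flags duplicate jar versions in a lib directory:

```python
import os
import re
import tempfile
from collections import defaultdict

# Jars are conventionally named <artifact>-<version>.jar, with the
# version starting at the first digit after a hyphen.
JAR_RE = re.compile(r'^(?P<artifact>[A-Za-z][\w.-]*?)-(?P<version>\d[\w.-]*)\.jar$')

def jar_version_conflicts(libdir):
    """Map artifact -> sorted versions, for artifacts present in >1 version."""
    versions = defaultdict(set)
    for name in os.listdir(libdir):
        m = JAR_RE.match(name)
        if m:
            versions[m.group('artifact')].add(m.group('version'))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}

# Demo: a duplicated parquet-avro jar is reported, a single version is not.
d = tempfile.mkdtemp()
for jar in ('parquet-avro-1.8.1.jar', 'parquet-avro-1.9.0.jar',
            'refinery-core-0.0.79.jar'):
    open(os.path.join(d, jar), 'w').close()
assert jar_version_conflicts(d) == {'parquet-avro': ['1.8.1', '1.9.0']}
```

Separately, milimetric's `Could not find artifact ...refinery-core:jar:0.0.79-SNAPSHOT` error is a different failure mode: SNAPSHOT versions are never published to the mirrored remote, so sibling modules resolve only after a build from the parent pom installs them locally.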
[15:38:52] (03CR) 10Mforns: "I lean towards Dan's idea," [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [15:39:33] 10Analytics, 10Analytics-Cluster, 10Contributors-Analysis, 10Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (10fdans) Info added in wikitech for future reference! [15:39:43] 10Analytics, 10Analytics-Cluster, 10Contributors-Analysis, 10Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (10fdans) a:05fdans>03None [15:52:28] (03CR) 10Mforns: "This table is only partitioned by snapshot." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans) [15:53:21] 10Analytics, 10Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (10Milimetric) [15:53:22] (03CR) 10Fdans: [V: 031] "Mforns Nuria: yep already tested with dry run" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans) [15:53:38] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965 (10Milimetric) [15:53:50] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0 Remaining reports. 
- https://phabricator.wikimedia.org/T186121 (10Milimetric) [15:54:00] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Beta - https://phabricator.wikimedia.org/T186120 (10Milimetric) [15:54:12] 10Analytics, 10Analytics-Kanban: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (10Milimetric) [15:55:23] (03CR) 10Fdans: [C: 031] Add mediawiki-history-wikitext oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/463548 (https://phabricator.wikimedia.org/T202490) (owner: 10Joal) [15:55:56] (03CR) 10Mforns: [C: 032] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans) [15:56:12] thank youuu mforns [15:56:23] npppp :] [15:56:42] 10Analytics, 10Analytics-Kanban, 10MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), 10Patch-For-Review: Improve Dashiki extension messaging - https://phabricator.wikimedia.org/T205644 (10Milimetric) The first task that I self-merged is now deployed: https://meta.wikimedia.org/wiki/Config:Dashiki:Anno... [16:00:08] milimetric, ottomata , fdans : standddupppp [16:00:27] nuria: someone's on a rush! [16:00:38] fdans: someone has no watch! [16:03:42] i sent an e scrum! 
[16:03:50] nuria: [16:03:52] ^^ [16:04:56] (03CR) 10Nuria: [V: 032 C: 032] Removing error messages from whitelist for schema UploadWizardExceptionFlowEvent [analytics/refinery] - 10https://gerrit.wikimedia.org/r/467440 (https://phabricator.wikimedia.org/T136851) (owner: 10Nuria) [16:05:07] 10Analytics, 10Analytics-Kanban: Update datasets to have explicit timestamp for druid indexation facilitation - https://phabricator.wikimedia.org/T205617 (10Milimetric) a:03fdans [16:05:32] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Update to cloudera 5.15 - https://phabricator.wikimedia.org/T204759 (10Milimetric) a:03elukey [16:06:39] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Presto cluster online and usable with test data pushed from analytics prod infrastructure accessible by Cloud (labs) users - https://phabricator.wikimedia.org/T204951 (10Milimetric) a:03Ottomata [16:08:14] ottomata: there's a special script thing that sets up beeline for users on the stat machines, right? a system user that can use hive wouldn't automatically be able to use beeline, correct? 
[16:11:28] (03CR) 10Fdans: [V: 032] Add mediawiki_history_reduced to list of tables to drop snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans) [16:12:02] bearloga: hi - You can find the default options set on stat machine via 'DEFAULT_OPTIONS = {'-n': os.environ['USER'],' [16:12:05] '-u': 'jdbc:hive2://an-coord1001.eqiad.wmnet:' + [16:12:05] woo [16:12:08] '10000', [16:12:10] '--outputformat': 'tsv2', } [16:12:24] I meant 'cat /usr/local/bin/beeline' bearloga, on a stat machine [16:12:28] sorry for the spam :) [16:13:57] (03PS2) 10Fdans: Add mediawiki_history_reduced to list of tables to drop snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) [16:14:04] (03CR) 10Fdans: [V: 032] Add mediawiki_history_reduced to list of tables to drop snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans) [16:14:45] joal: I was wondering because we tried to switch some queries that are run via reportupdater (by analytics-search system user) from hive to beeline and chelsyx had issues, so my best guess is that the system user isn't by default setup to use beeline, just hive [16:15:06] 10Analytics-Kanban: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users - https://phabricator.wikimedia.org/T204950 (10Milimetric) [16:15:12] 10Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (10Milimetric) [16:15:16] 10Analytics-Kanban: reportupdater TLC - https://phabricator.wikimedia.org/T193167 (10Milimetric) [16:15:20] 10Analytics-Kanban: Enable automatic ingestion from eventlogging into druid for some schemas - https://phabricator.wikimedia.org/T190855 (10Milimetric) [16:15:21] correct- acutally bearloga, we suggest not to use beeline [16:15:42] 
10Analytics-Kanban: Raise Edit Data Quality to the point where we can offer snapshots on Cloud (labs) environment - https://phabricator.wikimedia.org/T204953 (10Milimetric) [16:15:44] 10Analytics-EventLogging, 10Analytics-Kanban: EventLogging sanitization - https://phabricator.wikimedia.org/T199898 (10Milimetric) [16:15:54] With the version of hive we have, it is still not as good as hive bare client [16:15:59] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Milimetric) [16:16:08] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services (watching): Modern Event Platform: Event Schema Registry - https://phabricator.wikimedia.org/T201063 (10Milimetric) [16:16:18] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services (watching): Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201068 (10Milimetric) [16:16:27] 10Analytics-Kanban, 10User-Elukey: Q1 2018/19 Analytics procurement - https://phabricator.wikimedia.org/T198694 (10Milimetric) [16:16:29] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965 (10Milimetric) [16:16:34] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0 Remaining reports. 
- https://phabricator.wikimedia.org/T186121 (10Milimetric) [16:16:38] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Beta - https://phabricator.wikimedia.org/T186120 (10Milimetric) [16:16:42] 10Analytics-Kanban: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (10Milimetric) [16:17:02] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (10Milimetric) [16:17:20] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Reading-analysis: Final Vetting of Family Wide unique devices data - https://phabricator.wikimedia.org/T169550 (10Milimetric) [16:17:24] joal: what if we have to use beeline for certain queries because otherwise there are a bunch of messages that hive outputs that aren't caught by -S or current grep filters [16:17:27] 10Analytics, 10Analytics-Kanban: Quantify volume of traffic on piwik with DNT header set - https://phabricator.wikimedia.org/T199928 (10Milimetric) [16:17:39] 10Analytics, 10Analytics-Kanban: [EL sanitization] Write and productionize script to drop partitions older than 90 days in events database - https://phabricator.wikimedia.org/T199836 (10Milimetric) [16:17:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Drop old mediawiki_history_reduced snapshots - https://phabricator.wikimedia.org/T197888 (10Milimetric) [16:17:58] bearloga: I have heard of that yes, but you might run into other issues with beeline :( [16:18:02] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Use spark to split webrequest on tags - https://phabricator.wikimedia.org/T164020 (10Milimetric) [16:19:14] joal: we'd use it on a per-query basis after checking that beeline runs it fine. I mean, if you're suggesting not to use beeline ever at all why even have it around? 
[16:19:58] bearloga: we never removed it - We actually configured it to work and wanted to follow the advice of moving out of it [16:20:37] bearloga: however we ran into more and more errors as people started using it - particularly due to memory-errors on local-join tasks (small memory for hive-server) [16:21:06] bearloga: and no real decision has been made on removing beeline [16:21:14] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (10Milimetric) a:03Milimetric [16:21:28] 10Analytics, 10Analytics-Kanban: Set a timeout for regex parsing in the Eventlogging processors - https://phabricator.wikimedia.org/T200760 (10Milimetric) a:03Milimetric [16:23:01] joal: got it. in that case, is there any way to have hive be less annoying with its output? running every query through a dozen greps to filter out unnecessary output seems…sub-optimal [16:23:22] bearloga: indeed!!! [16:23:39] bearloga: can you send me the query? I think the logs are related to parquet... [16:25:44] 10Analytics, 10Analytics-Kanban: Quantify volume of traffic on piwik with DNT header set - https://phabricator.wikimedia.org/T199928 (10Milimetric) p:05Triage>03High [16:25:55] 10Analytics, 10Analytics-Kanban: [EL sanitization] Write and productionize script to drop partitions older than 90 days in events database - https://phabricator.wikimedia.org/T199836 (10Milimetric) p:05Triage>03High [16:26:12] * elukey off! 
[16:26:28] wait nuria you said you'd babysit me :) [16:27:29] (03CR) 10Milimetric: Set the active filter correctly on breakdowns mount (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) (owner: 10Fdans) [16:27:41] (03CR) 10Milimetric: "/me just trying to ruin Fran's Friday :)" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) (owner: 10Fdans) [16:28:27] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Services (watching): Prototype in node intake service - https://phabricator.wikimedia.org/T206815 (10Ottomata) Proceeding! https://github.com/ottomata/eventbus Need to move to gerrit. [16:29:14] Yay! oozie job partitioning webrequests ! [16:30:19] 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka-jumbo-1 - https://phabricator.wikimedia.org/T207489 (10Krenair) a:03Krenair Found this in the prefix config for deployment-kafka-jumbo: `profile::kafka::broker::monitoring::replica_maxlag_warning: '1000'`, changed it to remove the... [16:31:45] 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka-jumbo-1 - https://phabricator.wikimedia.org/T207489 (10Ottomata) Thanks! [16:32:57] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 6 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10Pchelolo) @kostajh If you have time for that it would be perfect. I admit, I don't have any idea how to test this. Thank you... [16:33:03] (03PS6) 10Joal: Update DataFrameToHive for dynamic partitions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465202 (https://phabricator.wikimedia.org/T164020) [16:33:06] milimetric: yes! 
[16:33:06] (03PS7) 10Joal: Add webrequest_subset_tags transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465206 (https://phabricator.wikimedia.org/T164020) [16:33:08] (03PS4) 10Joal: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020) [16:33:08] bc? [16:33:14] 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka-jumbo-1 - https://phabricator.wikimedia.org/T207489 (10Krenair) 05Open>03Resolved puppet runs again [16:33:27] milimetric: batcave? [16:33:36] yep, I'm there [16:34:12] (03PS5) 10Joal: Add oozie job partitioning webrequest subset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/357814 (https://phabricator.wikimedia.org/T164020) [16:35:06] 10Analytics, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka-jumbo-1 - https://phabricator.wikimedia.org/T207489 (10Krenair) (I think -2 was also affected but seems fine now) [16:35:48] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 6 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10kostajh) It's easy enough for me to see if running "clear watchlist" on my enwiki account works :) @Etonkovidova may want to... 
[16:36:33] (03CR) 10Joal: [V: 031] "Tested on cluster" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/357814 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [16:38:07] joal: I pinged chelsyx to send you the query she was getting all the extra output from that made her want to switch to beeline [16:38:42] ok bearloga - depending on the hour and knowing I'll be off Monday, it might only be Tuesday that I look at it :) [16:39:14] no problem joal :) [16:44:04] (03PS6) 10Fdans: Set the active filter correctly on breakdowns mount [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) [16:49:01] ottomata: do you know where/why intellij doesn't find the org.wikimedia.analytics.schemas package? [16:49:40] milimetric: just to be sure - those are the ones in the refinery-camus project, right? [16:49:42] in the refinery.camus.coders? [16:49:45] yea [16:50:08] milimetric: It's because the code needs to be generated manually first [16:50:42] milimetric: IIRC the easiest is to build through maven CLI and look in the target folder for the generated classes [16:50:52] uh... is there a better way to do that? [16:51:16] so that it's automatic? [16:51:33] milimetric: not that I know [16:51:47] milimetric: maven does it, but I don't think intellij knows how to [16:53:53] milimetric: the java files can be found on stat1004:/home/joal/generated.tgz [16:55:33] the java files show up in IntelliJ, under src/generated [16:55:54] but it looks like you have to configure IntelliJ to recognize those as a "sources" folder? [16:56:00] (doing that now, will let you know) [16:59:54] milimetric: that sounds right... 
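[Editor's note: bearloga's complaint above is about piping every hive query through "a dozen greps" to strip log chatter. One way to consolidate that is a single filter over the CLI's stderr, sketched below. The noise patterns are hypothetical examples; joal suspects the real chatter comes from the parquet libraries, so they would need tuning against the actual log lines.]

```python
# Sketch: run `hive -S -e <query>` and drop noisy stderr lines with one
# combined regex instead of a chain of greps. Patterns are illustrative.
import re
import subprocess
import sys

NOISE = re.compile(
    r'^(WARN|INFO)\b'          # generic log-level chatter
    r'|parquet\.hadoop'        # hypothetical parquet logger prefix
    r'|^Logging initialized'   # hive startup banner
)

def run_hive(query: str) -> str:
    """Return the query's stdout; forward only non-noise stderr lines."""
    proc = subprocess.run(
        ['hive', '-S', '-e', query],
        capture_output=True, text=True, check=True,
    )
    for line in proc.stderr.splitlines():
        if not NOISE.search(line):
            sys.stderr.write(line + '\n')   # keep real warnings visible
    return proc.stdout
```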
[17:00:41] it worked ok to fix the build in IntelliJ, now it's giving me errors when I run all tests because it can't find some files like access_method_test_data.csv [17:00:57] there are also a lot of warnings, I'm going to spend some time and clean this up and update the README [17:03:44] milimetric: check that in your system ./refinery-core/target/test-classes/access_method_test_data.csv has read permissions for all [17:04:29] hm, good thought, but yeah it's got r for all [17:05:54] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics, 10iOS-app-v6.1-Narwhal-On-A-Bumper-Car: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10chelsyx) Thank you @NHarateh_WMF ! [17:09:15] (03CR) 10Ottomata: [C: 031] Fix bug in EventLoggingToDruid, add time measures as dimensions (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:15:38] a-team, I'm not feeling well this evening, got a strong cold, will stop for today... [17:16:34] byeee [17:19:30] no cold for me but stop nonetheless :) Have a good weekend team [17:19:30] joal: does your intellij have the same problem with these csv files? I changed one test to /home/milimetric/projects/refinery-source/refinery-core/src/test/resources/pageview_test_data.csv instead of src/test/resources/pageview_test_data.csv and it passed [17:19:39] oh nvm, good night joal [17:19:46] I have time milimetric :) [17:20:27] tests pass on command line but not in intellij, all because of these path issues, but that doesn't sound like a fun Friday night, joal, you should go [17:20:29] milimetric: are the test/resources folders recognized as resources in your projects? 
[17:20:37] I'll double check [17:21:31] they were recognized as Test Resources, changed to Resources to see [17:21:40] Test resources is same for me [17:22:00] yeah, java.io.FileNotFoundException: src/test/resources/pageview_test_data.csv (No such file or directory) [17:22:12] milimetric: in Paths tab, do you have "Use module compile output path" ? [17:22:34] joal: you know, nvm, I'm gonna blow out my .idea settings completely and start clean and document the steps I need to make it work in README [17:22:58] milimetric: this will for sure be helpful!! [17:23:12] ok, will do [17:23:39] have a nice weekend man [17:23:39] Gone for now then :) [17:23:46] ThYou too [17:23:53] Thanks, you too ... [18:20:40] (03CR) 10Nuria: [C: 032] Fix bug in EventLoggingToDruid, add time measures as dimensions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [18:20:52] (03CR) 10Nuria: [V: 032 C: 032] Fix bug in EventLoggingToDruid, add time measures as dimensions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [18:26:43] (03Merged) 10jenkins-bot: Fix bug in EventLoggingToDruid, add time measures as dimensions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468550 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [18:37:57] (03CR) 10Nuria: [V: 031 C: 032] Upgrade packages and commit package-lock to remove vulnerabilities [analytics/aqs] - 10https://gerrit.wikimedia.org/r/467733 (https://phabricator.wikimedia.org/T206474) (owner: 10Fdans) [18:43:06] 10Analytics, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) @AndyRussG I would look at EL data and see if any browser is notably missing from the events you have se... 
[18:56:27] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Services (watching): Prototype in node intake service - https://phabricator.wikimedia.org/T206815 (10Ottomata) Heya @Pchelolo. I'm feeling good about the general layout and architecture for this prototype. Would love to go over it with you and/or have... [19:31:52] (03PS4) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [19:32:11] (03CR) 10Milimetric: [C: 032] Set the active filter correctly on breakdowns mount [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) (owner: 10Fdans) [19:33:00] (03CR) 10Nuria: "Please see 3 independent caches given 3 usages. Let me know if this is what you were thinking." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [19:47:22] (03CR) 10Milimetric: [C: 04-1] Memoizing results of state functions (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [21:35:17] (03PS1) 10Milimetric: [WIP] working on understanding and testing page history and quality [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468678 [21:37:08] mmmmk, I feel like I'm starting to understand page history reconstruction again. So now I'm going to go away for the weekend and forget it all. Have a nice weekend everyone!! :) [21:55:36] 10Analytics, 10Analytics-Cluster, 10Contributors-Analysis, 10Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (10Neil_P._Quinn_WMF) @fdans thank you! Is it worth pursuing @joal's suggestion? ("it could interesting to try to raise HiveServer2 ava... 
[23:19:31] (03PS5) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [23:22:16] (03PS6) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [23:22:41] (03CR) 10Nuria: Memoizing results of state functions (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria)
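[Editor's note: Nuria's patch above memoizes three wikistats2 state functions with "3 independent caches given 3 usages". The real change is JavaScript; the sketch below shows the same pattern in Python with `functools.lru_cache`, where each decorated function automatically gets its own cache. The function names and return values are made up for illustration and do not come from the wikistats2 code.]

```python
# Sketch: three memoized state functions, each with an independent cache,
# so clearing or sizing one cache never affects the others.
# Names and values are hypothetical, not from wikistats2.
from functools import lru_cache

@lru_cache(maxsize=None)
def breakdowns(metric: str) -> tuple:
    return ('total',) + (('editor-type',) if metric == 'edits' else ())

@lru_cache(maxsize=None)
def time_range(metric: str) -> str:
    return 'monthly' if metric == 'edits' else 'daily'

@lru_cache(maxsize=None)
def area(metric: str) -> str:
    return 'contributing' if metric == 'edits' else 'reading'

breakdowns('edits')        # computed on first call
breakdowns('edits')        # served from breakdowns' own cache
time_range.cache_clear()   # clears only time_range's cache
```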