[05:44:55] 10Analytics, 10Analytics-Cluster, 10Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Dzahn)
[05:52:00] 10Analytics, 10Analytics-Cluster, 10Operations: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10Dzahn)
[07:06:50] Good mroning
[07:35:36] morning team
[08:18:36] 10Analytics, 10Analytics-Kanban: Sqoop: remove cuc_comment and join to comment table - https://phabricator.wikimedia.org/T217848 (10JAllemandou) Changing sqoop to join a base table to the `comment` one in the sqoop-SQL query has been tested for `mediawiki-history` and has lead to non-acceptable performance (to...
[11:19:52] (03PS4) 10Fdans: (wip) Add cassandra loading job for requests per file metric [analytics/refinery] - 10https://gerrit.wikimedia.org/r/533921 (https://phabricator.wikimedia.org/T228149)
[11:36:08] Voice - https://twitter.com/paul_haine/status/1168953153158598656
[12:53:00] 10Analytics, 10Analytics-Kanban: Cleanup refinery artifacts folder from unneeded jars - https://phabricator.wikimedia.org/T231856 (10JAllemandou) Audit of versioned jar files needed (non-versioned files should always be there): * In puppet repo: | **jar** | **Defined in** | | `camus-wmf-0.1.0-wmf9.jar` | `mod...
[12:54:52] 10Analytics, 10Analytics-Kanban: Cleanup refinery artifacts folder from unneeded jars - https://phabricator.wikimedia.org/T231856 (10JAllemandou) Question for @Ottomata and @Nuria : Do we prefer to move old jars to new ones and get rid of every jar older than version X, or do we get rid of currently-unused jar...
[13:01:45] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10EventBus, and 5 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (10Nuria) >Q. Do we need to indicate that certain fields are PII only in combination with each other, like country + us...
[13:03:06] 10Analytics, 10Analytics-Kanban: Cleanup refinery artifacts folder from unneeded jars - https://phabricator.wikimedia.org/T231856 (10Nuria) I think removed unused jars would work for now, right?
[13:15:49] 10Analytics, 10Analytics-Cluster, 10Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) ping @Groceryheist I don't know ryanmax's phab id, so I will email him.
[13:23:29] 10Analytics, 10Analytics-Cluster, 10Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Nuria) Give that this is likely to impact other users can we temporarily compress that directory ( (/home/ryanmax) to make up space?
[13:31:33] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10EventBus, and 5 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (10Ottomata) > maybe it is worth thinking of ingestion guidelines as being another document rather than them being spec...
[13:31:46] 10Analytics, 10Analytics-Kanban: Cleanup refinery artifacts folder from unneeded jars - https://phabricator.wikimedia.org/T231856 (10Ottomata) Ya let's just remove all currently unused.
[13:34:22] 10Analytics, 10Analytics-Cluster, 10Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) I deleted a little bit from my home dir, so we have a little bit of room for a bit. I'll give them a little time to respond.
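(A minimal shell sketch of the "temporarily compress that directory" idea from T232068 above: find what is filling the disk, then compress the largest files in place. The mount point layout and the example file name are assumptions for illustration, not taken from the ticket.)

    # Assumption: user data lives under /srv and /home on the notebook host.
    sudo du -sh /srv/* 2>/dev/null | sort -rh | head          # what is filling /srv
    du -ah /home/ryanmax 2>/dev/null | sort -rh | head -20    # largest files in the directory mentioned in the ticket
    gzip /home/ryanmax/big-intermediate-file.tsv               # hypothetical file; compressed in place as .tsv.gz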
[13:36:34] (03PS1) 10Joal: Cleanup artifacts folder [analytics/refinery] - 10https://gerrit.wikimedia.org/r/534611 (https://phabricator.wikimedia.org/T231856)
[13:45:07] 10Analytics, 10Analytics-Cluster, 10Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10RyanSteinberg) I just deleted some files and I'm compressing others. I didn't realize space was so tight ... my apologies.
[13:49:15] 10Analytics, 10Analytics-Cluster, 10Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) The Jupyter Notebook servers are meant mostly to be an GUI/Cli interface to Hadoop based systems. If you can, please consider storing data in HDFS.
[14:01:48] 10Analytics, 10Analytics-Cluster, 10Operations: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Nuria) @RyanSteinberg + 1 to andrew's suggestion. data should not be kept on notebook servers, rather you can keep it on your user database in hadoop. This is due to space concerns in no...
[15:32:13] nuria: 1:1?
[15:32:20] ottomata: yessir
[15:32:23] ottomata: joining
[15:57:08] Hi team - Shall we skip standup today for WMf-Staff?
[16:01:16] no one in the batcave... all in the staff meeting?
[16:01:27] ok, going
[16:01:31] mforns: So am I
[16:02:09] ottomata, joal , mforns , fdans : standup coincides with staff meeting, let's move it to after?
[16:02:16] works for me
[16:02:24] ok
[16:02:27] fdans will not be there
[16:02:31] (see email)
[16:02:42] joal: right!
[16:02:51] ottomata: does that sound good?
[16:03:06] ottomata: standup will be in an hour (at the time we had reserved for groskin)
[16:03:13] oh?
[16:03:14] ok
[16:04:21] (03CR) 10Nuria: [C: 03+1] "Feng shui," [analytics/refinery] - 10https://gerrit.wikimedia.org/r/534611 (https://phabricator.wikimedia.org/T231856) (owner: 10Joal)
[16:07:52] (03CR) 10Ottomata: [C: 03+1] ""Marie Condo"" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/534611 (https://phabricator.wikimedia.org/T231856) (owner: 10Joal)
[16:09:17] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) Next steps?
[16:27:32] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705 (10Pchelolo) 05Open→03Resolved
[16:27:41] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), and 4 others: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 (10Pchelolo)
[16:32:19] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team Legacy (Watching / External), 10Services (watching): Decomission eventlogging-service-eventbus - https://phabricator.wikimedia.org/T232122 (10Ottomata)
[16:33:20] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), and 4 others: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 (10Ottomata)
[16:37:45] 10Analytics: Parse wikidumps and extract redirect information for 1 small wiki, romanian - https://phabricator.wikimedia.org/T232123 (10Nuria)
[16:58:41] any idea what can cause intermittent hive job failures? They are run through hive with oozie and the hive2 action. I re-ran about 900 jobs last night and ~100 failed. I re-ran those 100 and 7 of those failed
[16:58:58] they also intermittently fail on their own, it's just more obvious when running a bunch in a short period
[17:02:35] hm, no, have you inspected app logs of some of the failed jobs?
[17:03:27] ping joal , mforns standduppp
[17:03:31] yess
[17:03:39] ottomata: yea, but they just do nothing and then say hive exited with non-0 exit code :(
[17:03:52] ebernhardson: we can look in detail at task logs , did you do that?
[17:05:17] ebernhardson: beyond application logs
[17:05:21] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), and 4 others: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 (10Ottomata)
[17:05:31] nuria: i just look at `yarn logs -applicationId ...`
[17:05:50] nuria: for example, application_1564562750409_156904
[17:06:05] ebernhardson: sudo -u hdfs mapred job -logs task-id
[17:06:12] i can't sudo to hdfs :P
[17:06:14] ebernhardson: task ids appear on application logs
[17:06:25] ebernhardson: it does not matter, they will be on discovery user
[17:06:54] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team Legacy (Watching / External), and 2 others: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 (10Ottomata)
[17:06:58] ebernhardson: so you just have to be the user that owns the logs, on meeting can talk in a bit
[17:07:29] nuria: hmm, looking. hdfs says `could not find or load main class mapred'
[17:07:56] ebernhardson: sorry, hdfs mapred job -logs task-id
[17:29:36] ebernhardson: my reading tells me it could be related to hive.autoconvert - But I don't understand how it could happen erratically
[17:30:27] joal: it looks like whats happening is they all run the `add jar ...`, but then some subset of runs fail to find a class in that jar
[17:30:48] ebernhardson: https://community.cloudera.com/t5/Support-Questions/Hive-reloadable-udf-random-Unable-to-find-class-error/td-p/89138 maybe?
[17:30:52] joal: i tried a simple hive script looped on stat1007, but it refuses to fail after 30 runs. Will try again with an oozie workflow, maybe some machine in the cluster is behaving differently
[17:31:54] ebernhardson: some nodes are fuller than others, maybe tmp being full, jar can't be copied and job fials
[17:33:16] hmm, doesn't seem too close but possibly. If tmp is full that could do it i suppose, it's the refinery-hive jar but it's only 40MB
[17:33:26] 10Analytics, 10Operations, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron)
[17:40:07] 10Analytics, 10Analytics-Kanban: Upgrade python ua parser to 0.6.3 version - https://phabricator.wikimedia.org/T212854 (10Nuria) a:03JAllemandou
[17:44:50] ebernhardson, joaL; sounds unlikely that a full tmp would cause it the problem
[17:45:14] ebernhardson: were you able to look at task logs (maybe joal did?)
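(For reference, the log-retrieval commands being discussed, collected in one sketch; the application and task-attempt IDs are the example ones from this conversation, and the commands have to be run as the job owner or a superuser to read the logs.)

    # Aggregated container logs for one YARN application:
    yarn logs -applicationId application_1564562750409_156904 | less
    # Logs for a single task attempt of the underlying MapReduce job:
    mapred job -logs job_1564562750409_156904 attempt_1564562750409_156904_m_000000_0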
[17:45:30] 10Analytics, 10Analytics-Kanban: Correct oozie jobs parameterization - https://phabricator.wikimedia.org/T231787 (10Nuria) 05Open→03Resolved
[17:45:55] ebernhardson, nuria - Non-regular failures feel like specific-nodes issue to me (either space or something else)
[17:45:57] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Portals: Review all the oozie coordinators/bundles in Refinery to add alerting when missing - https://phabricator.wikimedia.org/T228747 (10Nuria)
[17:46:22] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Generate edit totals by country by month/year - https://phabricator.wikimedia.org/T215655 (10Nuria) 05Open→03Resolved
[17:47:10] joal: ya, that i agree but space seems unlikely
[17:47:41] ebernhardson: do you have the error handy so i can take a look?
[17:48:04] Error: Error while compiling statement: FAILED: SemanticException Generate Map Join Task Error: Unable to find class: org.wikimedia.analytics.refinery.hive.GetMainSearchRequestUDF
[17:48:08] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Portals, 10cloud-services-team: https://dumps.wikimedia.org/other/pageviews/ lacks hourly pageviews since 20190722-17:00 - https://phabricator.wikimedia.org/T228731 (10Nuria)
[17:48:10] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Portals: Review all the oozie coordinators/bundles in Refinery to add alerting when missing - https://phabricator.wikimedia.org/T228747 (10Nuria) 05Open→03Resolved
[17:48:39] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow - https://phabricator.wikimedia.org/T222254 (10Nuria) 05Open→03Resolved
[17:48:42] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Upgrade Spark to 2.4.x - https://phabricator.wikimedia.org/T222253 (10Nuria)
[17:51:27] ebernhardson: and the machines in which it fails are always the same ones ? (the task logs have the nodes i think)
[17:52:39] nuria: i still wasn't able to pull the task logs, it's also not as easy to pull all the failures out of hue i have to copy/paste one at a time, open the logs, etc. Is there a better way to ask what host all the failed jobs ran on?
[17:53:46] ebernhardson: Oh I thought you had that solved - yarn logs --applicationId application_1564562750409_156904 --appOwner ebernhardson | less
[17:53:49] ebernhardson: waht i do is to get yarn logs for applicationId and after look at some task logs but all in the cmd line, hue has never been that useful to find issues
[17:53:51] ebernhardson: sorry
[17:54:02] joal: right, for one application. But then i need to look at the ~100 failed jobs
[17:54:20] nuria: right, i have to copy the applicationId from hue one at a time
[17:54:30] ebernhardson: ca you send me your link to hue so i know what you mean?
[17:54:32] *can
[17:54:40] ebernhardson: we could ask oozie CLI as well
[17:54:42] nuria: i dont get the logs from hue, just the list of application id's that fail
[17:55:07] also annoyingly after i re-run them it doesn't show them anymore: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0025649-190730075836326-oozie-oozi-C/
[17:55:22] and while they fail oozie, the actual mapreduce job isn't marked as failure, so i can't find them in yarn.wikimedia.org failures tab :(
[17:56:18] ebernhardson: This last bit is bizarre - mapreduce has failed for me
[17:56:54] joal: hmm, how so?
[17:58:14] ebernhardson: I mean I think mapreduce should have failed
[17:58:48] joal: hmm, https://yarn.wikimedia.org/cluster/apps/FAILED is completely empty?
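(A rough sketch of scripting the "which hosts did the ~100 failures run on" question, along the lines of the oozie CLI suggestion picked up just below: list the killed Oozie jobs, then grep the aggregated YARN logs for worker host names. The filter syntax, the placeholder application IDs, and the hostname pattern are assumptions for illustration.)

    # List recently killed oozie jobs for one user (OOZIE_URL must point at the oozie server):
    oozie jobs -oozie "$OOZIE_URL" -filter 'status=KILLED;user=ebernhardson' -len 200
    # For each corresponding application id, extract the worker hosts seen in the logs:
    for app in application_1564562750409_156904; do   # substitute the real ids collected above
        yarn logs -applicationId "$app" | grep -oE 'analytics10[0-9]{2}' | sort | uniq -c
    done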
[17:58:55] yeah I see that
[18:00:34] ebernhardson: oozie jobs oozie jobs -filter status=KILLED -filter user=ebernhardson
[18:00:39] ottomata: i thought user analytics would have access to other jobs like eriks like: sudo -u analytics yarn logs -applicationId application_1564562750409_156904 -appOwner ebernhardson > application_1564562750409_156904.log
[18:01:22] joal: ahha, indeed that lists a bunch that seem like our failures. Lemme see if can translate those into application id's and grpe the logs to see which hosts were involed
[18:01:51] that'd be interesting
[18:02:20] ebernhardson: ya, you can do it like:
[18:02:24] https://www.irccloud.com/pastebin/dyH5sxmG/
[18:02:42] ebernhardson: see phases? app id is job_1564562750409_156904
[18:03:00] ebernhardson: I think the problems happened on ApplicationMaster
[18:03:32] joal: I just do not get how can we look at erik's logs with analytics user, do you know?
[18:04:20] nuria: we need hdfs user
[18:04:26] analytics user is no superuser
[18:04:35] this was the point to change it :)
[18:04:53] joal: INDEED, nuria 1$ on the jar
[18:16:30] joal: so i looked at
[18:16:32] sudo -u hdfs mapred job -logs job_1564562750409_156904 attempt_1564562750409_156904_m_000000_0
[18:16:54] joal: and i do not see anything other the already mentioned error
[18:17:49] not seeing any patterns in the hosts involved, it's only looked at ~75 jobs so far, but there are 2 hosts per job and 54 unique hosts involved
[18:18:10] (only one of the two matters, but grepping yarn logs can't tell which of the two to use)
[18:18:21] :(
[18:18:40] nuria: sudo -u hdfs yarn logs --applicationId application_1564562750409_156904 --appOwner ebernhardson | grep -A 20 Erro
[18:19:24] ebernhardson: ya, this is a mistery
[18:19:25] it's not the biggest problem, but it intermittently complains, and then i reran some jobs and it sent me 100 failure emails :)
[18:19:46] re-run works eventually, usually first try. It was only re-running 100 jobs that needed a second try
[18:20:20] ebernhardson: feels like that could be the one - https://issues.apache.org/jira/browse/HIVE-14555
[18:20:50] ebernhardson: fixed in hive 2.0 :(
[18:20:51] ebernhardson: has this happened for ever? the hive to beeline transition is the only systemic change that comes to mind
[18:21:20] nuria: it's only started happening since the hive2 transition
[18:21:29] nuria: but i dont know thats related, its just the only thing recently changed.
[18:23:01] ebernhardson: Does your request make use of the UDF in a join clause?
[18:23:29] nuria: not a lot though, i have fail emails for aug 23, 24, sep 2. These run hourly
[18:23:36] ebernhardson: the refinery jar version is refinery_jar_version=0.0.39
[18:23:45] ebernhardson: is this right?
[18:23:56] joal: not as a join condition, but in a joinde table
[18:24:12] ebernhardson: I was reading that
[18:24:19] nuria: thats commented out :( This is sourcing /user/ebernhardson/refinery-hive-0.0.91-SNAPSHOT.jar
[18:24:26] i forget why it's on a snapshot, there was some thing ....
[18:25:01] it can probably switch back to a recent release though. Since it was hard coded to the /user/... i didn't set the refinery_jar_version variable
[18:25:44] ebernhardson: I'm trying to think of a low-tech solution for the issue
[18:25:54] ebernhardson: We've not experienced it on analytics jobs yet
[18:26:09] joal: the low-tech solution is an auto-retryer :)
[18:26:18] ebernhardson: I think it's because we mostly don't use joins
[18:26:33] ebernhardson: right - IIRC oozie can do that
[18:26:36] i intended to switch this to spark a few times, probably should.
[18:26:50] it's just a messy hql query so porting it isn't fun :P
[18:26:56] ebernhardson: this would be a real solver for sure (plus possibly some real perf gain)
[18:27:12] ebernhardson: there are chances it works actually almost out of the box
[18:27:31] joal: just pass into spark.sql(...)? I suppose can try and see
[18:27:42] ebernhardson: https://stackoverflow.com/questions/38304821/how-to-auto-rerun-of-failed-action-in-oozie
[18:27:51] indeed
[18:28:20] anyway - sorry for the no-solution ebernhardson :S
[18:28:31] joal: thanks, still have ideas :)
[18:28:39] gone for today team - see you tomorrow
[18:28:53] enjoy!
[18:36:17] 10Analytics, 10Discovery, 10Operations, 10Research-Backlog, 10Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10debt)
[19:18:10] ebernhardson: I think it is worth trying whether removing ADD JAR hdfs://analytics-hadoop/user/ebernhardson/refinery-hive-0.0.91-SNAPSHOT.jar; would fix the issue
[19:18:42] ebernhardson: and just running with the latest refinery jar
[19:19:31] ebernhardson: also i wonder if adding the jar twice like:
[19:20:02] https://www.irccloud.com/pastebin/cDVbKKb8/
[19:21:01] ebernhardson: might fix it (although i think joseph is right, this is a hive/beeline problem) , worth trying
[19:28:06] ebernhardson: also i think it is very possible that beeline is executing part of the sql outside the ADD JAR Cmd , just with teh commanline arguments so changing refinery_jar_version=0.0.39 to a current jar might fix matters
[19:31:33] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Cmjohnson) @ottomata the on-site work is done, They will need updated production DNS but all are moved and c...
[19:41:53] ebernhardson: this one: https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/oozie/query_clicks/hourly/coordinator.properties#L25
[19:52:13] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), 10Services (watching): Migrate all event-schemas schemas to current.yaml and materialize with jsonschema-tools. - https://phabricator.wikimedia.org/T232144 (10Ottomata)
[19:52:17] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), 10Services (watching): Migrate all event-schemas schemas to current.yaml and materialize with jsonschema-tools. - https://phabricator.wikimedia.org/T232144 (10Ottomata)
[20:48:23] 10Analytics, 10Analytics-Kanban: Upgrade python ua parser to latest version - https://phabricator.wikimedia.org/T212854 (10Nuria)
[21:02:27] We should have a relatively stable set of metric names.
[21:02:41] When we deploy new models, that should add a small set -- maybe once per month.
[21:03:05] Woah. Wrong channel.
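(A hedged sketch of the "just run with the latest refinery jar" suggestion above: re-run one failing hour against a released refinery-hive jar instead of the user snapshot and see whether the "Unable to find class" error persists. The jar path, the temporary-function name, and the test file location are illustrative assumptions; only the UDF class name comes from the error quoted earlier.)

    # Hypothetical released-jar path; substitute the real artifact location.
    cat > /tmp/test_query.hql <<'EOF'
    ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar;
    CREATE TEMPORARY FUNCTION get_main_search_request
      AS 'org.wikimedia.analytics.refinery.hive.GetMainSearchRequestUDF';
    -- ... rest of the original query_clicks query, unchanged ...
    EOF
    beeline -f /tmp/test_query.hql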
:)
[21:49:00] 10Analytics: Parse wikidumps and extract redirect information for 1 small wiki, romanian - https://phabricator.wikimedia.org/T232123 (10Nuria)
[22:57:14] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), 10Services (watching): Migrate all event-schemas schemas to current.yaml and materialize with jsonschema-tools. - https://phabricator.wikimedia.org/T232144 (10Pchelolo) 05Open→03Resolved
[22:57:17] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10CPT Initiatives (Modern Event Platform (TEC2)), 10Services (watching): CI Support for Schema Registry - https://phabricator.wikimedia.org/T206814 (10Pchelolo)
[23:15:50] 10Analytics, 10Product-Analytics: event_user_id is always NULL for anonymous edits in Mediawiki History table - https://phabricator.wikimedia.org/T232171 (10nettrom_WMF)
[23:26:45] 10Analytics, 10Analytics-Kanban: Wikistats: month on dashboard changes on any redraw - https://phabricator.wikimedia.org/T230514 (10Nuria) 05Open→03Resolved
[23:26:54] 10Analytics, 10Analytics-Kanban: Wikistats: month on dashboard changes on any redraw - https://phabricator.wikimedia.org/T230514 (10Nuria)
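(On the T232123 task above, "Parse wikidumps and extract redirect information for 1 small wiki, romanian": a very rough first-pass sketch using the public XML dump. The file name follows the standard dumps.wikimedia.org layout, and the grep is only a sanity check, not the eventual parser.)

    # Fetch the latest rowiki articles dump and count <redirect .../> elements.
    wget https://dumps.wikimedia.org/rowiki/latest/rowiki-latest-pages-articles.xml.bz2
    bzcat rowiki-latest-pages-articles.xml.bz2 | grep -c '<redirect title='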