[10:02:00] * joal has an explanation that reconciles all examples of restore so far !!!
[10:02:38] but /me has no direct solution as to how to represent what happens in mediawiki-history :(
[13:42:55] Taking a break a-team
[14:38:11] Hey ottomata - Sorry I'm late
[14:39:50] 10Analytics-Tech-community-metrics, 10Developer-Relations, 10Phabricator, 10ECT-December-2014, 10Patch-For-Review: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#793868 (10MelodyKramer) Hello! I just received the monthly stats email ([Wikitech-l] Phabricator...
[14:39:54] ottomata: Available if you want to talk about spark2 :)
[14:40:19] milimetric: Hi
[14:40:29] hi joal
[14:40:33] cave?
[14:40:43] milimetric: if you have some time, please :)
[14:41:19] joal: hi!
[14:41:22] oh in ops sync
[14:41:30] ottomata: I'm with milimetric
[14:44:30] (03PS3) 10Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[14:47:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[15:02:37] Hi urandom - I think we can cancel today, all of Europe is on bank holiday (All Saints)
[15:03:38] 10Analytics-EventLogging, 10Analytics-Kanban: Refine should parse user agent field as it is done on refinery pipeline - https://phabricator.wikimedia.org/T178440#3726654 (10Ottomata) Perhaps! We could detect the existence of a JSON string in a string field, and auto-parse it as a JSON sub object and use a str...
[15:14:15] The web team wants to extend the current Popups experiment a bit more to gather more data. I understand we have no disk shortage issues on Hadoop currently, correct?
https://phabricator.wikimedia.org/T178500#3723294
[15:31:23] 10Analytics-Tech-community-metrics, 10Developer-Relations, 10Phabricator, 10ECT-December-2014, 10Patch-For-Review: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#3726755 (10Aklapper) @MelodyKramer: I nowadays dump //two// of these numbers at [[ https://meta.w...
[15:43:21] 10Analytics-Tech-community-metrics, 10Developer-Relations, 10Phabricator, 10ECT-December-2014, 10Patch-For-Review: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#3726771 (10MelodyKramer) Ah, thanks @Aklapper - is it possible to get that link (or the informati...
[15:47:59] HaeB: no storage issues in hadoop. we are maintaining a temporary custom import/refine job for this schema, while we work on more generically supporting eventlogging data in hadoop
[15:48:23] i think we can keep running the custom job for yall a while longer, seems fine with me
[15:48:33] ok, thanks!
[16:04:43] wee wow https://dist.apache.org/repos/dist/release/kafka/1.0.0/RELEASE_NOTES.html
[16:08:06] joal: q, for jsonrefine cron, do you see any reason not to run in yarn client mode?
[16:08:08] instead of cluster?
[16:08:17] i was thinking cluster, buuut, why not client? then we can have logs local
[16:08:40] 10Analytics-Tech-community-metrics, 10Developer-Relations, 10Phabricator, 10ECT-December-2014, 10Patch-For-Review: Monthly report of total / active Phabricator users - https://phabricator.wikimedia.org/T1003#3726843 (10Aklapper) @MelodyKramer: The file to edit is https://phabricator.wikimedia.org/source/...
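The idea floated in T178440 above (detect a JSON string inside a string field and auto-parse it as a sub object) could be sketched roughly like this. This is only an illustration of the detection logic, not the Refine implementation; the function name is invented.

```python
import json

def maybe_parse_json(value):
    """If a string field looks like an embedded JSON object, parse it
    into a dict (sub object); otherwise return the value unchanged.

    Sketch of the auto-parse idea from T178440, not production code.
    """
    if isinstance(value, str) and value.lstrip().startswith("{"):
        try:
            parsed = json.loads(value)
            # Only promote actual JSON objects, not bare scalars/arrays.
            if isinstance(parsed, dict):
                return parsed
        except ValueError:
            # Not valid JSON after all; leave the string as-is.
            pass
    return value
```

In a real refine step this would run per string column, with the parsed dict then contributing nested fields to the inferred schema.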
[17:13:48] (03CR) 10Ottomata: [V: 032] JsonRefine: refine arbitrary JSON datasets into Parquet backed hive tables [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: 10Joal)
[17:24:12] 10Analytics-Kanban, 10Pageviews-API, 10Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3727231 (10Pchelolo)
[17:51:10] 10Analytics, 10Analytics-Wikimetrics, 10Easy, 10Google-Code-In-2015: can't remove users from cohort in Iceweasel (aka Firefox, works fine in Chromium) - https://phabricator.wikimedia.org/T115160#3727344 (10D3r1ck01) @Aklapper, can this be fast-forwarded to #google-code-in-2017?
[18:17:56] a-team qq
[18:18:17] am planning on refining both eventlogging and analytics into the same new database 'event'
[18:18:23] stored in /wmf/data/event
[18:18:31] thoughts? objections?
[18:18:40] ottomata, what do you mean by analytics?
[18:19:16] oh
[18:19:17] sorry
[18:19:25] eventlogging analytics and eventlogging eventbus
[18:20:04] lgtm!
[18:20:49] milimetric: ?
[18:21:04] yes
[18:21:07] what?
[18:21:40] oh, sounds good to me ottomata
[18:22:19] ok cool.
[18:22:27] i'm going to do a refinery source release and refinery deploy
[18:23:11] (03PS1) 10Ottomata: Update changelog.md with JsonRefine job info [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/387840
[18:23:28] (03CR) 10Ottomata: [V: 032 C: 032] Update changelog.md with JsonRefine job info [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/387840 (owner: 10Ottomata)
[18:35:41] 10Analytics-Dashiki, 10Analytics-Kanban, 10Patch-For-Review: Add option to not truncate Y-axis - https://phabricator.wikimedia.org/T178602#3697395 (10kaldari) Yay! Thanks everyone!
[18:40:10] !log rerunning unique_devices-per_project_family-druid-monthly-wf-2017-10
[18:40:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:42:05] (03PS1) 10Ottomata: Add scala compile plugin to refinery-core [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/387845
[18:48:39] (03PS4) 10Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[18:50:33] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[18:51:08] (03CR) 10Ottomata: [C: 032] Add scala compile plugin to refinery-core [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/387845 (owner: 10Ottomata)
[18:55:40] (03PS5) 10Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[18:57:21] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[18:57:24] (03PS6) 10Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[18:58:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[19:02:26] mforns: the link at
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#The_white-list is broken
[19:02:49] HaeB, oh! 404 :[
[19:02:53] will fix, thanks!
[19:05:33] !log deploying refinery with refinery/source 0.0.54 for JsonRefine job T162610
[19:05:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:05:40] T162610: Implement EventLogging Hive refinement - https://phabricator.wikimedia.org/T162610
[19:10:04] HaeB, done, thanks for the heads up :]
[19:12:05] mforns: thanks, but https://github.com/wikimedia/puppet/blob/production/modules/role/files/mariadb/eventlogging_purging_whitelist.tsv doesn't work either?
[19:13:51] HaeB, oh! there were 3 instances of that link, sorry
[19:15:54] HaeB, OK done now
[19:17:29] (03PS7) 10Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[19:19:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: 10Mforns)
[19:35:20] 10Analytics, 10Analytics-Wikistats: Wikistats Bug: Menu to select projects doesn't work - https://phabricator.wikimedia.org/T179530#3727714 (10jmatazzoni)
[19:36:19] 10Analytics, 10Analytics-Wikistats: Wikistats Bug: Menu to select projects doesn't work (sometimes?)
- https://phabricator.wikimedia.org/T179530#3727733 (10jmatazzoni)
[19:47:01] (03PS8) 10Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[20:06:20] !log rerunning pageview-druid-hourly-wf-2017-11-1-18
[20:06:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:31:56] Hey ottomata - Sorry for not having seen your question on client/cluster for logs
[20:32:22] ottomata: I think both are fine, cluster allows for less memory to be used on client (driver is in client in client mode obviously)
[20:32:28] joal: yeah, and also
[20:32:34] i encountered another prob
[20:32:34] but apart from that I can't think of anything else
[20:32:35] firewall rules on an03
[20:32:40] ?
[20:32:41] easier to make it use cluster
[20:33:06] analytics1003 has base firewall, without rules for spark ephemeral ports
[20:33:16] an03 doesn't have full access to hadoop workers?
[20:33:19] which means that the executors around the cluster can't contact the local client master
[20:33:20] weird
[20:33:25] clients can't contact it
[20:33:30] ya also thought it was weird
[20:33:33] Arf, make
[20:33:37] but, doing cluster mode with --files hive-site.xml is working
[20:33:41] great
[20:33:49] probably better that way anyway
[20:34:00] ottomata: more scalable
[20:34:03] aye
[20:34:36] As well ottomata, for eventlogging I +1 on having everything in the same place :)
[20:34:47] Let me know when everything is up and running !
[20:35:06] ok cool
[20:35:10] joal: its pretty much there!
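The workaround discussed above (run in YARN cluster mode and ship hive-site.xml to the driver with --files, since analytics1003's firewall blocks the ephemeral ports client mode needs) amounts to a spark-submit invocation like the one this helper builds. A minimal sketch only: the jar path, class name, and hive-site.xml location are example values, not the real JsonRefine job configuration.

```python
def spark_submit_cmd(jar, main_class, extra_files=()):
    """Build a spark-submit argv for YARN cluster mode, shipping
    hive-site.xml to the driver via --files.

    Sketch of the cluster-mode workaround described above; all paths
    and names are illustrative assumptions.
    """
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    # In cluster mode the driver runs on a worker, so it needs the Hive
    # config shipped alongside it rather than read from the local host.
    files = ["/etc/hive/conf/hive-site.xml", *extra_files]
    cmd += ["--files", ",".join(files), "--class", main_class, jar]
    return cmd
```

The trade-off matches the chat: client mode keeps driver logs local, cluster mode avoids executors having to connect back through the submitting host's firewall.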
[20:35:14] little tweaks and stuff
[20:35:17] I can spend a few minutes now and then to have a look at them, while breaking brains with mediawiki-history ;)
[20:35:19] but check out the hive event database
[20:36:16] this is a cron scheduled eventlogging refine job
[20:36:16] https://yarn.wikimedia.org/cluster/app/application_1504006918778_238899
[20:36:24] started 6 mins ago
[20:37:03] ottomata: I assume your cron checks for already existing jobs?
[20:37:14] yup
[20:37:17] :D
[20:37:31] awesome
[20:37:43] https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/analytics_cluster/refinery/job/json_refine_job.pp#L61
[20:37:46] ${refinery_path}/bin/is-yarn-app-runnin
[20:37:52] Will spend some time looking at the jobs tomorrow morning, and maybe do some queries / file size checks
[20:38:40] ottomata: I'll reuse that script for productionisation of streaming jobs :)
[20:38:45] Thanks for sharing
[20:43:40] ottomata, how about double underscore to separate concepts in a field name, like:
[20:43:46] event__page_id
[20:43:53] user_agent
[20:44:00] event_user_edit_count
[20:44:10] *event__user_edit_count
[20:44:20] timestamp
[20:44:32] event__skin
[20:45:25] but WHHYYYYYYY
[20:45:30] xD
[20:45:42] let it beeeee let it be!
[20:45:46] let it beeeEEE oh let it bee
[20:45:51] xDDDD
[20:45:55] event_pageId
[20:45:56] s'ok, no?
[20:46:08] sucks for those old schemas, but TOO BAD
[20:46:09] well, yes, but then...
[20:46:12] event_userEditCount
[20:46:13] webhost
[20:46:13] ?
[20:46:19] event_usereditcount
[20:46:22] useragent
[20:46:24] yeah yeah
[20:46:25] i know
[20:46:35] recvfrom
[20:46:47] i argued your side big time over here
[20:46:47] https://phabricator.wikimedia.org/T177078
[20:46:58] mforns: we hadn't totally decided to lower case, had we?
[20:47:13] not sure
[20:47:15] i forget
[20:47:22] are you even creating a hive table out of this json exported stuff?
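The cron guard referenced above skips launching a refine run when one with the same name is already in YARN. A rough sketch of what such a check can do (the real is-yarn-app-runnin script's logic may differ; the raw listing is passed in here to keep the sketch testable, and note that `yarn application -list` by default only shows SUBMITTED/ACCEPTED/RUNNING apps):

```python
def yarn_app_running(app_name, listing):
    """Return True if an application with this name appears in the
    output of `yarn application -list`.

    `listing` is the raw tab-separated CLI output; column 1 of each
    application row is the application name.
    """
    for line in listing.splitlines():
        fields = line.split("\t")
        if len(fields) > 1 and fields[1].strip() == app_name:
            return True
    return False
```

In a cron wrapper, a True result means exit without spawning a duplicate job.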
[20:47:32] no no
[20:47:45] the table should already be there, from EL refined
[20:48:22] ah i think you are fine then
[20:48:25] so the easy way would be for me to lowercase all nested fields right?
[20:48:29] the only issue with weird flattening is structs and hive
[20:48:31] so
[20:48:32] when you flatten
[20:48:37] don't change case at all
[20:50:20] your json output / druid fields will then be like event_userEditCount
[20:51:18] but then... if someone wants to pass metric fields or blacklisted fields to EventLoggingToDruid
[20:51:40] they will have to pass fieldNames differently depending on whether the fields belong to the capsule or to the event...
[20:51:59] like, if they want to blacklist webHost and event_editCount
[20:52:18] they will have to pass "webhost", "event_editCount"
[20:52:41] it's kinda inconsistent, I'm annoying :D
[20:53:40] I think I'd rather go with all lowercase...
[20:55:23] ottomata, ORRRR I could do the hack where EventLoggingToDruid knows the correct spelling of all capsule fields and corrects that!
[20:55:40] yea, that would fix it
[20:55:58] hah
[20:56:14] yeah but, who is going to be running this thing?
[20:56:16] likely just us
[20:56:23] blacklist no biggy
[20:56:25] yeeeeea
[20:56:33] maybe that's annoying because then the druid schema will be inconsistent
[20:56:48] but, really though, i think its too bad for these crappy camel cased schemas!
[20:57:04] i think we shouldn't feel so bad if they are weird
[20:57:07] hehe
[20:57:09] ok
[20:57:42] and, ya, as you say, the capsule fields will be the only ones that might be strange
[20:57:43] I'd like to try the correcting-capsule-case thing, and if it's less than say 6 lines of code, then do it
[20:57:50] event_... will always be what would be expected
[20:57:52] ok
[20:57:55] yea
[20:57:58] you are going to re camelCase them?
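The flattening ottomata describes above ("when you flatten, don't change case at all", producing names like event_userEditCount) can be sketched as a recursive walk that joins nested keys with a separator and leaves the original case untouched. The field names in the test are invented examples; this is not the EventLoggingToDruid code.

```python
def flatten(record, sep="_", prefix=""):
    """Flatten nested dicts into single-level keys joined by `sep`,
    preserving the original case of every field name.

    Illustrative sketch of the flattening scheme discussed above;
    with sep="__" it would instead yield the event__page_id style.
    """
    flat = {}
    for key, value in record.items():
        name = prefix + sep + key if prefix else key
        if isinstance(value, dict):
            # Recurse into the nested struct, carrying the joined prefix.
            flat.update(flatten(value, sep, name))
        else:
            flat[name] = value
    return flat
```

This is where the inconsistency in the chat comes from: capsule fields arrive already lowercased (webhost) while event sub-fields keep their schema's camelCase (event_userEditCount).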
[20:58:06] yes :]
[20:58:08] oof
[20:58:10] nasty
[20:58:11] ok
[20:58:55] note that this would continue to work in case we "fix" the source data in eventlogging refine
[20:59:23] right? cause Hive is case insensitive?
[21:00:12] Hey folks, gone for tonight, have a good night!
[21:00:18] byeeeee :]
[21:00:27] laters
[21:00:30] fix?
[21:02:39] ottomata, I mean, if we manage to refine EL events and keep the top level fields case
[21:02:47] in the future
[21:02:49] ah, mforns i just checked
[21:02:54] even in hive sql, outside of spark
[21:03:20] CREATE TABLE `otto.tableOne`(
[21:03:20] `comment` string,
[21:03:20] `camelCaseString` string,
[21:03:20] `nested` struct
[21:03:21] );
[21:03:36] xD
[21:03:44] the table name shows as
[21:03:46] tableone
[21:03:49] but, show create table has
[21:03:54] CREATE TABLE `tableone`(
[21:03:54] `comment` string,
[21:03:54] `camelcasestring` string,
[21:03:54] `nested` struct)
[21:03:56] so
[21:04:02] aha
[21:04:17] and if you try with dromedary?
[21:04:19] xD
[21:04:22] kiddin
[21:04:22] haha
[21:04:23] cool
[21:04:27] this is weird:
[21:04:32] so, no fix
[21:04:46] hive (otto)> show create table TABleONE;
[21:04:46] OK
[21:04:46] createtab_stmt
[21:04:46] CREATE TABLE `TABleONE`(
[21:04:46] `comment` string,
[21:04:47] `camelcasestring` string,
[21:04:47] `nested` struct)
[21:04:59] whaat!
[21:05:15] ok
[21:05:33] yeah, so its nasty
[21:05:35] i think we can't fix.
[21:05:49] hmm, lemme see what is in the mysql metastore for the schema def
[21:05:50] can you select nested.camELhAsHUMp ?
[21:05:51] rather than using cli
[21:05:55] oh def
[21:06:00] oh
[21:06:01] HMM
[21:06:01] maybe
[21:06:03] good Q
[21:06:04] trying
[21:06:21] yes
[21:06:21] i can
[21:06:25] cool :]
[21:06:26] query is case insensitive
[21:09:27] huh dunno where these column defs are in mysql
[21:09:34] but, the table name is def stored lowercased
[21:14:21] aha
[22:09:17] 10Analytics, 10Analytics-EventLogging: Timestamp format in Hive-refined EventLogging tables is incompatible with MySQL version - https://phabricator.wikimedia.org/T179540#3728047 (10Tbayer)
[22:09:38] 10Analytics, 10Analytics-EventLogging: Timestamp format in Hive-refined EventLogging tables is incompatible with MySQL version - https://phabricator.wikimedia.org/T179540#3728063 (10Tbayer)
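Since Hive lookups turn out to be case insensitive (as ottomata verifies above), the "correcting-capsule-case" hack mforns proposes reduces to a lowercase-keyed lookup table of canonical spellings. A sketch under those assumptions; the capsule field list here is a made-up subset, and this is not the actual EventLoggingToDruid code.

```python
# Canonical spellings of some EventCapsule fields. The real capsule
# has more fields; this subset is illustration only.
CAPSULE_FIELDS = ["webHost", "userAgent", "recvFrom"]

def recase(field_name):
    """Map a lowercased (Hive-flattened) capsule field name back to
    its canonical camelCase spelling; leave other names (like
    event_... fields, which keep their schema's case) untouched.
    """
    canonical = {f.lower(): f for f in CAPSULE_FIELDS}
    return canonical.get(field_name.lower(), field_name)
```

This keeps the Druid schema consistent even though Hive stores the capsule columns lowercased, and it keeps working if the source data's casing is ever "fixed" upstream, precisely because the lookup is case insensitive.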