[07:45:50] o/ [07:47:59] o/ [07:48:09] Bologna is covered with snow :) [08:50:08] Pictures elukey ! [08:50:25] elukey: do you want us to stop camus ? [08:50:57] on my side the trick-to-join has worked this weekend, but there is a bug :) [08:51:31] joal: sure, I am completing the work for weblog1001 in the meantime [08:51:44] ah wait better to stop it now right? [08:51:57] elukey: I'm sorry Iu have not followed what happens with weblog :( [08:52:13] elukey: It would be great yes [08:52:20] elukey: just before passing the hour [08:52:29] ack, no need to be sorry, I am moving one host to another one for ops, nothing more :) [08:52:54] ah ok - I'm always sorry not to know everything - you know me ;) [08:53:16] * joal is in a continuous-improvement that will never end :) [08:55:40] joal: mmmmm let's maybe do it next hour, because I want to investigate the HDFS balancer first, IIRC the high volume of logs was related to it moving stuff around [08:55:45] for the new workers [08:55:58] since we are not in rush, I'd say to be safe and not sorry :) [08:56:10] what do you mean with trick-to-join ? [08:56:18] elukey: no problem - It makes sense elukey that balancer had a lot of work when we grew the cluster (completely forgot about that thoug) [08:56:43] yeah [08:56:47] poor balancer [08:56:58] elukey: for once it had real stuff to do ;) [09:42:15] joal: do you have an example of SQL query to druid? [09:51:12] ah it is a simple {"query":"SELECT COUNT(*) AS TheCount FROM data_source"} [10:01:19] ok added an example in https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Druid#Example_of_queries [10:01:23] so I will not ask anymore :P [10:07:45] (03PS1) 10Michael Große: Add build for deployment [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/480036 (https://phabricator.wikimedia.org/T209399) [10:09:05] (03CR) 10Michael Große: "Hope this is the right way to get a change deployed? 
The associated source change is there: https://gerrit.wikimedia.org/r/c/analytics/wmd" [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/480036 (https://phabricator.wikimedia.org/T209399) (owner: 10Michael Große) [10:11:49] elukey: I actually didn't have any - I never use SQL with Druid [10:14:51] it is really cooooool [10:15:42] :) [10:18:20] joal: is there the change to keep data for webrequest_sampled_128 on druid more than 7d? [10:18:32] elukey: ?? [10:18:33] we have a special "customer" from SRE (faidon) [10:18:45] *chance sorry [10:20:39] elukey: The reason for which we don't is space (and with space, computation/IOs/load with longer queries) [10:20:49] elukey: We can do it though [10:21:17] yes I imagined, but I also thought that we settled for 7d since we didn't have any active user (yet) [10:21:27] so it was pointless to store a huge amount of data [10:22:00] the use case from SRE is to build reports for Traffic (Top ISPs, etc..) [10:22:15] they are useful to peer with other parties etc.. [10:23:41] For comparison elukey: 3+ years of pageview_daily = 50Gb, 3 month of pageview-hourly = 145Gb, 7 days of webrequest_128 = 100Gb [10:24:13] lol [10:24:52] elukey: For precise use-cases (not exploration-helpers), we can look into seting up a dedicated datasource with the needed fields, and see if would be better in term of space [10:24:56] For instance [10:25:02] I think I'd prefer this approahc [10:25:27] ah yes [10:28:41] very smart [10:28:57] I asked to open a task so we can review it all together [10:29:02] sounds good :) [10:56:34] we have python 3.7 on stretch now! 
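An aside on the Druid SQL example mentioned above (`{"query":"SELECT COUNT(*) AS TheCount FROM data_source"}`): per the Druid docs, that JSON body is POSTed to the broker's `/druid/v2/sql` endpoint. A minimal sketch — the broker hostname and port below are placeholders (not from this log), and `data_source` is just the example name used in the conversation:

```javascript
// Build the JSON request body for Druid's SQL endpoint (/druid/v2/sql).
// The hostname/port in the commented fetch call are placeholders.
const body = JSON.stringify({
  query: 'SELECT COUNT(*) AS TheCount FROM data_source',
});

// fetch('http://druid-broker.example:8082/druid/v2/sql', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body,
// }).then(r => r.json()).then(rows => console.log(rows));

console.log(body); // the exact payload shown in the log
```

More query examples are on the wikitech page linked above.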
[10:56:36] \o/ [10:56:56] and 3.6 of course [11:02:45] 10Analytics-Kanban, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10elukey) [11:03:48] 10Analytics-Kanban, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10elukey) [11:05:10] 10Analytics-Kanban, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10elukey) [11:07:36] 10Analytics, 10Analytics-Kanban: Port our "fixes" for python 3.5 to our superset fork https://github.com/wikimedia/incubator-superset - https://phabricator.wikimedia.org/T211932 (10elukey) With https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480041/ the SRE enabled us to use Python 3.6 on Stretch (witho... [11:12:15] fdans: --^ [11:44:34] * elukey lunch! [12:29:37] (03CR) 10Addshore: [C: 04-1] "Because the jar now makes a couple of web requests we are going to have to change the cron that runs the jar to include webproxy settings." [analytics/wmde/toolkit-analyzer-build] - 10https://gerrit.wikimedia.org/r/480036 (https://phabricator.wikimedia.org/T209399) (owner: 10Michael Große) [14:20:35] hey team :] [14:24:58] hola marcel :) [14:25:10] hey lucaaa [14:44:30] ottomata: o/ [14:44:31] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480041/ [14:44:47] Christmas gift from SRE :D [14:44:48] hio! [14:45:27] NIIICE! [14:45:35] woww so cool [14:46:06] in theory we should be good to test Superset 0.29 (when it comes out) [14:47:31] also did you see https://www.confluent.io/confluent-community-license-faq ? [14:48:47] ah ya i saw that [14:48:56] i think that's fine for us, [14:49:04] we also don't use (yet) any confluent specific stuff [14:49:14] hm, i guess we have tentative plans to use the hdfs connector [14:49:23] maybe ksql maybe not tho [14:50:00] super [14:50:32] (luckily we didn't choose to use rest proxy :) [14:52:03] python 3 woowoo! 
[14:53:52] 10Analytics: Clean up home dirs for users jamesur and nithum - https://phabricator.wikimedia.org/T212127 (10elukey) p:05Triage→03Normal [15:06:48] (03PS1) 10Milimetric: Disable tooltips on compact screens [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/480090 (https://phabricator.wikimedia.org/T212023) [15:07:06] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Disable tooltips on compact screens [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/480090 (https://phabricator.wikimedia.org/T212023) (owner: 10Milimetric) [15:07:58] (03PS1) 10Milimetric: Release 2.5.2 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/480091 [15:08:18] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Release 2.5.2 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/480091 (owner: 10Milimetric) [15:10:14] 10Analytics, 10Analytics-Kanban, 10Datasets-General-or-Unknown, 10Patch-For-Review: cron job rsyncing dumps webserver logs to stat1005 is broken - https://phabricator.wikimedia.org/T211330 (10Milimetric) a:03elukey [15:12:54] 10Analytics, 10Research, 10WMDE-Analytics-Engineering, 10User-Addshore, 10User-Elukey: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410 (10elukey) >>! In T172410#4824944, @Neil_P._Quinn_WMF wrote: > To help you understand our needs, we want to start adding o... 
[15:15:03] hm the re-refine from friday didn't actually do what I expected...doing again but using our new --ignore_done_flag [15:17:05] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: [BUG] User agent parsing error for MobileWikiAppSearch table - https://phabricator.wikimedia.org/T211833 (10Ottomata) a:05Nuria→03Ottomata [15:23:18] !log re-running refine_eventlogging_analytics with --ignore_done_flag (backfilling didn't complete properly on friday) - T211833 [15:23:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:23:21] T211833: [BUG] User agent parsing error for MobileWikiAppSearch table - https://phabricator.wikimedia.org/T211833 [15:27:46] (03PS2) 10Fdans: Add version number to footer [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/479223 [15:28:09] (03CR) 10Fdans: "Both suggestions applied :)" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/479223 (owner: 10Fdans) [15:31:39] (03PS2) 10Fdans: Don't add links if all-projects is selected [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/479251 [15:31:54] (03CR) 10Fdans: "@Nuria sorry, corrected now!" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/479251 (owner: 10Fdans) [15:32:57] ottomata, should I wait a bit before I run ELsanitization backfill then? [15:34:00] yea [15:34:16] mforns: i do wonder if there is a better way to do this... [15:34:19] we know oozie isn'ti t [15:34:31] because it can't keep track of dynamic datasets like this [15:34:37] but we do need a way to rerun a chain of dependencies [15:34:41] would airflow help us here? 
[15:34:46] 10Analytics: Deprecate Spark 1.6 in favor of Spark 2.x only - https://phabricator.wikimedia.org/T212134 (10elukey) p:05Triage→03Normal [15:35:17] * elukey waits for joal reading airflow [15:35:23] (hehe me too) [15:35:26] ottomata, don't know about airflow [15:35:30] it seems like it might...kinda like oozie but programmable with python [15:35:36] instead of rigid xml with variables [15:36:32] by what you guys say, sounds worth a try [15:36:49] could replace reportupdater as well [15:37:00] ottomata: just to be sure, we don't use spark 1.6 for any production job right? Only 2.x [15:37:12] possbily oozie too, although that's a much larger project it isn't worth embarking on right now i think [15:37:17] elukey: dats right :) [15:38:02] https://airflow.apache.org/tutorial.html#backfill [15:38:25] probably a bunch of the stuff i've implemented in refine & refine target could be done in airflow and made more generic [15:38:34] backfilling, finding targets [15:39:15] oo elukey https://airflow.apache.org/howto/run-with-systemd.html [15:39:43] (03PS10) 10Mforns: Allow for custom transforms in DataFrameToDruid [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) [15:39:53] 10Analytics, 10Analytics-Kanban, 10Datasets-General-or-Unknown, 10Patch-For-Review: cron job rsyncing dumps webserver logs to stat1005 is broken - https://phabricator.wikimedia.org/T211330 (10ArielGlenn) I'm not getting cronspam about this; is it still a problem? Also, I thought stat1005 was basically gone... 
[15:40:15] we should make airflow a propject sometime, maybe early next FY :) [15:41:32] agree [15:42:18] though, I thing that oozie has been working flawlessly for jobs with static data sources [15:43:30] ya [15:43:42] *think [15:55:33] 10Analytics, 10Analytics-Kanban, 10Datasets-General-or-Unknown, 10Patch-For-Review: cron job rsyncing dumps webserver logs to stat1005 is broken - https://phabricator.wikimedia.org/T211330 (10elukey) @ArielGlenn after https://gerrit.wikimedia.org/r/478022 we should be ok, I think that it is fine to leave t... [16:18:43] (03CR) 10Ottomata: [C: 03+1] "One Q but this looks awesome." (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns) [16:29:03] (03CR) 10Mforns: [C: 04-2] Allow for custom transforms in DataFrameToDruid (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns) [16:43:22] 10Analytics, 10Analytics-Kanban, 10DBA, 10Data-Services, and 3 others: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 (10Marostegui) >>! In T210693#4825092, @bd808 wrote: >>>! In T210693#4824836, @Milimetric wrote: >> I'm not sure fo... [16:44:42] ottomata: I was wondering.. in the cdh module, would it be ok to get rid of the various default classes? [16:44:50] basically setting the defaults in the class itself [16:45:03] or are the defaults heavily re-used? [16:48:31] elukey: yes, that was kind of an older convention [16:48:37] ack! 
[16:48:38] i think there are times where the defaults are varied or referenced [16:48:40] but there probably aren't many [16:51:54] (03CR) 10Ottomata: [C: 03+1] Allow for custom transforms in DataFrameToDruid (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns) [17:00:53] 10Analytics, 10Analytics-Kanban: Port our "fixes" for python 3.5 to our superset fork https://github.com/wikimedia/incubator-superset - https://phabricator.wikimedia.org/T211932 (10Nuria) While that would be ideal seeing the churn of changes in project and the low level of quality of master for 0.28 release I... [17:04:11] 10Analytics, 10Analytics-Kanban: Port our "fixes" for python 3.5 to our superset fork https://github.com/wikimedia/incubator-superset - https://phabricator.wikimedia.org/T211932 (10elukey) >>! In T211932#4828279, @Nuria wrote: > While that would be ideal seeing the churn of changes in project and the low level... [17:15:59] 10Analytics, 10Analytics-Wikistats: Wikistats 2 pageviews trend figure is wrong - https://phabricator.wikimedia.org/T212032 (10fdans) For now we'll be removing the time period trend since as you said it doesn't add value in its current form. We'll task later on how to display changes over time periods in a mor... 
[17:16:12] 10Analytics, 10Analytics-Wikistats: Wikistats 2 pageviews trend figure is wrong - https://phabricator.wikimedia.org/T212032 (10fdans) p:05Triage→03High [17:17:16] 10Analytics, 10Analytics-Kanban: Sanitization should be run a second time - https://phabricator.wikimedia.org/T212014 (10fdans) a:03mforns [17:17:27] 10Analytics, 10Analytics-Kanban: Sanitization should be run a second time - https://phabricator.wikimedia.org/T212014 (10fdans) p:05Triage→03High [17:26:42] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10fdans) We should be able to fix this by patching the event client, making sure that the correct number type is sent, as opposed to correcti... [17:27:01] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10fdans) p:05Triage→03Normal [17:28:41] oo joal a-team: https://airbnb.io/airpal/ [17:29:02] ottomata: I have seen that already :) [17:29:11] ottomata: I don't want us to replace quarry though :) [17:29:13] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Add partial blocks to mediawiki history tables - https://phabricator.wikimedia.org/T211950 (10fdans) p:05Triage→03Normal [17:29:21] i have not seen it [17:29:22] ok cool [17:30:04] ottomata: But worth checking when we'll think about tooling I guess [17:30:08] ay [17:30:09] aye [17:30:21] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10Nuria) Something like: Number.parseFloat(3.0).toFixed(2).toString(); [17:33:25] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? 
- https://phabricator.wikimedia.org/T153641 (10fdans) a:03Milimetric [17:33:38] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10Ottomata) Jaaaa, we have to fix this client side. It's part of the default Javascript JSON serializer. The only way to fix it is to use a... [17:33:51] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (10fdans) p:05Triage→03Normal [17:34:06] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (10fdans) [17:34:20] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10Ottomata) toString won't work, as it will then encode the value as "1.0", and Hive will see it as a string. 
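The `toString` objection above is easy to verify: a `JSON.stringify` replacer that returns `toFixed(...)` does preserve the decimal, but only by turning the value into a JSON string, which Hive will then type as a string. A minimal sketch (illustrative only, not the actual EventLogging client code):

```javascript
const event = { value: 1.0 };

// Default serialization: JS has a single number type, so the ".0" is lost.
const plain = JSON.stringify(event); // '{"value":1}'

// Replacer returning toFixed(): the decimal survives, but as a string,
// so Hive would see "1.0" (a string), not 1.0 (a double).
const asString = JSON.stringify(event, (key, val) =>
  typeof val === 'number' ? val.toFixed(1) : val
); // '{"value":"1.0"}'
```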
[17:35:57] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Add a tooltip to all non-obvious concepts like split categories, abbreviations - https://phabricator.wikimedia.org/T177950 (10fdans) [17:35:59] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Implement tooltips with async text load - https://phabricator.wikimedia.org/T211990 (10fdans) 05Open→03Resolved [17:39:53] 10Analytics, 10Analytics-Kanban: Port our "fixes" for python 3.5 to our superset fork https://github.com/wikimedia/incubator-superset - https://phabricator.wikimedia.org/T211932 (10fdans) p:05High→03Normal [17:40:05] 10Analytics: Deprecate Spark 1.6 in favor of Spark 2.x only - https://phabricator.wikimedia.org/T212134 (10fdans) a:03elukey [17:40:22] 10Analytics: Clean up home dirs for users jamesur and nithum - https://phabricator.wikimedia.org/T212127 (10fdans) a:03fdans [17:44:54] hi ottomata [17:45:39] I'm looking at EventLogging::logEvent(), and that has a parameter to omit the user agent. [17:46:13] would that suffice? Or am I missing something? [17:48:24] ottomata: gone for diner, will be b [17:48:27] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10Nuria) @Ottomata no, we can fix this via client side cause we can custom serialize types [17:48:31] back after to test presto [18:03:38] kostajh: hm, interesting, I don't know much about how that works, from what I knew the userAgent was collected from the header on the server side [18:03:43] milimetric: do you know about that parameter? 
^^^ [18:04:22] it is, but if you pass OMIT_USER_AGENT as an option, it won't get added [18:04:38] bbl, lunch [18:04:41] 10Analytics, 10Analytics-Kanban: Port our "fixes" for python 3.5 to our superset fork https://github.com/wikimedia/incubator-superset - https://phabricator.wikimedia.org/T211932 (10Nuria) >I'll create a staging environment next quarter and then hopefully 0.29 will be a bit more stable, if not we can keep the f... [18:10:18] kostajh: i'm looking at that code, it looks like that is for mediawiki php server side events only...and I think that code route might not be used by mediawiki anymore [18:10:27] hmmm [18:10:31] maybe it is, it is using sendBeacon there [18:10:50] ottomata: yes. We use it for editorjourney which is server side [18:11:08] That’s what Adam was asking about, not help panel (which is all client side) [18:11:09] ah! [18:11:11] ok didn't realize that [18:11:21] huh...i wonder how that works... [18:11:35] for client side eventlogging, which goes thorugh the beacon [18:11:43] only the event is sent [18:11:47] not the capsule fields [18:12:03] afaik the eventlogging server doesn't know how to process already encapsulated events.... [18:12:06] but maybe it todes... [18:14:33] ahhh hmmmmmm [18:14:49] ah no i am wrong! [18:14:56] the capsule is partly filled out by the client [18:14:57] ok cool [18:15:19] cool :) [18:15:58] then kostajh i think huh... [18:16:18] i'm not sure what happens... 
if the server doesn't set $encapsulated['userAgent'] here https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/a78824e09a43c820d1b1fab06a974d9822d22d7f/includes/EventLogging.php#L71 [18:16:28] i think the eventlogging server will it in from the request header [18:16:44] might need to test that tho [18:16:49] i think tho, if instead [18:16:50] the code did [18:17:20] if omit_user_agent, $encapsulated['userAgent'] = ''; it might work [18:19:30] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10Nuria) Something like: var f = function(k,v){ console.log(typeof v); if (typeof v == 'number') { return Number.parseFloat(v).toFixe... [18:19:39] joal: btw, in the sre meeting today, guiseppe asked 'what is presto', and i was able to send him the link to your doc [18:19:42] thanks so much for that! [18:20:31] milimetric (cc fdans mforns_brb) *i think* this works: https://jsfiddle.net/yuLqr5cx/10/ [18:21:23] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10Ottomata) OH, JSON.stringify has a function arg? I DID NOT KNOW THAT! That would work. [18:22:49] 10Analytics, 10Analytics-EventLogging: EventLogging client side serialization saves integer decimals as decimal-less numbers - https://phabricator.wikimedia.org/T211983 (10Ottomata) We'd only want to do that for numbers that are already floats. E.g. 1 should say 1, and 1.0 should stay 1.0. [18:41:09] ottomata: I'm glad if feels useful - I'll also very much like feedback from Giuseppe if he has some :) [18:41:35] ottomata: ready for some presto, or later? [18:52:55] * elukey off! 
[19:24:28] joal: just finished a meeting [19:24:44] gimme 15 mins for coffee and checking up on re-refine [19:25:48] ottomata: will be here :) [19:37:22] nuria: the problem is you need it to be {"other": "some", "value": 3.0}, without the quotes around the float [19:37:33] because if you send a string, that will fail validation too [19:40:44] milimetric: we can do that but i am not sure what you are saying is correct, the json blob is read like a string from the raw data storage and after casted to types pre-defined on the struct , no? cc ottomata [19:42:27] nuria: the whole blob is a string, sure, but each value inside the json is not a string, otherwise there would be no way to tell the difference between a string "3.0" and a float 3.0 [19:43:14] the hard part that Marcel pointed out is to get it to output 3.0, when it's possible JS's internal representation of that doesn't really know about the .0 part [19:44:41] milimetric: Number.parseFloat(v).toFixed(1)/1 [19:44:51] milimetric: will return a number [19:45:18] joal: ok let's look! [19:45:23] ok :) [19:45:24] nuria: ya [19:45:32] it needs to come out as you put it in [19:45:38] "value": 3.0 [19:45:40] as dan says [19:45:46] is different than "value": "3.0" [19:46:26] nuria: it returns a number, but Number(3.0) is a number too, the problem is how do you get the silly JSON.stringify to print out a number with the same decimal precision you gave it [19:47:39] ottomata: tested again from ca-worker-1 - Started "presto --catalog hive", then "show schema" --> Failed connecting to Hive metastore: [ca-coord-1.cloud-analytics-eqiad.wmflabs:9083 [19:47:47] aye hence the linked comments milimetric https://www.reddit.com/r/javascript/comments/7l5ood/alternative_to_jsonparse_for_maintaining_decimal/ [19:48:10] ottomata: I think it;s a typo: ca-coord-1.cloud-analytics>>>>.<<< near the bottom [19:48:31] oh hm ya [19:49:05] it sure is! 
[19:49:33] ottomata: just noticed it when pasted :( [19:50:03] fixing [19:50:42] ottomata: the labsdb_s1 catalog works though :) [19:50:47] nice [19:50:53] But we shouldn't advertise that as presto is gonna break it :) [19:50:55] yeah i manually added that on the nodes, that is not puppetized [19:50:56] haha yeah [19:54:55] ok joal applied and restarted [19:54:58] try now? [19:55:27] WORKS ! [19:55:28] :D [19:55:32] Thanks ottomata [19:55:39] that was easy! [19:55:46] trying to warm up the beast gently :) [19:55:46] joal, let's get a mw history copy going [19:55:47] i'll work on that [19:55:56] can I just grab one of the snapshots as is? [19:56:06] ottomata: my main issue it to get the files synced on the external endpoint [19:56:12] I'll manage downloading them [19:56:20] ? [19:56:21] ottomata: let's use 2018-11 [19:56:26] ha ya [19:56:28] ah* [19:56:48] joal this? /wmf/data/wmf/mediawiki/history/snapshot=2018-11 [19:56:51] ottomata: If you can do a manual rsync for once, let's try [19:56:58] yes please [19:57:12] milimetric, nuria, I think there is only 2 ways to workaround the 1.0->1 problem: 1) Make instrumentation handle it (by passing a type, or adding Number.EPSILON to the value) 2) Make the schema handle it (either in EL client or in Refine) [19:57:40] we need to handle it before [19:57:46] either el client [19:57:58] or worse, in eventlogging parse / eventgate parse [19:58:04] mforns: since we have 3 clients (php, ios, android ) besides js we will need to make server side code handle it regardless [19:58:10] noooo [19:58:12] :) [19:58:17] jajaja [19:58:29] that's basically special processing for invalid data [19:58:38] this is a js problem :D [19:58:50] do other JSON serializers do the same thing? [19:58:53] guess we shoudl check [19:59:06] ok joal 663G [19:59:07] oh boy! 
[19:59:08] :) [19:59:12] ottomata :) [19:59:18] ottomata: query failed on my try [19:59:23] failed: java.net.UnknownHostException: cloud-analytics [19:59:26] we have room on thorium, i'm going to put it there [19:59:34] hmmmmm [19:59:35] k ottomata [19:59:37] ottomata, the problem is that in JS there's no way of, given a number, determine if it's an integer [20:00:21] mforns: not sure if these work, but i was going to investigate them [20:00:34] https://github.com/MikeMcl/decimal.js/ [20:00:34] https://github.com/josdejong/lossless-json [20:00:34] https://github.com/wadey/decimaljson [20:00:34] https://github.com/MikeMcl/bignumber.js/ [20:00:48] we kind of have a way.......with json schema [20:00:52] since jsonschema has integer [20:00:57] ottomata: I just tried all those, cc mforns, sadly they also can't tell, I think it's a DEEP js problem [20:01:03] oh no [20:01:06] none of those work? [20:01:08] there might be a way to do it still [20:01:19] but no, those don't seem to solve this particular problem [20:01:21] right, if we have the schema yes [20:01:39] crap crackers...we might have to do this in eventgate or something [20:01:48] it wouldn't be that hard [20:01:49] just sucks [20:01:59] because if you JSON.stringify({"some": CustomNumberClassFromTheREDDIT_thread(3.0)}) it spits out a string [20:02:01] we'd enforce that if a schema just says number [20:02:02] its a float [20:02:08] if it says integer, its an int [20:02:24] milimetric: what happens if you have a json string like [20:02:26] value: 3.0 [20:02:38] what does JSON.parse do? [20:02:43] i guess i will find out for myself... :) [20:02:54] that'd be invalid, you mean {"value": 3.0}? 
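The dead end being circled above is reproducible in a few lines: JavaScript has one IEEE-754 number type, so `3.0` and `3` are the same value at runtime, and the `Number.parseFloat(v).toFixed(1)/1` trick suggested earlier round-trips through a string only to land back on that same number. A small sketch:

```javascript
// One number type: 3.0 and 3 are literally the same value.
// (Number.isInteger(3.0) is true — the ".0" never existed at runtime.)

// The toFixed(1)/1 trick: number → "3.0" (string) → 3 (number again),
// so JSON.stringify still prints the short integer form.
const n = Number.parseFloat(3.0).toFixed(1) / 1;
const out = JSON.stringify({ value: n }); // '{"value":3}'
```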
[20:03:03] also, parse is not the problem, right, it's stringify [20:03:09] > JSON.parse(s); [20:03:09] { value: 3 } [20:03:10] NOOOO [20:03:12] IMPOSSIBLE [20:03:16] its both ways [20:03:17] haha [20:03:31] :) yeah, JS numbers are comically screwed up [20:03:40] man oh man [20:03:44] that is so nasty [20:03:47] ottomata, in JS 1 and 1.0 are exactly the same, 1 is represented as a floating point number [20:03:57] we have to use the schema then [20:04:12] we can force instrumentation to let the client know [20:04:20] agh taht's JS tho [20:04:23] what would happen in e.g. java [20:04:31] when deserializing a json string with 3.0 in it? [20:04:42] I think we're looking at this wrong [20:04:53] I think we should catch this kind of error and just allow it [20:04:54] ok i'm very sniped on this, gotta help joal.... [20:04:56] will come back here and think [20:05:06] ok ok [20:05:07] :] [20:05:12] like, we don't care if someone accidentally sends an int where it should be a float or vice versa [20:05:18] WE DO CARE [20:05:21] numbers are numbers, nobody cares, we should cast both ways [20:05:24] no, not true [20:05:33] ok, i am in meeting , will come back to this [20:05:33] HMMM maybe... [20:05:40] maybe its only a refine schema inference problem [20:05:48] exactly [20:05:48] if we use kafka connect... we will be mapping from json schema [20:05:53] so we will always cast in the right way [20:06:14] ok joal i'm listinnig [20:06:17] yeah, we just need a way to say "this kind of cast is ok", and "this kind of cast is bad" [20:06:18] how did you get that [20:06:20] i want to repro [20:06:21] gimme query [20:06:33] sure [20:07:07] select * from joal.pageviews where day >= '2018-01-01' and day < '2018-01-02' limit 20; [20:07:10] ottomata: --^ [20:11:48] hm [20:12:47] ottomata: as if coordinator didn't have hostnames for its workers [20:13:16] yeah.......hmmmm [20:13:32] i'm not getting anything useful in server logs either [20:13:46] not nice :( [20:15:13] ottomata: more logs? 
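Since the serialized JSON cannot distinguish `3` from `3.0`, the "upcasting is always ok" idea floated above would have to live on the validation/Refine side rather than in the client: accept an integer where the schema expects a float, reject a float where it expects an integer. A hedged sketch — the `'double'`/`'integer'` type labels are hypothetical, not the team's actual schema vocabulary:

```javascript
// Accept int→float upcasts; reject float→int downcasts.
// Type names are illustrative placeholders, not real schema terms.
function numberCompatible(expected, value) {
  if (typeof value !== 'number' || !Number.isFinite(value)) return false;
  if (expected === 'double') return true;                      // any number upcasts fine
  if (expected === 'integer') return Number.isInteger(value);  // no downcasting
  return false;
}
```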
[20:15:27] as in, putting log-level to info? [20:18:07] logs are at info [20:18:10] still looking [20:19:38] haha [20:19:39] "org.eclipse.jetty.util.thread.strategy.EatWhatYouKill" [20:20:05] Explicit strategy :) [20:21:59] haha [20:22:44] OH [20:22:51] this is a hadoop HA prob [20:23:00] hm [20:23:06] ? [20:23:15] i think i know... [20:23:20] it isn't finding all the hadoop configs by default [20:23:28] ottomata: oh - hdfs issue [20:23:31] so it doesn't know what the name 'cloud-analytics' (the name of the hadoop cluster) means [20:23:31] ya [20:23:32] not arn [20:23:40] Yoooooh [20:23:44] Hm - Not nice [20:23:52] ya i think i know how to fix though hang on... [20:24:19] i didn't configure properly [20:24:20] " In some cases, such as when using federated HDFS or NameNode high availability, it is necessary to specify additional HDFS client options in order to access your HDFS cluster." [20:24:21] on it... [20:24:35] Ah [20:24:38] ok :) Thanks :) [20:29:33] getting new error now [20:29:33] Failed to list directory: hdfs://cloud-analytics/user/joal/pageviews_parquet/day=2018-01-01 [20:29:51] At least we have something [20:29:55] checking [20:30:42] ottomata: there is stuff in there [20:31:11] ottomata: permissions should be ok as well (755) [20:31:19] yeah [20:39:39] milimetric: ok, i cannot think of anything to overcome problem with "blah": "3.0" that does involve serializing that as a string [20:40:21] yeah nuria the only option might be to like edit the output of stringify with a regex or something crazy like that [20:40:57] the main problem is that 3.0 === 3 [20:41:02] milimetric: ya, like move "blah":"3.0" to "blah":3.0 [20:41:25] milimetric: even 3.0 ==3 (no triple equals) [20:41:35] yeah [20:42:05] I think the easier solution is just to not be strict on the validation and allow ints as floats [20:42:19] or more generally: upcasting is always ok [20:42:25] downcasting can still throw errors [20:42:39] that should be an option in any decent schema validator, 
actually [20:47:49] milimetric: i think that is going to be needed, i wonder ... [20:59:39] ottomata: no news I assume? [20:59:42] no news [20:59:56] thought it was maybe a hadoop version issue; hive connector uses a hadoop 2.7 jar [20:59:58] but it seems to work fine [21:00:04] i can use it to access hadoop fine [21:01:15] gonna try some debug logging [21:03:30] THAT HELPED [21:03:35] ok User: presto is not allowed to impersonate otto [21:03:37] getting somewhere! :) [21:06:30] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Add a tooltip to all non-obvious concepts like split categories, abbreviations - https://phabricator.wikimedia.org/T177950 (10Nuria) [21:06:32] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Tooltips on mobile sometimes interfere with the interface - https://phabricator.wikimedia.org/T212023 (10Nuria) 05Open→03Resolved [21:06:54] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Add a tooltip to all non-obvious concepts like split categories, abbreviations - https://phabricator.wikimedia.org/T177950 (10Nuria) 05Open→03Resolved [21:07:58] 10Analytics: Percentage increase should be removed from"all" time range on wikistats UI - https://phabricator.wikimedia.org/T205809 (10Nuria) [21:18:52] THERE WE GO JOAL [21:18:53] joal [21:18:55] try now [21:18:58] Yeah :) [21:19:04] i've manually done some fixes, will apply via puppet now [21:19:26] Working indeed :) [21:19:30] Yay ! [21:19:36] Testing a bigger query :) [21:20:19] joal: cool [21:20:19] http://ca-presto.wmflabs.org/ui/ [21:20:21] can see it ramping up [21:20:38] nice UI ottomata `:) [21:22:28] btw joal am busy copying here https://analytics.wikimedia.org/datasets/one-off/mediawiki_history/snapshot=2018-11/ [21:22:35] not done yet [21:22:46] I can imagine it takes time [21:22:54] its going through fuse mount.... 
[21:23:38] might be nicer if we productionize to tar it up (in hadoop hopefully) [21:35:03] ottomata: I don't think hadoop can tar [21:35:05] :( [21:37:51] haha, but we can shell out and then hdfs put [21:38:04] i guess its not that much, we can store temporarily on a statbox [21:38:41] oh. har [21:38:43] hadoop archive ? [21:38:48] 10Analytics, 10Operations, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10chasemp) 05Open→03Resolved a:03chasemp [21:39:06] I don't know that lad [21:39:30] https://hadoop.apache.org/docs/current/hadoop-archives/HadoopArchives.html [21:40:33] huh interesting [21:40:40] i wonder if you can add a hive partition with a har:// uri [21:40:44] and if it would just work [21:41:14] Would be fun :) [21:42:30] ottomata: https://gerrit.wikimedia.org/r/c/operations/puppet/+/480248 [21:44:33] ottomata: pageview-top (crosswikis) - 1 day: 40s - 7 days: 5:30s - 31 days: resource limit [21:46:33] chasemp: i got no problem with that, even if not temporary [21:46:37] exfat-fuse tho? [21:46:58] oh you gonna make an image or something? [21:47:12] ottomata: finding daily pageviews for a single page over 30 days: 27s [21:47:17] not bad [21:47:23] ottomata: gotta format an ext drive and then mount etc, I think util comes with format utils and fuse is for mounting? [21:47:27] joal: that is awesome [21:47:33] k [21:47:35] ya makes sense [21:47:39] +1ed chasemp [21:48:13] tx ottomata [21:49:56] for fun I was having a look at our favorite top pages nuria - For the month of June, there is A LOT more views from mobile than desktop [21:50:16] joal the history files are being copied in order mostly, its still going, but when you see the final file part-04095 [21:50:18] it should be done [21:50:35] oh its only at 01859 [21:50:37] got a while...
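[Editor's note] The "shell out and then hdfs put" idea above can be sketched in Python. This is a hypothetical helper, not production code; the HDFS step is guarded so it only runs where an `hdfs` client is actually on the PATH:

```python
import os
import shutil
import subprocess
import tarfile

def tar_and_put(local_dir, hdfs_dest, run_put=None):
    """Tar up a local directory, then 'hdfs dfs -put' the archive.

    If run_put is None, the HDFS upload is attempted only when an
    hdfs client binary is found on the PATH.
    """
    local_dir = local_dir.rstrip("/")
    tar_path = local_dir + ".tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname roots the archive at the directory name,
        # not the full local path
        tar.add(local_dir, arcname=os.path.basename(local_dir))
    if run_put is None:
        run_put = shutil.which("hdfs") is not None
    if run_put:
        subprocess.run(["hdfs", "dfs", "-put", tar_path, hdfs_dest],
                       check=True)
    return tar_path
```

The HAR alternative from the linked docs would be something like `hadoop archive -archiveName snapshot.har -p /user/joal pageviews_parquet /user/joal/archives` (paths hypothetical); whether Hive would accept a `har://` partition location, as wondered above, would need testing.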
[21:51:28] ottomata: will download tomorrow morning :) [21:52:01] i think it will take about a couple more hours [21:52:03] ok joal sounds good [21:54:11] Gone for tonight team - Thanks ottomata for making the beast work - Hopefully we'll test tomorrow :) [21:55:08] laters joal ! [22:14:05] joal: for xhamster? [22:14:08] or all? [22:14:26] ah sorry joal, i see you are gone for today [22:27:05] 10Analytics, 10Research, 10WMDE-Analytics-Engineering, 10User-Addshore, 10User-Elukey: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410 (10Neil_P._Quinn_WMF) >>! In T172410#4828023, @elukey wrote: > I would use the middle of March as soft deadline, since the... [22:31:49] !log restarted Turnilo to clear deleted datasource [22:31:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [22:50:57] !log restarted Turnilo to clear deleted datasource [22:50:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [23:18:55] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Mediawiki history has no data on IP blocks - https://phabricator.wikimedia.org/T211627 (10Neil_P._Quinn_WMF) [23:18:57] 10Analytics, 10Analytics-Data-Quality, 10Product-Analytics: mediawiki_history datasets have null user_text for IP edits - https://phabricator.wikimedia.org/T206883 (10Neil_P._Quinn_WMF) [23:19:00] 10Analytics: Provide edit tags in the Data Lake edit data - https://phabricator.wikimedia.org/T161149 (10Neil_P._Quinn_WMF) [23:20:44] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics: mediawiki_history missing page events - https://phabricator.wikimedia.org/T205594 (10Neil_P._Quinn_WMF) [23:22:46] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Add partial blocks to mediawiki history tables - https://phabricator.wikimedia.org/T211950 (10Neil_P._Quinn_WMF) [23:24:53] 10Analytics: Provide historical redirect flag in Data Lake edit data - 
https://phabricator.wikimedia.org/T161146 (10Neil_P._Quinn_WMF) [23:52:54] !log restarted Turnilo to clear deleted datasource [23:52:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log