[06:41:13] Hello! Earlier metrics.wikimedia.org used to have a download option for invalid wikimedia ids. How to do the checking from my side? And is it possible to have such feature any where? [14:26:25] (CR) Ottomata: (WIP) project class/variant extraction UDF (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/188588 (owner: OliverKeyes) [14:37:06] milimetric, around? [14:37:19] hi yurik [14:37:21] hey [14:37:22] morning [14:37:25] need your help ) [14:37:27] k [14:37:34] a small help and a big help ) [14:37:52] uh oh, this is a lot of build up :) [14:37:58] )) [14:38:05] will start with big: [14:38:06] I got stand up in 20 min. but i'm yours till then [14:38:22] there is one major issue preventing release imo [14:38:57] namely, there is no way in vega to supply a function with the rendering request to fix URLs [14:39:17] in other words, i need to resolve relative URLs somehow [14:39:33] Analytics, Analytics-Cluster, Analytics-General-or-Unknown, Analytics-Kanban, Datasets-General-or-Unknown: pagecounts stats are behind by about 16 hours - https://phabricator.wikimedia.org/T89771#1058852 (Ottomata) Open>Resolved a:Ottomata [14:39:36] is the way you're doing it now hacky? [14:39:41] i saw it was working but i didn't dig into it [14:39:47] e.g. /wiki/Blah?action=raw needs to be prepended with the server [14:39:51] yea [14:39:51] well, it does work [14:40:06] but i suspect it only works because i don't have concurrent multi-server calls [14:40:20] Analytics-Cluster, operations: Audit analytics cluster alerts and recipients - https://phabricator.wikimedia.org/T89730#1058856 (Ottomata) p:Triage>Normal [14:40:21] node will handle thousands [14:40:27] i'm somewhat familiar with the code that parses the spec, lemme see if there's a way to make a plugin or something [14:40:41] exactly - but it has to be PER request [14:40:44] it cannot be global [14:40:53] hm? [14:41:06] (PS1) Mforns: [WIP] [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/192319 (https://phabricator.wikimedia.org/T89251) [14:41:08] i have to supply a function as one of the parameters [14:41:12] (CR) jenkins-bot: [V: -1] [WIP] [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/192319 (https://phabricator.wikimedia.org/T89251) (owner: Mforns) [14:41:13] to rendering [14:41:22] or something declarative like: [14:41:39] hm... [14:41:51] basically when my function resolves the URL, it has to know the original context [14:41:58] of the request [14:42:21] right, ok, one sec [14:42:28] btw, did you see my modifications to the vega? [14:42:46] i mean, worst case we can just pass the original request in a transformRelativeUrls (spec, request) function [14:42:54] Hello! Earlier metrics.wikimedia.org used to have a download option for invalid wikimedia ids. How to do the checking from my side? And is it possible to have such feature any where? [14:43:01] yes, there was a lot so I didn't look closely [14:43:10] was that just 'cause you were upgrading to 1.4.3? [14:43:15] er 1.4.4? [14:44:12] this is my repo [14:44:13] https://github.com/nyurik/vega [14:44:36] i was doing some other loading cleanup because it also didn't handle relative protocols [14:45:32] * Ironholds misread as Venga, thought it was more fun than it was :( [14:46:05] :) milimetric, and also i found that url sanitizing was not done when using images [14:46:23] ah, i see [14:46:33] https://github.com/nyurik/vega/compare/trifacta:master...master [14:47:02] yurik: is this as easy as just setting vg.config.baseURL ? [14:47:07] https://github.com/trifacta/vega/blob/master/src/data/load.js#L2 [14:47:23] milimetric, that's the point that i cant do that [14:47:32] config is global [14:47:43] but if you set it on every request... [14:47:46] node will get tons of simultaneous servers [14:47:53] that doesn't matter [14:47:58] it's JS - it's single threaded [14:48:18] it is, but it is massivelly multi-concurrent [14:48:20] once you set it in your request handler, it'll stay set in that context until it returns [14:48:42] worst case you could require('vega') each time... [14:48:45] while it waits for one request's io, it will process others [14:48:46] but you don't need to do that [14:49:53] ok, hm, you may be right [14:50:11] so then maybe just require('vega') for each new request domain, and keep them in a hashmap [14:50:23] ouch [14:50:39] should be ok... this thing takes up like 0 memory [14:51:06] ok, i will try that [14:51:15] in the mean time, can you ansewre the simple question: [14:51:48] for this code: https://github.com/nyurik/vega/blob/master/src/data/load.js [14:51:48] btw, you can test whether you're right or wrong pretty easily, just make it download a huge dataset and make a second request [14:51:56] how do i inject my own sanitizer func [14:52:03] what line? [14:52:17] 14 [14:52:30] i need to overwrite that function from calling code [14:52:42] basically set it to be different [14:52:45] overwrite always or just in some cases? [14:52:48] i could modify the code of the fun itself [14:52:51] always [14:53:11] I think vg.data.load.sanitizeUrl = yours [14:53:17] doesn't work [14:53:24] never gets called [14:53:36] hm... suspicious... how are you testing? [14:55:04] https://git.wikimedia.org/blob/mediawiki%2Fservices%2Fgraphoid/2226f6b4bfe288ca4a670b75a2cee7f1ffdaecd5/routes%2Fv1.js#L112 [14:55:08] milimetric, ^ [14:57:12] I think you want "var url" a couple lines before that [14:57:15] otherwise it's global [14:57:45] but that's not a problem here [14:57:49] milimetric, multiple require won't work [14:57:52] (CR) OliverKeyes: (WIP) project class/variant extraction UDF (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/188588 (owner: OliverKeyes) [14:58:00] i just tried - require('vega') returns the same object [14:58:45] yeah, makes sense hm... http://stackoverflow.com/questions/9210542/node-js-require-cache-possible-to-invalidate [14:59:37] (PS4) OliverKeyes: (WIP) project class/variant extraction UDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/188588 [14:59:45] you can delete it from require.cache, but then you're getting dirty [15:00:02] gtg to standup, i'll check out sanitize after [15:00:24] thx, but context-based sanitization is more important )) [15:13:34] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Force Hue https redirects. - https://phabricator.wikimedia.org/T85834#1058970 (ggellerman) [15:14:04] Analytics-Cluster, Analytics-Kanban, operations: Audit analytics cluster alerts and recipients - https://phabricator.wikimedia.org/T89730#1058972 (ggellerman) [15:52:24] (CR) QChris: [C: -1] "Thanks for the review." [analytics/refinery] - https://gerrit.wikimedia.org/r/191118 (owner: QChris) [16:09:43] (CR) Nuria: "Oliver, you have <<< merge markers on the code. Please take a look, I think you might have committed some unmerged code." (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/188588 (owner: OliverKeyes) [16:13:34] (PS5) OliverKeyes: (WIP) project class/variant extraction UDF [analytics/refinery/source] - https://gerrit.wikimedia.org/r/188588 [16:24:18] gping to cafe bbs [16:48:34] hm, joal [16:48:35] yt? [17:19:08] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Reliable scheduler computes Visual Editor metrics [21 pts] {lion} - https://phabricator.wikimedia.org/T89251#1059375 (Aklapper) [17:19:27] joal yoyo [17:19:39] Yo Man ! [17:20:11] better? [17:20:21] Yes, works now [17:20:32] But I don't get why it didn't previously :) [17:20:33] so, i was pinging you before, because we are probably going to want to integrate these udf results into the refined table too [17:20:41] https://gerrit.wikimedia.org/r/#/c/188588/ [17:20:54] sooo, if we have to recompute the whole table, maybe we should do this one at the same time too ?" [17:21:28] Yup, sounds a good idea [17:22:02] k cool [17:22:12] lemme know if you need help with gerrit stuff [17:22:13] Now a question is : should we recompute everything everytime there is a new interestintg field? [17:22:32] I don't have a good answer here [17:23:00] no [17:23:15] recomputing is expensive and a pain [17:23:28] let's set precedent that we won't [17:23:30] Yes, that's the point [17:23:36] ok [17:24:13] I think sometimes we may _have_ to to correct errors but should not be default [17:24:13] ok [17:24:34] Sounds right [17:24:50] ottomata, you're with us here ? [17:25:30] To prevent recompute, the idea is then to add udf's for users to use [17:25:41] Whenever possible [17:26:33] tnegrin: psshHHHH [17:26:52] tnegrin: we are talking about adding fields to the refinfed table [17:26:53] things like [17:27:14] client_ip, geocoded fields, etc. [17:27:17] is_pageview was one of these [17:27:57] why would we recompute tho? [17:27:58] hive (+parquet) doesn't support addition of fields well, so when we want to add fields, there are 3 options: [17:28:12] 1. drop all the old data and just let newly refined data have the new fields [17:28:35] 2. recreate the refined tables with null for the old data, and let the refine process insert newly computed data [17:28:37] or [17:28:52] 3. recreate the refined tables with the new fields computed, [17:29:52] * Ironholds raises hand [17:29:54] suggestion? [17:30:04] ok -- we can't drop the old data [17:30:06] let's wait on the recomputing until we have the pageviews definition locked and loaded [17:30:09] we can [17:30:11] because we do! [17:30:11] :) [17:30:15] that way we only have to recompute once for three fields, yes? [17:30:19] in the parquet table? [17:30:20] sure, but when we have it tested ;p [17:30:26] the QA is still going. [17:30:37] Ironholds: one thing at a time [17:30:40] Ironholds: i'm fine with waiting, but in general, we wouldnt' recompute fields like is pageview if the def changes [17:30:46] we are only recomputing because of a schema change [17:30:52] ..hangon. Meeting [17:31:10] ottomata: what happens when we can't recompute because we've deleted the source data? [17:31:18] too bad? [17:31:23] that's what we do at wmf :) [17:31:43] my point is not "let's recompute every time the PV def changes" [17:31:46] does that mean we lose the old data? that seems unfortunate [17:31:48] Ironholds: agree. [17:31:48] tnegrin: i'm not talking about recomputing generated datasets [17:31:52] it's "the new def is still a prototype. Let's wait until it's not a prototype. [17:31:59] which is, like...the end of the week. [17:32:03] Ironholds: please stop talking [17:32:08] just for a sec [17:32:10] i'm talking about the data in the webrequest tables [17:32:15] which, we delete anyway [17:32:26] ok -- but not the parquet tables? [17:32:30] the refined table? [17:32:32] ya sure. [17:32:33] yes [17:32:41] we delete that too (or rather, we will) [17:32:45] it has private data [17:32:50] mmm…ok [17:33:04] so there will be a summary table for aggregates [17:33:17] ja, e.g. pagecounts-all-sites [17:33:22] ok cool [17:33:52] so we agree that we just nuke the old data in the raw and refined tables when we recompute? [17:34:06] only refined [17:34:16] ok -- makes sense [17:34:16] well, if we nuke it, that would mean that we would,n't have the last 30(-90) days of data for researchers to work with [17:34:31] if um, by nuke you mean delete. [17:34:33] that is option 1 [17:34:41] which isn't the ideal option, i think [17:34:45] it is the easiest for us [17:34:46] :) [17:34:53] but i don't think researchers would like that [17:35:08] perhaps we create a new table for the new schema [17:35:08] we'd be all: hey, we changed the schema, so you will only have future data to work with [17:35:13] yes, that is an option. [17:35:19] well [17:35:26] why don't we do that? then the old table will just be deleted eventually [17:35:27] sure, we could version the table names. [17:35:49] rigiht now, i think i would prefer if we renamed the old table, and made a new one that gets stuff inserted into it, adn we just told people about it. [17:36:00] that way running oozie jobs woudln't have to be restarted [17:36:03] works for me [17:36:20] joal, that's fine with me too, if we don't actually recompute the old data, but just kept the old table around as a different name [17:36:36] Ok, but I still want to press the fact that we'll recompute a big bunch of stuff [17:36:38] we have to make sure that oozie jobs are done with it before we rename it...or just backfill the new table with just a little bit of data [17:36:40] maybe a days worth [17:36:42] oh? [17:36:43] why? [17:36:47] We don't want to do that every morning :) [17:36:49] if we didn't have to backfill at all [17:36:52] it would just be [17:36:59] rename webrequest to webrequest_old; [17:37:04] create new webrequest [17:37:08] change refine job. [17:37:09] that's it, no? [17:37:41] ok did'nt get it [17:37:48] Find by me as well [17:37:51] fine soory [17:37:57] then eventually, drop webrequest_old; [17:37:57] :) [17:38:03] yop [17:38:07] yeah, i think we should backfill a day though...or, maybe not, hm. maybe just be really careful [17:38:13] the oozie jobs lag behind by about 2 hours [17:38:28] so, we need to make sure they don't miss those 2 hours of data if we just renamed the table [17:38:38] Yes [17:38:54] if we just backfilled the new one by a few hours, that would solve that problem, i guess? [17:38:56] Maybe launch parallel ozzie jobs for a day or two is a good idea [17:38:59] oof [17:39:02] :) [17:39:03] doesn't sound like fun [17:39:05] ok [17:39:09] oozie wranglin is not my favorite passtime :p [17:39:17] Then back fill by hand, and stop/start oozie ; [17:39:31] sound right [17:39:34] if we backfill a few hours before we rename, we don't have to start/stop oozie... [17:39:36] pretty sure anyway [17:39:43] I understand [17:39:45] i guess hm, not sure what happens to a running query if a rename happens [17:40:04] ja, we can work out the details around that [17:40:42] Ok, so let's plan the thing, and then communicate about the coming change [17:40:45] right ? [17:41:16] aye [17:52:30] I'm leaving for 90 minutes (taking my dog to the vet) [17:52:33] By the way, since we rename, let's also be good about naming conventions here: refined_webrequest_v1 for instance ? [17:53:25] joal: you mean the old table? [17:53:33] yup [17:53:40] and obviously, the one as weel ;) [17:53:45] sure, we can call that whatever, you could match the version with the refinery version, if you like [17:53:49] ah the new one? [17:53:57] that might be a good idea, i'm not certain. [17:54:15] maybe webrequest_refined_current ? [17:54:27] Just to position the fact that it might change [17:54:41] why not just have the convention that webrequest is the current one? [17:54:47] This naming also facilitate documentation [17:54:51] hm [17:54:55] ok, good for me :) [17:55:13] It doesn't tell how many versions before, but that's ok [17:55:17] in general, i'm not sure i like the idea of versioning the main table, as then we have to change jobs, etc. but maybe. [17:55:38] what about a field? [17:55:49] that said what version a record was generated with? [17:56:13] that way, if we change the pageview def that oliver is talking about, it would be clear what was used [17:56:14] Why not, but doesn't change the schema problem [17:56:26] ja, doesn't help with schema changes [17:56:30] but would help with documentation, as you say [17:56:35] Then we get a major/minor version [17:56:40] major : schema change [17:56:49] minor, definition change only [17:56:54] ja, but so much has to change in order to change the table name, including telling people where to look when [17:57:22] it is parameterized in oozie jobs but uhHnnngn. so many things :/ [17:57:41] Yeah, but I prefer communicating around change and having people have to change there thing instead of breaking stuff [17:57:53] yup, I understand [17:58:14] ha, i mean, i know you are right, overall it is the more correct thing to do, as it makes everything explicit [17:58:34] but uhhhhh i suppose if you want to take on making sure things are cool and running with version changes, I'm not gonna stop you :) [17:58:40] My point here is : if it changes, clients should update if they want to [17:58:55] they will have to anyway, as eventually their old version table will be deleted [17:59:16] Yes, and data won't flow in [17:59:26] But it doesn't break :) [17:59:34] errghghhh, i dunnoOOoooooo [17:59:38] this is a hive problem! [17:59:49] if we were using avro, we could deal with backwareds compatible schema changes! [17:59:51] not only, this a schema versionning proble :) [18:00:16] schema versioning yes, but the version should be on the record, the only reason we are changing the data access method (table name) is because of hive [18:00:31] true [18:00:49] NoSQL vs SQL, kind of ;) [18:01:11] has, kind of ja [18:01:12] ha [18:02:11] hm, i'm not full against this, but couldn't we just have good documentation around this, rather than making people change code and workflow everytime? [18:02:22] announcements, etc.? [18:02:36] Good documentation is core [18:02:44] anouncements are important as well [18:03:15] I also hate the idea of having to version everything etc [18:03:26] But Ithink there is no other way really [18:03:35] Data will change, no doubt on that [18:04:20] aye, but versioning the records is not enough? [18:04:52] should be, in self-contained-schema world [18:05:23] i think in most cases, the changes in the table will be backwards compatible, its just the hive doesn't let us alter the table in place. [18:05:41] so, in most cases, the schema change will not affect users' running code [18:06:00] I do agree [18:06:08] We will mostly add, not remove [18:06:24] Add usually hive allows adding fileds [18:06:27] if records are versioned, and they notice a difference in their output, they can check for version changes to help explain why [18:06:35] It's hive + parquet which doesn't [18:06:53] Versioning filed is good for sure [18:07:06] Now how do we handle the hive thing, this is another issue [18:07:16] hm, reading this now: https://github.com/Parquet/parquet-format/issues/91 [18:07:17] : [18:11:10] hm i dunno, maybe we shouldn't let hive bend us to its will? [18:11:12] :D [18:11:28] i think we handle the hive thing with an annoucement [18:11:38] we won't have to do this often, if we version the recods anyway [18:11:47] yeah ... [18:11:58] I still don't like it very much [18:13:25] I'll think about it a bit more [18:13:38] And by the way, no way to get git reiew working [18:13:42] grrrrr [18:13:42] no? [18:13:45] what's up? [18:14:04] ----------------------- [18:14:04] Problems encountered installing commit-msg hook [18:14:04] The following command failed with exit code 104 "GET https://gerrit.wikimedia.org/tools/hooks/commit-msg" [18:14:07] ----------------------- [18:15:12] Because of anonymous clone I guedd [18:17:44] yup, found it [18:17:57] ah ja [18:18:19] i think it gets the hook by sshing into gerrit.wikmiedai.org, and if you don't have the ssh url there it might not work [18:19:02] Sounds reasonnable :) [18:27:34] (PS1) Joal: Add client ip and geocoded data to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/192363 [18:27:56] Huuray ! [18:29:33] :) [18:30:14] HMM [18:30:16] joal! [18:30:23] tell me [18:30:32] i think we can do this, without renaming and without recomputing. [18:30:39] just tried this [18:30:40] ALTER TABLE parquet_testA [18:30:40] ADD COLUMNS (`new_column` string) [18:30:40] ; [18:30:45] now [18:30:56] that does not change the parquet file schema for existing data [18:31:07] Try elect some data now [18:31:11] in the new table [18:31:15] so [18:31:16] yes [18:31:28] it will not work if you try to access the new field on old partitions that have the old schema [18:31:33] but, as long as you don't do that, its fine [18:31:47] Let me double check that [18:31:58] triple checking too [18:34:47] Ok, got it ! [18:35:05] 'SELECT *' won't work anymore, but that's all [18:35:15] it should thoug, no? [18:35:17] just not on old partitions [18:35:41] correct [18:36:55] Usual hive way is to use NULL as default value for columns that does not have values (at the en of the row) [18:37:04] Here with parquet, it beaks [18:37:22] But you good man :) [18:37:30] Will prevent us from some trouble [18:37:52] hah, ja, makes sense, because the parquet schemas are on the files themselves [18:37:55] However, we should still have a vcersion field ;) [18:38:03] and really, hive treats partitions almost as individual tables [18:38:05] agree. [18:38:19] ok, that's good news :) [18:38:30] so, as long any single partition doesn't have both schemas in it, all is well [18:38:58] ha, you can even run alters statemnts on particular partitions [18:39:02] which woud,n't make sense in our case [18:39:06] since this is an external table [18:39:21] all the alter table will do is make new parquet data written by hive have the new schema [18:39:37] Yes [18:40:56] Alter statement for partitions are not schema oriented though [18:49:39] ori: yt? [18:58:17] (CR) Nuria: [C: 1] "Changes look good provided that we have test them on the cluster. I leave it up to Andrew to merge them." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/192363 (owner: Joal) [19:02:12] nuria: hey [19:02:25] ori: hola amigo [19:02:56] ori: first things 1st, whenever you can please take a look at these two: https://gerrit.wikimedia.org/r/#/c/192332/ [19:03:12] and https://gerrit.wikimedia.org/r/#/c/191231/ [19:03:26] and 2nd: we can talk about vanadium replacement if you have a few minutes [19:08:40] (CR) Nuria: (WIP) project class/variant extraction UDF (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/188588 (owner: OliverKeyes) [19:08:46] nuria: the current rate of NavigationTiming events is 7 per second [19:10:34] ori: at peak maybe but sustianed is a lot lower: http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1424718595.846&target=eventlogging.schema.NavigationTiming.rate&from=00%3A00_20150207&until=23%3A59_20150223 [19:11:08] nuria: but you say "at least 1 per second" and i don't see it dropping below that in your graph [19:12:21] ori: right right... should say >1 event per sec sustained [19:12:42] well, that condition is still met by the current configuration [19:13:34] i'm looking at the last few months and i don't see it dropping below that threshold except when it was actually down [19:17:50] milimetric, hi, any luck? [19:18:06] ah, side-tracked [19:18:32] i'll try sanitize now [19:18:47] ori: it's happens in a few occasions that we have gone < 2 and larms get trigger [19:19:01] *alarms [19:20:08] nuria: you keep changing the desired theshold! :P now it's < 2 [19:20:10] ori: but alarms are <2 per sec .. which ahem.... you are right it's not [19:20:11] make the alarms less sensitive? [19:20:18] what i explained on the commit message [19:21:05] ori: even when the rate have been so up to date [19:21:20] ori: cause the alarms thresholds were set months ago [19:21:51] ori: if you are not concerned on data being more sparse alarms can be lowered that is fine too [19:22:58] yes, thanks [19:24:41] ori:ok [19:30:16] ori: done: https://gerrit.wikimedia.org/r/#/c/192375 [19:32:49] nuria: can you document the rationale in the commit message? [19:33:04] ori: sure, wip [19:34:08] yurik: yeah, there's a bug in your vega code. You needed var url = vg.data.load.sanitizeUrl(uri); on this line: https://github.com/nyurik/vega/blob/master/src/data/load.js#L61 [19:34:42] instead of just plain "sanitizeUrl" because that's putting the sanitizeUrl function in a closure and not letting you change it outside [19:35:08] basically, you accidentally made a private function [19:35:19] milimetric, thanks! will fix. What about more general case? How hard would it be to introduce context into rendering? [19:35:45] I'm going to look at their binary executables, have you seen those? [19:36:26] aha! yurik this would solve your problem, no? [19:36:42] it's got vg2png and vg2svg, you can just pass -b http://base/ to it [19:37:02]