[01:19:56] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3224615 (kaldari)
[01:20:37] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3224629 (kaldari) @Milimetric: Any thoughts on items 5 and 6 in the task description?
[01:21:26] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3224615 (kaldari)
[10:08:21] Analytics-Kanban: Update druide uniques to only contain projects having more than 1000 uniques - https://phabricator.wikimedia.org/T164183#3224787 (JAllemandou)
[10:08:32] Analytics-Kanban: Update druide uniques to only contain projects having more than 1000 uniques - https://phabricator.wikimedia.org/T164183#3224799 (JAllemandou) a:JAllemandou
[12:08:10] taking a break a-team
[14:00:19] hiiii joal yt?
[14:30:47] Analytics: Unique Devices on Pivot, initial screen should not add values by default, is this configurable? - https://phabricator.wikimedia.org/T164194#3225083 (Nuria)
[14:31:32] Analytics-Kanban, Patch-For-Review: Add daily unique devices dataset to pivot - https://phabricator.wikimedia.org/T159471#3225100 (Nuria) >Thanks, this is great! Unfortunately, in the current setup there is a lot of potential for confusion, because on the initial view it will show the sum over the unique...
[14:32:02] Hey ottomata !
[14:34:36] hiiiii, was going to ask you a quick q about https://gist.github.com/jobar/91c552321efbedba03c8215284726f88, but i have another q now too... :)
[14:34:43] oh, but i answered my first q already
[14:34:48] it was about some scala stuff, but i figured it out
[14:34:58] hmm, actually, will ask a diff version of it
[14:35:02] is it possible to do something like
[14:35:28] val Map[String, Seq[StructField]] = ...
[14:35:33] val m = ^
[14:36:09] m.map((name, field) => ... )
[14:36:11] or do I have to do
[14:36:17] m.map { case (name, field) => ... }
[14:36:17] ?
[14:36:24] milimetric: did rebase and indenting corrections, if you merge we should be reday to deploy
[14:36:28] *ready
[14:36:37] milimetric: hola! sorry
[14:36:41] it seems it would be nice to get named function parameters in anon function without having to explicitly use case to pattern match them
[14:37:10] or, if i had a list of Tuple2
[14:37:12] same thing
[14:37:19] list.map((key, val) => ...)
[14:37:20] vs
[14:37:26] list.map { case (key, val) => ... )
[14:37:29] }
[14:38:25] ottomata: I think you need to pattern-match
[14:38:51] ya, k, rats
[14:38:56] other q then,
[14:38:59] sorry :)
[14:39:09] in your buildDenormalizationLookup
[14:39:12] i would be using this from merge
[14:39:20] and i need to use it with field names from both original and new schema
[14:39:30] s1, s2
[14:39:44] but, it is possible that s1 may have a field called "fieldName"
[14:39:50] and s2 might have one called "fieldname"
[14:40:13] if always normalizing, this is fine, as those two will be considered the same field
[14:40:15] Analytics, Analytics-General-or-Unknown: Provide regular cross-wiki reports on flagged revisions status - https://phabricator.wikimedia.org/T44360#3225113 (Nuria) Seems that this is related to our conversations about "measuring community backlog" cc @milimetric
[14:40:19] but, if denormalizing
[14:40:33] in the map lookup, i'd end up with
[14:40:59] {
[14:40:59] fieldname -> fieldName,
[14:40:59] fieldname -> fieldname
[14:40:59] }
[14:41:10] with two entries for the same key (?)
[14:41:26] or will I end up with just the latest fieldName i end up looking at
[14:41:36] ottomata: only the latest
[14:41:38] aye
[14:41:49] ok, this is actually a problem for the non-recursive denormalize i was doing too
[14:42:05] ottomata: correct
[14:42:05] not totally sure what to do here, i guess we just need to pick a behavior
[14:42:15] ottomata: I think we should always denormalize
[14:42:29] ?
[14:42:32] normalize sorry --^
[14:42:34] oh ok
[14:42:35] right
[14:42:36] but
[14:42:48] we NEED to denormalize, in order to re-read the JSON data from hdfs using the schema
[14:42:48] I can't recall why we needed denormalization though
[14:42:56] if JSON has fieldName
[14:43:00] a schema with fieldname
[14:43:03] Yes !
[14:43:05] will end up setting fieldname = NULL
[14:43:06] I got it now
[14:43:33] hm... hm
[14:43:42] so, ya, i guess we can just say if s1 and s2 have lower case and non lower case versions of the same field
[14:43:52] when denormalizing, we choose one?
[14:43:59] we choose the nonlower case?
[14:44:26] ottomata: we should actually raise an error if we have 2 different fields that coalesce to a single normalized one, no ?
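Editor's sketch of the two Scala points raised above: the `{ case (k, v) => ... }` form used to destructure tuples in an anonymous function (required in Scala 2, which is what the log's era implies), and the fact that building a `Map` from pairs keeps only the last value for a duplicated key. The object name `LookupSketch` and all of its methods are hypothetical and are not taken from `buildDenormalizationLookup`; plain strings stand in for Spark `StructField` names, and lower-casing is assumed to be the normalization.

```scala
object LookupSketch {
  // Assumption: "normalizing" a field name means lower-casing it.
  def normalize(name: String): String = name.toLowerCase

  // Naive lookup from normalized name to original name.
  // If two names normalize to the same key, toMap keeps the later pair.
  def naiveLookup(names: Seq[String]): Map[String, String] =
    names.map(n => normalize(n) -> n).toMap

  // Normalized keys reached by more than one distinct original name.
  // Note the `case (_, originals)` pattern match on the (key, value) tuple.
  def collisions(names: Seq[String]): Map[String, Seq[String]] =
    names.distinct
      .groupBy(normalize)
      .filter { case (_, originals) => originals.size > 1 }

  // Strict variant: fail instead of silently keeping the last name seen.
  def strictLookup(names: Seq[String]): Map[String, String] = {
    val clashes = collisions(names)
    require(clashes.isEmpty, s"Field names differ only by case: $clashes")
    naiveLookup(names)
  }

  def main(args: Array[String]): Unit = {
    val names = Seq("fieldName", "fieldname", "id")
    println(naiveLookup(names))
    println(collisions(names))
  }
}
```

Whether a clash should throw, as in `strictLookup`, or only log a warning and pick one name is exactly the open question in the conversation; the sketch takes no side beyond showing both shapes.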
[14:45:08] hm
[14:45:26] this would happen if someone renames a field in a json schema
[14:45:31] from nonlower to lower
[14:45:32] ottomata: similarly as type changing, knowing we need to lowercase field names, it sounds reasonable to ask not to use case-misleading field names
[14:45:35] and those events end up in the same partition
[14:46:31] which, seems totally like something somebody might do
[14:46:32] ottomata: in the specific case you mention, I'd agree to check for f.lowerCase != f before inserting in the lookup map
[14:46:44] joal: in this case, we can choose something, and continue to work
[14:46:45] id'
[14:46:53] i'd lean towards handling it if we can
[14:47:15] that way, if it happens, the data will continue to be refined and queryable, and if it causes weird data problems, we'll know exactly why
[14:47:19] i'd log some warnings though
[14:47:22] ottomata: My concern is that it makes the thing also work for anti-pattern names
[14:47:40] hmmm
[14:47:41] but it's very feasible indeed
[14:48:12] true, but what if this is for some client side sent data, for which some have an old version with lowercase names, and some have a new version with uppercase names
[14:48:19] we'd likely get that data for a while
[14:48:31] and i'd rather have it be at least a little queryable for the other fields
[14:48:39] even if this one renamed field might have nulls
[14:48:41] in some cases
[14:48:45] ottomata: we can check that at schema merge time ... mwarf
[14:48:58] oo actually
[14:49:02] lemme think
[14:49:26] yeah, sorry, it would just be nulls
[14:49:34] the data would be missing
[14:49:38] milimetric: sorry, just saw you are out today!
[14:50:25] joal: at schema merge time?
[14:50:47] when merging schemas, we could recursively check for names equivalence - not fun
[14:50:54] like, when iterating, check if the current field has a matching normal or nonNormal field in original s1 and check the same in original s2, and then check to make sure those fields are the same case in both?
[14:50:56] yeah
[14:51:29] if we were going to throw exception, that would be the place to do it
[14:56:30] (PS13) Nuria: Changes api glue code to accept a project or array of same [analytics/dashiki] - https://gerrit.wikimedia.org/r/347305 (owner: Fdans)
[14:59:02] (CR) Milimetric: [V: 2 C: 2] "it's ok, I merged it from the couch :)" [analytics/dashiki] - https://gerrit.wikimedia.org/r/347305 (owner: Fdans)
[15:00:14] coming!
[15:01:29] elukey, fdans : stadddup
[15:02:24] ah wait, may 1st
[15:02:33] nuria: yep! :)
[15:20:48] Analytics-Kanban: Unique Devices on Pivot, initial screen should not add values by default, is this configurable? - https://phabricator.wikimedia.org/T164194#3225232 (Nuria)
[15:20:54] Analytics-Kanban: Unique Devices on Pivot, initial screen should not add values by default, is this configurable? - https://phabricator.wikimedia.org/T164194#3225233 (JAllemandou) a:JAllemandou
[15:28:27] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3224615 (Nuria) @kaldari: can you include a link to the source code for bot? Maybe it will benefit from using pageview.js client in terms of speed and asynchronicit...
[15:32:50] Analytics: Preserve userAgent field in apps schemas - https://phabricator.wikimedia.org/T164125#3222702 (Nuria) The capsule includes: { device_family: , browser_family: , browser_major: , browser_major: , os_family: , os_major,...
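Editor's sketch of the schema-merge-time check proposed at 14:50 above: walk two schemas together and report fields whose names match case-insensitively but are spelled with different case. The case class `Field` is a minimal stand-in for Spark's `StructField`/`StructType` (an assumption made so the example runs without Spark), and `MergeCheckSketch` and `caseConflicts` are hypothetical names, not code from the patch under review.

```scala
object MergeCheckSketch {
  // Hypothetical stand-in for a schema field: a name plus optional nested fields.
  final case class Field(name: String, children: Seq[Field] = Nil)

  // Compares the two schemas level by level and returns one entry per field
  // that exists in both under names differing only by case.
  def caseConflicts(s1: Seq[Field], s2: Seq[Field], path: String = ""): Seq[String] = {
    // Index the second schema by lower-cased name for this level only.
    val byNorm2 = s2.map(f => f.name.toLowerCase -> f).toMap
    s1.flatMap { f1 =>
      byNorm2.get(f1.name.toLowerCase) match {
        case Some(f2) =>
          val here =
            if (f1.name != f2.name) Seq(s"$path${f1.name} != ${f2.name}") else Nil
          // Recurse into nested structs, extending the dotted path.
          here ++ caseConflicts(f1.children, f2.children, s"$path${f1.name}.")
        case None => Nil
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val s1 = Seq(Field("id"), Field("user", Seq(Field("ID"))))
    val s2 = Seq(Field("id"), Field("User", Seq(Field("id"))))
    caseConflicts(s1, s2).foreach(println)
  }
}
```

A caller could throw on a non-empty result or only log it. Because names are compared per nesting level, the same name reused at different depths is not reported.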
[15:42:46] Analytics, Analytics-Cluster, EventBus: Delete stale topics from main Kafka clusters - https://phabricator.wikimedia.org/T149594#3225293 (Ottomata) Remember to have https://gerrit.wikimedia.org/r/#/c/349280/1 merged first.
[15:44:48] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3224615 (Niharika) >>! In T164178#3225256, @Nuria wrote: > @kaldari: can you include a link to the source code for bot? Maybe it will benefit from using pageview.js...
[15:45:17] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3225297 (Niharika)
[15:49:35] Analytics: AQS unique devices api should report offset/underestimate separately - https://phabricator.wikimedia.org/T164201#3225313 (Nuria)
[15:53:36] nuria: , yt?
[15:54:37] ottomata: yessir
[15:54:48] ottomata: should I join meeting?
[15:54:51] nuria: trying to understand this EL patch
[15:54:51] naw
[15:54:56] ottomata: k
[15:55:01] so, we want to ID events generated by bots
[15:55:05] ottomata: francisco's?
[15:55:08] yes
[15:55:18] and then not produce them to their original schema topic?
[15:55:23] yes
[15:55:25] instead, producing them to a special catch all topic?
[15:56:16] main objective is not to propagate bot traffic to db (unlike pageviews that is spam, mostly from google bot crawling android app)
[15:56:33] dan suggested to propagate that data elsewhere instead of dumping it
[15:56:49] but this second point is optional
[15:56:54] ottomata: makes sense
[15:57:33] ok, there is something in this that I do not like
[15:57:48] i don't like that we are hacking up the EL processor that does special logic based on certain event fields
[15:58:10] the trouble is that we need to look at the unparsed userAgent string to figure this out, right?
[15:58:17] so it needs to be done before that info is removed...
[15:59:29] I also really don't like that we are adding isBot to the event, and then later removing it
[16:06:39] ottomata: Heya
[16:07:02] ottomata: would you mind suspending webrequest deletion for a few days?
[16:07:23] ottomata: It would allow me to have daily and monthly global uniques back from March
[16:07:36] sure one min
[16:29:54] actually many mins... :) making lunch..
[16:29:57] but ya will do
[16:30:14] ottomata: before this evening please?
[16:33:46] ottomata: read your CR and agree with " EL processor so that it can do special logic based on event fields."
[16:34:01] ottomata: but not with changing the capsule which is a major, major ordeal
[16:35:15] ottomata: let's talk about this, also agree with naming and made same comment on my 1st cr
[16:37:07] ottomata: will have time by 1:30 pm your time
[16:39:07] (PS2) Nuria: Add support for monthly pageviews in metrics-by-project [analytics/dashiki] - https://gerrit.wikimedia.org/r/348990 (https://phabricator.wikimedia.org/T75331) (owner: Mforns)
[16:39:09] (CR) Nuria: [V: 2 C: 2] Add support for monthly pageviews in metrics-by-project [analytics/dashiki] - https://gerrit.wikimedia.org/r/348990 (https://phabricator.wikimedia.org/T75331) (owner: Mforns)
[16:56:39] ottomata: heya, just realized I messed up the cron command in the previous patch - submitting a new one
[16:58:55] joal: def before this evening, ok on patch ...
[17:06:38] ottomata: We had a warning on upload dataloss - any info on upload-cache or network an issue?
[17:06:54] ottomata: Sorry, I come to you when elukey is not here ;)
[17:43:41] Analytics: Preserve userAgent field in apps schemas - https://phabricator.wikimedia.org/T164125#3225627 (Tbayer) @Nuria: As the task name and description say, it's about preserving the userAgent field from getting auto-purged after 90 days in this case - not about adding new elements to it.
[18:03:38] Analytics-Kanban: Unique Devices on Pivot, initial screen should not add values by default, is this configurable? - https://phabricator.wikimedia.org/T164194#3225083 (Tbayer) (For reference, the previous discussion from which this arose: T159471#3223697 )
[18:31:21] joal: i'm poking around
[18:31:27] it looks like it must be something like that
[18:31:33] pretty even loss from all active upload servers
[18:31:35] hmmm
[18:33:05] i don't see any relevant varnishkafka logs
[18:33:07] which is weird
[18:41:35] hm
[18:42:09] talking to brandon in #traffic, he says there was some big 5xx mailbox issue during that time (don't know what that means yet)
[18:42:16] but, 5xx shouldn't be a loss problem, those should be logged
[18:43:23] joal: there are a number of records with 'null sequence' but i don't really know why that would happen
[18:44:02] ottomata: ok - Thanks for looking around
[18:44:17] ottomata: Have you checked the snapshot-date patch?
[18:44:39] ottomata: it corrects the previous one that should run tonight (that's why I ask)
[18:45:27] this joal? https://gerrit.wikimedia.org/r/#/c/351162/
[18:53:54] yes ottomata that one
[19:01:57] merged joal
[19:46:21] Thanks ottomata - I hope the thing will work as expected :)
[19:47:23] joal: i am proceeding to write code that chooses the denormalized name with the most capital letters :/
[19:47:28] i think you will not like that
[19:47:37] ottomata: /o\
[19:47:43] ;)
[19:48:22] ottomata: That is a convention, but man, it's one of those you never get if not told
[19:48:34] oh man
[19:48:35] actually
[19:48:41] just realized another problem
[19:48:45] the lookup map we are building so far is flat
[19:48:52] what if the schema is like
[19:49:23] { fielda: struct { FIELDA: struct { FieldA: ... } }
[19:49:26] }
[19:50:04] erggghh joooooal this is kinda crazy, we don't have to do this!
:)
[19:50:12] we only are doing this because we have to recursively denormalize
[19:50:44] ottomata: I'd go for easy: fields with same normalization are the same
[19:51:08] ?
[19:51:39] if field.name == field.name.toLowerCase, use field.name
[19:51:45] else, use field name found with most capitals?
[19:52:39] else, ensure there is only a single name with capitals that ends up as field.name.toLowerCase
[19:53:15] not understanding
[19:53:17] ottomata: I really prefer to have a convention that says that different fields should have different names
[19:53:26] joal: i'm beginning to agree
[19:53:27] now that i'm doing this
[19:53:38] i thought mayyybe we could figure out a convention that would work most of the time
[19:53:41] but this is a little crazy
[19:53:45] maybe i should just throw an error
[19:53:46] :/
[19:53:55] but, that means the same for struct field names joal
[19:54:08] we are requiring that field names at ALL levels are not repeated
[19:54:12] which seems pretty strict
[19:54:13] meaning
[19:54:15] one couldn't do
[19:54:48] { id: 1234, user: { ID: 456 } }
[19:55:28] also, btw joal, there is no functional reason (for hive) that we are normalizing the struct field names
[19:55:40] ottomata: I know ...
[19:56:15] we're only normalizing at all because spark is case sensitive for top level fields, and hive is not, so we get back case insensitive ones when we ask for the hive table schema
[19:56:24] we only need to compare what the hive table has, and what the incoming schema has
[19:59:11] ottomata: I'm going to think about that in bed ... structured-lookup?
[19:59:48] oof
[19:59:50] haha
[20:00:07] so complicated! but joal, all it is getting us is some feel good consistency in the hive schema
[20:00:18] true
[20:00:20] it doesn't practically change any usage of the table
[20:00:31] oook, go to bed! think about it!
i'll stop working on this part for now
[20:00:33] and commit what I have
[20:00:46] i'll take the rest of my afternoon to think about pivot/druid stuff :)
[20:00:48] tomorrow :) Thanks for the puppet merge
[20:00:56] yup! goodnight joal!
[20:23:11] (PS18) Ottomata: [WIP] Spark + JSON -> Hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[20:24:03] (CR) Ottomata: [WIP] Spark + JSON -> Hive (5 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[20:28:43] (CR) jerkins-bot: [V: -1] [WIP] Spark + JSON -> Hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[20:48:48] Analytics, Analytics-EventLogging, DBA: Update MariaDB on analytics-store to a version that supports JSON functions - https://phabricator.wikimedia.org/T164224#3226250 (Tbayer)
[20:49:01] Analytics, DBA: Json_extract available on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T156681#2983518 (Tbayer) Filed as a new task focusing on the upgrade (since this one had been closed and there was no response here)
[21:00:48] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3226285 (kaldari) @Nuria: What's the pageview.js client? Is there a link for that?
[21:07:14] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3226290 (Niharika) Possibly https://github.com/tomayac/pageviews.js
[21:13:00] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3224615 (MusikAnimal) >>!
In T164178#3226285, @kaldari wrote: > @Nuria: What's the pageview.js client? Is there a link for that? https://github.com/tomayac/pageview...
[21:15:50] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3226296 (MusikAnimal) One thing I'd like to know more about is the throttling. @Nuria how does that work? Does it force the 100 req/sec limitation on a per-IP basis?...
[21:20:09] Analytics, Analytics-EventLogging, DBA: Update MariaDB on analytics-store to a version that supports JSON functions - https://phabricator.wikimedia.org/T164224#3226303 (jcrespo) Open>declined MariaDB 10.2 is not stable for production usage as I am writing these lines, it wasn't on T156681, an...
[21:28:01] Analytics, DBA, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3226328 (jcrespo) > That would put us into next FY Q1 for decom of db1047. Let's buy 2 new servers, and reassign them if the functionality is replaced. Knowing typical medi...
[21:46:02] Analytics, DBA, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3226387 (Ottomata) Hm, ok, fine with me. Should we not do T159266 then and just wait for new boxes?
[21:54:41] Analytics, DBA, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3226400 (jcrespo) No, let's do that still- assuming no parts will be bought. No harm on a quick reboot and we will not buy anything until next fiscal year (months away).
[23:04:38] (PS1) Nuria: Changes datasets api to accept a project or array of same [analytics/dashiki] - https://gerrit.wikimedia.org/r/351210
[23:25:45] milimetric, reviewed the SpamBlacklist change, and chained it on top of tests for SpamBlacklist itself (same links, but ported to test it directly, plus one new).
Ping me if you want to discuss.
[23:33:37] (CR) Nuria: [C: -1] "This changeset also needs to fix compare layout" [analytics/dashiki] - https://gerrit.wikimedia.org/r/351210 (owner: Nuria)