[06:16:29] (PS1) Sahil505: [WIP] Added CSS custom properties using postcss [analytics/wikistats2] - https://gerrit.wikimedia.org/r/437387 (https://phabricator.wikimedia.org/T190915)
[06:37:32] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4255890 (elukey) Reporting an IRC discussion in here. It would be great to make a list of next steps for: * remove IPSEC completely between Jum...
[08:47:32] Analytics, EventBus, ORES, Patch-For-Review, and 3 others: Numeric keys in ORES models causing downstream Hive ingestion to fail - https://phabricator.wikimedia.org/T195979#4243147 (awight) n.b., we'll also want to revert https://gerrit.wikimedia.org/r/436529
[09:23:06] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4256392 (Pchelolo)
[10:07:37] * elukey early lunch!
[11:15:56] Analytics, WMDE-Analytics-Engineering, User-Addshore: dbstore1002 (analytics store) enwiki lag due to blocking query - https://phabricator.wikimedia.org/T175790#4256638 (jcrespo) Open>Resolved Not happening at the moment.
[11:42:01] Analytics, EventBus, MediaWiki-JobQueue, Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4256729 (Pchelolo)
[11:42:06] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4256726 (Pchelolo) Open>Resolved We have switched all jobs except certain outstanding problematic ones and we have...
[12:02:40] Analytics, Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching): Consider increasing retention for mediawiki event topics - https://phabricator.wikimedia.org/T196409#4256772 (mobrovac)
[12:05:45] Analytics, Analytics-Kanban, EventBus, Wikimedia-Stream, Services (watching): EventBus should produce messages to Kafka with event time set to meta.dt - https://phabricator.wikimedia.org/T196407#4256795 (mobrovac)
[13:08:45] Analytics: Access request for Superset: - https://phabricator.wikimedia.org/T196458#4257350 (Jseddon)
[13:24:32] Analytics, EventBus, ORES, Patch-For-Review, and 3 others: Numeric keys in ORES models causing downstream Hive ingestion to fail - https://phabricator.wikimedia.org/T195979#4257471 (Ottomata) Oh, cool, the fix is deployed? If so I'll reenable Hive stuff.
[13:30:35] Analytics, EventBus, ORES, Patch-For-Review, and 3 others: Numeric keys in ORES models causing downstream Hive ingestion to fail - https://phabricator.wikimedia.org/T195979#4257508 (awight) @Ottomata Pending deployment, we'll update here when deployed.
[14:49:21] mforns: wanna chat in cave a little?
[14:49:37] milimetric, hi! yea, give me 3 mins please
[14:49:51] anytime
[14:54:34] milimetric, omw!
[15:00:05] (PS2) Mforns: Allow partial whitelisting of map fields [analytics/refinery/source] - https://gerrit.wikimedia.org/r/437269 (https://phabricator.wikimedia.org/T193176)
[15:00:37] (CR) Mforns: [C: -1] "Still have to test this with real data." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/437269 (https://phabricator.wikimedia.org/T193176) (owner: Mforns)
[15:07:52] Analytics, Analytics-Kanban, EventBus, Wikimedia-Stream, and 2 others: EventBus should produce messages to Kafka with event time set to meta.dt - https://phabricator.wikimedia.org/T196407#4257802 (Ottomata) Ah, we need to change a Kafka broker setting to support this too. `log.message.timestamp....
[15:48:06] mforns: still got time today?
[15:55:05] milimetric, sure
[15:55:28] ping me whenever you can chat
[15:56:03] mforns: hm, maybe i should just try to get some progress on fixing the redirect routing and the date stuff, and then we can talk over the patch, might be more productive?
[15:56:16] elukey: sorry, repeating question from yesterday
[15:56:35] how you prefer milimetric, I'm available
[15:56:48] elukey: did we decide not to add maxmind ip info to webrequest_sampled?
[15:56:57] k, I think I have a good direction from your review - try to polish it up, then we can talk after I get the patch in
[15:57:13] ok milimetric, lmk :]
[15:59:36] nuria_: iirc it was also something that we did for webrequest, the dimension's cardinality would have been too high..
[16:00:01] ah wait maxmind ip doesn't mean ip
[16:00:04] misread sorry
[16:00:10] elukey: ah,
[16:00:19] elukey: maxmind Ip provenance
[16:00:23] elukey: makes sense?
[16:00:39] so what I added was Country and Continent from maxmind, and as-number + isp
[16:00:56] so 4 new field
[16:00:59] *fields
[16:01:04] elukey: the webrequest_sampled _128
[16:01:09] has Ips https://turnilo.wikimedia.org/#webrequest_sampled_128/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhqZqJQgA0hEAtgKbI634gCiaAxgPQCqAKgMIUQAMwgI0tAE5k8AbVBoAngAcmBDHSGTaw5gH09GppSMAFKVgAmM+SEsxJ6LLgKmAjABEhS1cwTpaKDQhINplfABaNwBfAF04yihlJDQbWMo6OFhtGVBoAFkYcQh8YVJaRIhsAHMENRAACwhUoWoijHxZRuayeMplKuxaSw8aWmwoZzT+weGAZUxJYIImlso66vHLfGwihEom6oakI+XdhARooA
[16:01:35] elukey: but not country
[16:01:36] ah! My bad then, I thought the opposite
[16:01:45] it's country code
[16:01:48] sorry :)
[16:02:04] elukey: nor as
[16:02:27] well it has Country Code and AS Number
[16:03:49] nuria_: so to summarize: AS Number, Isp, Country code, Continent
[16:04:05] and the Ip was already there but I forgot about it :)
[16:04:35] elukey: ok, but ... ahem.. those dimensions do not appear on turnilo yet correct?
https://turnilo.wikimedia.org/#webrequest_sampled_128
[16:05:30] I can see them in the left column, but maybe it is not what you are looking at
[16:05:51] https://turnilo.wikimedia.org/#webrequest_sampled_128/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhpwBGCApiADSEQC2ZyOFBAomgMYD0AqgCoDCVEADMICNGQBOUfAG1QaAJ4AHZjXpDJZYfhAB9PRg3UjZAApSsAExl55IKzEnosuAmYCMAESFLVuhHQyKDQhELJlfABaDwBfAF0E6ihlJDRbez81EMkIbABzISs6MmwoV112HAxsUtDkzElQvFAtHQIACwh0oogtdgwcXStg9lLigpBY6iRabvwAVgAGJJB6OFgtW1BoAFkYcQh8YUQoMmS8/PJdLp7qWgOMORBbmUTqZTzaqy8SstcMvEPl8yFYAMqNZovboyGZkfLjfDYA
[16:05:57] 4IahdfIdJAY5rIhAIWJAA==
[16:06:00] what a horrible paste
[16:06:36] https://bit.ly/2xNOrGs
[16:07:43] ottomata: https://phabricator.wikimedia.org/T196079 looks gooood, if you have time can you double check?
[16:10:08] elukey: now! i see it , it was below the fold, thank you
[16:12:07] ah nice! For a moment I thought that everything was horribly wrong :D
[16:13:13] elukey: +1 from me
[16:13:55] elukey: ya, it is a css issue with screen size, but good to know
[16:15:18] ottomata: ack thanks!
[16:50:17] fdans: added you as reviewer since you love timezones: https://gerrit.wikimedia.org/r/#/c/437479/ :)
[16:50:36] hoisted by my own petard
[16:50:51] cool ottomata!! will take a look in a lil bit
[16:51:11] danke
[16:51:13] no hurry at all!
[16:58:49] * elukey off!
[17:40:36] neilpquinn HaeB halfak|Lunch: can you tell me about the graph of improving new editor retention? https://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Foundation_Audiences_metrics_Q3_2017-18_(Jan-Mar_2018).pdf&page=31
[17:40:39] Two things, in particular: what are the absolute numbers like, and is there an easy way to calculate this metric for an arbitrary set of users? I'm curious how many of those retained new editors are Wiki Education students.
[17:40:39] If I'm interpreting the notes correctly, this is defined as someone who made an edit in their first thirty days after registration who also made an edit in their second thirty days?
That definition would end up counting most of our student editors as retained.
[17:42:20] ragesoss: I can't give a full answer right now (try shooting me an email) but yes, you have the definition right.
[17:43:18] ragesoss: we have an internal data set (edit data lake) that we hope to have on labs at some point this year that makes it possible to calculate things such as these easier than it might be
[17:43:19] so one of the things I want to look at is retention over a longer period (say 6 months rather than just 2) and see whether the bumps go away
[17:43:33] using data from the replicas
[17:43:36] neilpquinn: cool. will do. do you have a rough idea (order of magnitude) for the '
[17:43:44] If they do, that will suggest (as you were probably thinking) that students doing classwork is the main factor
[17:43:48] new editors' per month that this retention rate is based on?
[17:44:43] neilpquinn: yes, that's right; I would expect the bumps to go away or be smaller for a 6-month metric if Wiki Education students are a significant part of the bump.
[17:46:51] ragesoss: the monthly number of new editors (1 edit in the first 30 days) on enwiki is about 1700
[17:47:29] neilpquinn: monthly is that low?
[17:47:31] yikes
[17:48:21] and the monthly number of second month editors (new editors also making 1 edit in their second 30 days) is about 100, spiking to about 175 during these Jan/Sep spikes
[17:48:28] wait, hold on
[17:49:37] maybe that's per day?
[17:51:44] ragesoss: oh, yeah, that's per day, good catch
[17:52:30] cool. I'm pretty sure those bumps are mostly Wiki Education students.
[17:52:46] so, every month, there are roughly 50,000 new editors and 3,000 second-month editors, spiking to 5,200
[17:53:14] if you feel like wading through spaghetti code, my analysis is at https://github.com/wikimedia-research/2018-01-new-editor-retention-analysis/blob/master/analysis.ipynb
[17:53:19] ;)
[17:53:27] awesome, thanks.
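Editor's sketch of the metric as defined in the exchange above: a "new editor" makes at least one edit in the first 30 days after registration, and a "second-month editor" is a new editor who also edits in days 30 to 60. The function names, the input shape (a registration time plus a list of edit times), and the half-open window boundaries are assumptions made for illustration; this is not the code from the linked analysis notebook.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)  # window length taken from the definition in the chat


def is_new_editor(registered, edit_times):
    """At least one edit in [registration, registration + 30 days)."""
    return any(registered <= t < registered + WINDOW for t in edit_times)


def is_second_month_editor(registered, edit_times):
    """A new editor who also edits in [registration + 30d, registration + 60d)."""
    in_second_window = any(
        registered + WINDOW <= t < registered + 2 * WINDOW for t in edit_times
    )
    return is_new_editor(registered, edit_times) and in_second_window


def retention_rate(users):
    """users: iterable of (registration_time, [edit_times]) pairs (assumed shape).

    Returns second-month editors divided by new editors, or None when the
    cohort has no new editors.
    """
    cohort = [(reg, edits) for reg, edits in users if is_new_editor(reg, edits)]
    if not cohort:
        return None
    retained = [u for u in cohort if is_second_month_editor(*u)]
    return len(retained) / len(cohort)
```

With the rough monthly figures quoted above (about 50,000 new editors and about 3,000 second-month editors), this ratio would be on the order of 6%, rising to roughly 10% in the spike months.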
[17:54:23] ragesoss: yeah, it seems pretty plausible given the numbers involved. I would like to verify that, but unfortunately, I don't have much time for it right now
[17:59:25] neilpquinn ragesoss : btw https://phabricator.wikimedia.org/T188070 is still open ;)
[18:19:04] ^ cool read :)
[18:19:40] <3 to neilpquinn and ragesoss for trying to make sense.
[18:19:53] ragesoss, you could calculate this with the labsdbs
[18:20:26] See the "SQL" on the right side of https://meta.wikimedia.org/wiki/Research:Surviving_new_editor
[18:23:05] halfak: is `censored` something that is censored, or just an alias for something that doesn't get used anywhere else in the SQL query?
[18:28:18] "censored" in timeseries data means that there isn't enough information
[18:28:33] E.g. if you're looking for activity within 6 months from an editor who only registered 4 months ago
[18:28:42] ah, cool.
[18:35:19] Analytics, Analytics-Kanban, Analytics-Wikistats, Google-Summer-of-Code (2018): Proposal: [Analytics] Improvements to Wikistats2 front-end - https://phabricator.wikimedia.org/T190949#4258718 (srishakatux) Open>declined
[18:54:46] ottomata, still around? Was testing EL Sanitization, and it seems to work!!! :D However, I came to a question: If we fully purge a struct field, hive keeps its structure but nullifies all fields in it. If we fully purge a map field, we have to do it by hand, so we can choose:
[18:55:07] either keep only the whitelisted subfields of the map
[18:55:20] or keep all fields in the map, and nullify their values
[18:55:33] much like hive handles struct = null
[19:00:45] hmmm...
I'm thinking now that in the case of maps, the mere presence of a field might add entropy to a record
[19:01:50] in the case of struct fields, that is not possible, because the schema is specified somewhere public
[19:01:58] hmm
[19:02:15] but with maps, I think we better only output whitelisted subfields
[19:02:22] mforns: the reason there are maps in the data at all ( at least currently) is just an implementation problem
[19:02:34] if we were better and didn't have to deal with capsules or other crap
[19:02:37] it'd be structs
[19:02:41] but
[19:02:46] i think it might be best to do the same for both?
[19:03:00] because from the user's perspective right now, there isn't a difference between a JSON map and JSON struct
[19:03:02] I was leaning towards that, but...
[19:03:11] so to them the events look the same, and are emitted the same
[19:03:24] well, the difference is in the schema
[19:03:30] halfak: I'm trying to run it for January 2018. Any guess how long it will take? I'm guessing it's going to get killed before it finishes on quarry.
[19:03:59] Oh! ragesoss one other thing. Switch out "revision" for "revision_userindex"
[19:04:04] and the same for "archive"
[19:04:06] mforns: when you say 'schema' are you referring to the eventlogging jsonschema schema, or the inferred hive table schema?
[19:04:11] although, EL users do not control how their nested structures are stored..
[19:04:19] EL jsonschema
[19:04:25] ragesoss, That'll make it a lot faster
[19:04:34] hm
[19:04:39] Those queries were written for the internal analytics slaves. :|
[19:04:41] aye, json/jsonschema doesn't have a difference between map and struct
[19:04:51] halfak: "same for archive" as in archive_userindex, or switch archive to revision_userindex?
[19:05:28] the geodata thing is really only a map because that is how the refinery source code returns it
[19:06:11] mforns: so, if easy, i'd keep the map fields and just null them
[19:06:23] so the behavior of the purging for the data is similar and familiar to users of both
[19:06:47] but ottomata, imagine an EL schema has a map field that has a subfield is_man, that is only there when true. If we keep the fields, even when nullified, the entropy of that field is still there
[19:07:17] so even after purging it, it would still tell you the user's gender
[19:07:58] the thing is, by definition, maps can vary their contents
[19:08:12] if a field being or not being there has entropy
[19:08:25] then we can not sanitize it, if we keep the subfields
[19:09:22] right, but at the moment, there is no way for users to submit map fields
[19:09:30] they only come from internal transform functions
[19:09:55] If this purging code works for other data systems though...then yeah
[19:10:51] and mforns
[19:10:59] hm
[19:11:06] maybe the code could support both
[19:11:18] hmm
[19:11:25] i guess it already does
[19:11:29] ottomata, you know what happens if in hql you query for map['field'] and 'field' is not there? does it break?
[19:11:33] or just returns null?
[19:11:39] i'd assume it returns null
[19:11:42] since it is a map...
[19:11:43] hmmm
[19:11:46] will test that
[19:11:46] but not certain lets try!
[19:11:50] ...
[19:11:51] :)
[19:11:52] if so, I think there's no problem
[19:11:54] right
[19:12:01] because then it'd be the same to users either way
[19:12:03] so, you'd do it like
[19:12:05] yea
[19:12:24] halfak: looks like 'archive_userindex' is what you meant (as that exists). carry on . :)
[19:12:25] if the whole map_field is blacklisted, you'd make it an empty map?
[19:12:29] maybe if the whole map field is NULL, then trying to access it breaks the query...
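Editor's sketch of the two purging options for map fields weighed above, using plain Python dicts as a stand-in for Hive maps. The function names and the `is_man`/`country` keys are hypothetical, and this is not the refinery implementation. The missing-key lookup here uses `dict.get`, which returns `None`; the chat only assumes, without having tested it yet, that the equivalent lookup in HQL returns NULL.

```python
def nullify_map(field):
    """Option A: keep every key but null its value (mirrors a purged struct).

    The keys themselves survive, so a key that is only present when true
    still leaks information after purging.
    """
    return {key: None for key in (field or {})}


def keep_whitelisted(field, whitelist):
    """Option B: keep only whitelisted keys and drop everything else.

    A missing or fully blacklisted map becomes an empty map rather than
    None, so later key lookups keep working.
    """
    if field is None:
        return {}
    return {key: value for key, value in field.items() if key in whitelist}


# Hypothetical event map: "is_man" is only present when true.
event_map = {"is_man": True, "country": "ES"}

leaky = nullify_map(event_map)                  # "is_man" key is still visible
clean = keep_whitelisted(event_map, {"country"})  # "is_man" is gone entirely
purged = keep_whitelisted(event_map, set())     # whole map blacklisted: {}
```

Under option B a reader who looks up a purged key gets a null either way, which is why the two approaches would look the same to users as long as missing-key lookups do not raise an error.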
[19:12:37] yeah that probably would
[19:12:42] currently it's a NULL yea
[19:12:53] I can modify it to be an empty map
[19:13:08] right, I think I'll go with that, if you're OK
[19:14:11] yeah that makes sense to me
[19:14:23] cool, thanks :]
[19:22:51] Analytics, Analytics-Cluster, Services (doing): Move EventStreams to new jumbo cluster. - https://phabricator.wikimedia.org/T185225#4258910 (Nuria) Maybe worth checking varnish throttle? https://github.com/varnish/varnish-modules/blob/master/src/vmod_vsthrottle.vcc I think it implements a leaky buck...
[19:26:17] ottomata: where on varnish do we specify the pass to event streams node backend?
[19:38:06] Analytics, Analytics-Cluster, Services (doing): Move EventStreams to new jumbo cluster. - https://phabricator.wikimedia.org/T185225#4258949 (Nuria) I think our varnish code limits are per IP, in this case that would not work as I believe we would need "service limits" for the eventstreams endpoint fo...
[19:40:22] neilpquinn halfak: I couldn't get the SQL query to run (it just returned 1 row), but I got numbers from the dashboard.wikiedu.org DB: I ran the count of January-registered users who made an edit within 30 days of registration *and* an edit between 30 and 60 days after registration, and I got 2134. That lines up neatly with the spike from a 3000 per month baseline to 5200 for the January cohort.
[19:41:21] Perfect. most likely, the only difference is that the query I gave you uses the archive table to look for edits to deleted pages.
[19:41:35] Most active people edit pages that don't get deleted, so it only affects the result a little bit.
[19:41:51] s/perfect/good-enough/ :)
[19:42:04] halfak: also, unless they are deleted *very* quickly, the dashboard DB will have those edits.
[19:42:16] Even better
[19:44:07] (it also might undercount based on namespaces, since the dashboard only imports revisions for some of them, but I would guess that few to none of our student editors made edits *only* in namespaces we don'
[19:44:09] t track)
[20:01:43] +1
[20:02:26] looks like superset is getting more and more support :) http://superset.incubator.apache.org/
[20:21:25] nuria_: sorry! missed your ping before
[20:22:35] https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/cache/misc.yaml#L302-L307
[20:23:18] joal: whatcha mean?
[20:25:50] (PS3) Milimetric: Adjust date formatting in the hover box [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434557 (https://phabricator.wikimedia.org/T194430)
[20:38:54] (CR) jerkins-bot: [V: -1] Adjust date formatting in the hover box [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434557 (https://phabricator.wikimedia.org/T194430) (owner: Milimetric)
[21:44:40] (PS3) Mforns: Allow partial whitelisting of map fields [analytics/refinery/source] - https://gerrit.wikimedia.org/r/437269 (https://phabricator.wikimedia.org/T193176)
[21:48:05] Analytics, Analytics-Kanban, Patch-For-Review: Productionize EventLogging sanitization - https://phabricator.wikimedia.org/T193176#4259381 (mforns) OK, this looks ready to review and merge if appropriate. I tested with real data, and looks good :D