[00:21:22] Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041#4230465 (CristianCantoro) I have run other tests and they took between 8 and 9.5 h using between 34GB and 36.5GB on a single machine with 8 cores. Also, I limited the problem with the data from `2007-12-10` to a few...
[00:46:38] Analytics, Analytics-Kanban, Product-Analytics: Partially purge MobileWikiAppiOSUserHistory eventlogging schema - https://phabricator.wikimedia.org/T195269#4230482 (chelsyx) Thanks @mforns ! >>! In T195269#4229044, @mforns wrote: > Maybe I'd advise to also purge the OS minor version, because it star...
[00:53:35] (CR) Milimetric: "This is ok by me, but it might be worthwhile to read through the Semantic layout grid and segment documentation to see if there isn't a wa" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672) (owner: Sahil505)
[06:09:13] (PS2) Sahil505: Upgraded footer UI/design [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672)
[06:14:58] (CR) Sahil505: "I tried it with the semantic classes but it was giving me quite some trouble to produce the responsive design which is all mobile compatible. 
A" (3 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672) (owner: Sahil505)
[06:53:13] !log re-run webrequest-load-wf-upload-2018-5-24-23 and webrequest-load-wf-text-2018-5-25-4
[06:53:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:00:22] (CR) Elukey: [V: 2 C: 2] Index some fields from isp_data and geocoded_data in Druid's webrequest [analytics/refinery] - https://gerrit.wikimedia.org/r/433597 (https://phabricator.wikimedia.org/T194055) (owner: Elukey)
[07:10:20] * elukey goes afk again :)
[07:12:14] milimetric: 1hourBye elukey :)
[07:12:26] mwarf
[07:12:52] addshore: let's talk about your data research when you're up :)
[07:47:48] joal: yay!
[07:47:53] Not quite up yet ;)
[07:48:02] addshore: coffeeeeeeeeee !
[08:08:34] Okay, I think I am awake now!
[08:10:30] Hi addshore :)
[08:10:45] Good morning!
[08:10:49] addshore: have you found the results you wanted with Hive?
[08:11:25] Well, the query gave me results :D Although I'm not sure if it was done in the best way
[08:12:15] addshore: I wanted to try and see if something similar could have been extracted from turnilo or superset (basically using Druid - a lot cheaper computationally than Hadoop)
[08:14:42] addshore: webrequest in Druid is sampled 1/128, but for high level traffic, it could do the job I think
[08:14:59] okay
[08:15:16] so the hive query itself ended up not showing any dramatic increase in the number of requests in the hour that I was looking at
[08:15:53] addshore: I have a similar result from turnilo :)
[08:16:10] addshore: do you want me to give you a quick tour of the thing, and discuss usage?
[08:16:50] Oooooh, yes please!
[08:17:57] addshore: https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave
[10:38:51] Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041#4231124 (CristianCantoro) So, I am basically writing another script that does not use Spark but simply processes the data in a streaming fashion (the basic idea of the algorithm is: take one day worth of data, sort...
[11:06:06] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Present a page view metric description to the user that they are likely to understand - https://phabricator.wikimedia.org/T182109#4231138 (sahil505)
[11:08:46] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Hide "Load more rows..." once all the data is visible in Table Chart - https://phabricator.wikimedia.org/T192407#4231142 (sahil505)
[11:14:50] Hi miriam
[11:14:53] :)
[11:15:33] hi joal :)
[11:15:52] So how may I help
[11:16:31] You said you'd like to try and follow users through their 'sessions' on wikis, for those specific users having clicked on banners
[11:16:37] Have I understood that correctly?
[11:17:17] Yes! after a conversation yesterday with fundraising, we were wondering whether the easiest way to sample by user in a/b tests is to look at web request logs.
[11:17:54] * Seddon follows
[11:18:15] Hi **wild** Seddon :)
[11:18:38] not sure about what you mean by "sample by user"
[11:19:25] for example: can I know which X users (unique ids from the logs) have accessed one page in wikivoyage?
[11:20:12] and then see how long their session was? i.e. how much time they spent on the wiki and how many clicks (requests) they did?
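A minimal sketch of the session question above (session length and number of requests per visitor), assuming the request timestamps for one visitor have already been grouped under some per-visitor key. The `sessionize` helper and the 30-minute inactivity cutoff are illustrative assumptions, not something stated in the channel or an existing Analytics tool.

```python
from datetime import datetime, timedelta

# Assumed inactivity cutoff: a gap longer than this starts a new session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(timestamps, gap=SESSION_GAP):
    """Split one visitor's request timestamps into sessions.

    Returns a list of (duration, request_count) tuples, one per session.
    """
    sessions = []
    start = prev = None
    count = 0
    for ts in sorted(timestamps):
        if prev is not None and ts - prev > gap:
            # Inactivity gap exceeded: close the current session.
            sessions.append((prev - start, count))
            start, count = ts, 0
        if start is None:
            start = ts
        prev = ts
        count += 1
    if count:
        sessions.append((prev - start, count))
    return sessions


# Hypothetical requests from a single visitor.
visits = [
    datetime(2018, 5, 25, 10, 0),
    datetime(2018, 5, 25, 10, 5),
    datetime(2018, 5, 25, 10, 20),
    datetime(2018, 5, 25, 12, 0),
]
result = sessionize(visits)
for duration, n_requests in result:
    print(duration, n_requests)
```

The cutoff is the main design choice here: a shorter gap splits one long reading visit into several sessions, a longer one merges unrelated visits.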
[11:20:23] Seddon am I asking the right thing :)
[11:20:30] miriam: to some extent you can know that, but with some level of imprecision
[11:21:17] miriam: We don't "track" users in webrequest logs - Meaning we have no unique ID identifying sessions
[11:21:21] miriam: That's pretty much what we spoke about yeah
[11:21:27] ok, thanks joal! Is the imprecision caused by sampling or by the unique identification?
[11:21:49] miriam: So, with fingerprinting, we can get some info of "session"
[11:22:14] miriam: webrequest log is not sampled, we have everything in there
[11:22:28] oh wow nice :)
[11:22:42] miriam: downside is, it's relatively big :)
[11:22:51] joal: fingerprinting = hashing ip/agent ?
[11:23:08] miriam: correct - we can add some more to that to try to be more precise
[11:24:06] miriam: major downside of this approach - We know it doesn't work for mobile web, since agents are very similar and operators do IP-pools for multiple devices
[11:24:31] miriam: And for desktop, it leads to relatively correct (we assume) results, but with some level of imprecision
[11:24:49] joal: thanks. Seddon: what you wanted to test was a new mobile landing page, correct?
[11:24:50] joal: is there a better means of doing this on mobile? such as via tracking by cookie?
[11:24:57] miriam: correct
[11:25:43] Seddon: there are plenty of ways to do that, cookies being one of them - However the foundation explicitly says it doesn't track users ...
[11:26:12] one clarification about mobile: mobile web, right - not the apps?
[11:26:20] mobile web
[11:26:24] ok indeed
[11:28:52] joal: Actually our privacy policy explicitly states we do track users for these kinds of purposes
[11:29:18] and explicitly for mobile too
[11:29:33] joal: if we fingerprint mobile web logs with ip and agent, would we have multiple users mapped to the same id?
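A minimal sketch of the fingerprinting idea discussed above (hashing IP and user agent), showing how two mobile devices behind one carrier IP can end up with the same id. The `fingerprint` helper, the use of SHA-256, and the sample requests are assumptions for illustration only; the channel does not say which hash or fields the team actually uses.

```python
import hashlib


def fingerprint(ip: str, user_agent: str) -> str:
    """Hash IP and user agent into an opaque pseudo-identifier."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    raw = "\x1f".join((ip, user_agent))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


IPHONE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) Safari/604.1"

# Hypothetical requests: (device label, ip, user agent). The labels are only
# known because this data is synthetic; real logs carry no such label.
requests = [
    ("desktop-1", "198.51.100.7", "Mozilla/5.0 (X11; Linux x86_64) Firefox/60.0"),
    ("phone-A", "203.0.113.10", IPHONE_UA),
    ("phone-B", "203.0.113.10", IPHONE_UA),  # same carrier IP, same agent
]

devices_per_fingerprint = {}
for device, ip, ua in requests:
    devices_per_fingerprint.setdefault(fingerprint(ip, ua), set()).add(device)

# Fingerprints shared by more than one real device.
collisions = [sorted(d) for d in devices_per_fingerprint.values() if len(d) > 1]
print(collisions)
```

Adding more fields to the hashed tuple narrows each bucket, but it cannot separate devices that send identical values for every field.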
[11:29:37] also Seddon and miriam, I know nuria_ has been involved in A/B testing discussions (problems of statistical soundness). It would be interesting to have her views
[11:30:47] miriam: yes - For dense areas and generic smartphones, since operators use the same IP for multiple mobiles, you'd easily end up with many iphoneY in the same bucket for an IP located in New York for instance
[11:31:06] Seddon: `this kind of purpose` - Can you be more precise?
[11:33:23] "How We Use Information We Receive From You? > Improving the Wikimedia Sites and making your user experience safer and better > For research and analytics."
[11:33:37] "How We Use Information We Receive From You? > Improving the Wikimedia Sites and making your user experience safer and better > To optimize mobile and other applications."
[11:34:02] oh, yeah, would be great to hear nuria_'s thoughts
[11:34:09] The only thing we say we will never do is use third party cookies
[11:35:16] Seddon: This problem of `cookification` of our users has been discussed a lot
[11:35:40] I can certainly imagine
[11:35:43] Seddon: analytics-team has been pushing on the side of trying to not have them
[11:35:58] Is there a better option for mobile web?
[11:36:56] Seddon: I can't easily think of one - We could however restrict the cookification to only those users we're interested in, and for only the session that follows a banner click for instance
[11:37:31] Like that the cookie is set for a single purpose and prevents having a lot of info on a lot of users
[11:38:16] miriam: would it work if we simply assigned "bucket" id's?
[11:38:42] Actually Seddon - Even better - Use of events through eventlogging
[11:38:47] and miriam --^
[11:39:57] Seddon: i was thinking we could approximate with bucket ids - but I need to check statistical implications of that
[11:40:31] miriam: for statistical soundness of A/B testing, I really think you should talk with nuria_ - She's put a good deal of thought into that lately
[11:41:02] joal: the only issue with eventlogging is that we are measuring the impact of content (a site main page) rather than the impact of a banner or software. So there isn't a specific action being tracked.
[11:41:22] I'm wanting to measure general reader engagement and retention
[11:41:32] joal: I'll definitely ping nuria_ about it!
[11:42:43] Seddon: I could imagine funky ways to do that - We put a cookie when the user sees the banner (but we don't log it) - And the frontend sends events to eventlogging when the cookie is set.
[11:43:03] miriam: I think she's traveling today
[11:45:07] joal: ok, i'll drop her an email, it's anyway interesting to hear her thoughts
[11:45:45] Seddon, miriam: She'd also be the one having more context on what has already been discussed
[11:47:00] joal: not super familiar with eventlogging - if you have any handy pointers to that could you please share?
[11:47:26] miriam: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging
[11:49:02] oh wow, thanks joal! I am checking
[11:51:06] Seddon, should we organize a call with nuria_ to discuss this? And joal, if you'd like to join?
[11:51:42] Please invite me miriam :)
[11:55:42] great, I'll send email, thanks joal! Super helpful!!
[11:55:47] miriam: makes sense!
[11:56:28] you're welcome miriam :)
[11:56:38] talk soon miriam and Seddon :)
[11:58:43] miriam
[11:58:46] thank you!
[11:59:19] Seddon: you are very welcome!
[12:06:27] miriam: agreed that we would lose a lot of statistical information if we could only monitor whole populations rather than individuals.
[12:59:59] (PS2) Joal: Update mediawiki-history stats [analytics/refinery/source] - https://gerrit.wikimedia.org/r/434987 (https://phabricator.wikimedia.org/T192481)
[13:33:16] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231434 (Psychoslave) >>! In T193728#4204019, @TomT0m wrote: > The more I personally dig into these questions, the more issues are opened and the less clear it becomes that there is an...
[13:35:32] (PS1) Joal: Sqoop script had duplicated short parameters [analytics/refinery] - https://gerrit.wikimedia.org/r/435169
[14:01:44] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231541 (Psychoslave) >>! In T193728#4204771, @Denny wrote: > @Psychoslave sorry to disagree on the questions, but are we in any disagreement on these three questions? > > We should...
[14:07:51] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231560 (Psychoslave) >>! In T193728#4204779, @Denny wrote: > @Nemo_bis thanks, I agree with your point a lot. > > But regarding your question - just because there is a database whic...
[14:20:43] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231596 (Nemo_bis) > We should first agree that the problem is really about "substantial transfer of data" It's not.
[14:32:22] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231633 (Psychoslave) >>! In T193728#4205401, @Denny wrote: > @Rspeer regarding the ontology: the ontology of Wikidata is genuinely unique and not copied from any Wikipedia project, o...
[14:33:14] * leila waves to people while she prepares coffee.
[14:33:36] ciao leila :)
[14:33:46] miriam: you mentioned the issue with telling unique devices apart on mobile and using webrequest logs.
[14:34:41] miriam: I don't know the accuracy that fundraising needs, but in Why We Read Wikipedia, we do that and we catch almost all mobile responses (we have to match between webrequest logs and eventlogging data)
[14:35:31] miriam: try it with ip, user_agent, referrer, and if you add browser language.
[14:35:37] ciao miriam ;)
[14:37:01] leila: would using referrer work if you want to track across multiple sessions?
[14:37:31] I wouldn't have thought so
[14:37:47] Seddon: over how long a period?
[14:38:33] days probably
[14:39:19] at least for this test though weeks for future tests
[14:40:14] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231659 (TomT0m) > Well, entries that were not created thanks to massive import from Wikipedia obviously don't raise any concern of infringement of Wikipedia community copyright. It d...
[14:40:27] Seddon: I can't comment on weeks, but we have done it for 1 week. The longer the period, I suspect, the more errors take place.
[14:40:28] leila: it would be great to understand whether people who have been exposed to the new landing page are more likely to come back than people who have seen the first landing page
[14:40:50] first <-> original
[14:41:14] miriam: for that, you need more than a week then. since these user behavior changes can take time. I agree with you that then cookies may be the right solution.
[14:41:36] miriam: I'll still connect you with Florian if you want to check with him about the week long data.
[14:42:40] leila: it is however interesting you managed to identify unique devices, it might be useful if they decide to look at conversions and dwell-time only, and measure retention in a second experiment
[14:42:59] leila: did you do any evaluation to check that these are actually unique devices?
[14:43:00] miriam: yup. agreed.
[14:44:19] miriam: in our case, the evaluation is if we can catch almost all responses that are gathered in eventLogging when linking to webrequest logs. For those responses, we log IP and userAgent (maybe hashed), and we could do almost perfect matching, ~95%
[14:44:26] * leila goes to breakfast and will be back in 10.
[14:46:00] ,
[14:46:08] leila: got it! I missed that nit above :) thanks! enjoy your breakfast!
[14:54:38] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231690 (Psychoslave) >>! In T193728#4212631, @Denny wrote: > @Rspeer > But even ignoring that, Wikidata does *not* store the same expression anyway. So what exactly is the copyright...
[15:10:04] miriam: just in case you need me, I'm around.
[15:29:49] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231813 (Psychoslave) >>! In T193728#4214437, @MisterSynergy wrote: > If any of those happened (or had to happen), I’d be out here and I guess many other Wikidata editors would also d...
[15:53:02] (CR) Mforns: "AMAAAZING!!!!! The way it keeps state and all options while navigating is great. So big of an improvement! Awesome." (1 comment) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434500 (https://phabricator.wikimedia.org/T179444) (owner: Milimetric)
[15:57:24] (CR) Mforns: "Maybe if we went with queryString format, we could have default values for detail-states, such as scr=normal or break=none. This way, in t" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434500 (https://phabricator.wikimedia.org/T179444) (owner: Milimetric)
[15:58:38] miriam: you still about?
[15:58:39] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231920 (Psychoslave) >>! 
In T193728#4228561, @Micru wrote: > > In a way Wikipedia already has a "contribute-alike" agreement, it is just not explicit, but tacit. Users come to the s...
[15:58:51] hello Seddon! still here :)
[15:59:10] (CR) Mforns: "Another idea is to calculate the 2-Year, 1-Year, 3-Month, etc from timestamps, so we do not have to specify that in the URL. Also maybe re" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434500 (https://phabricator.wikimedia.org/T179444) (owner: Milimetric)
[16:00:56] fdans, get up stand up
[16:03:49] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4231932 (MisterSynergy) >>! In T193728#4231813, @Psychoslave wrote: >>>! In T193728#4214437, @MisterSynergy wrote: >> If any of those happened (or had to happen), I’d be out here and...
[16:24:42] leila: is the process you used to validate/track ids documented somewhere?
[16:36:43] miriam: all code is in Github. let me dig.
[16:39:01] thanks, not urgent, you can just send it via email whenever you are done!
[16:56:42] miriam: the old code, which we're reusing for the most part is at https://github.com/ewulczyn/wiki-readers/tree/master/src/data_generation
[17:48:00] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4232253 (ArthurPSmith) Some references on why CC0 is essential for a free public database: https://wiki.creativecommons.org/wiki/CC0_use_for_data "Databases may contain facts that, in...
[18:12:33] Analytics, Beta-Cluster-Infrastructure, Puppet: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4232297 (Krenair)
[20:51:15] Hello a-team, is it possible to get number of page views per "action URLs", for example is it possible to find out number of page views on https://en.wikipedia.org/w/index.php?title=Wikipedia:Username_policy&action=history?
[22:40:35] Analytics, Analytics-Wikistats, Domains, Operations, Traffic: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4233112 (Dzahn)
[22:45:10] Analytics, Analytics-Wikistats, Domains, Operations, Traffic: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4233129 (Dzahn) option a) delete stats record from the wikipedia.org zone option b) add stats.wikipedia.org to hieradata/role/common/cach...
[23:51:55] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4233239 (Micru) >>! In T193728#4231920, @Psychoslave wrote: > From what I understand, you are describing the "same condition" which is expressed by the SA in the CC-BY-SA covering Wik...