[01:00:50] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2440397 (Ottomata) [08:12:40] Analytics, Community-Liaisons (Jul-Sep-2016), User-Johan: Collect information about how we collect user statistics in one place - https://phabricator.wikimedia.org/T132405#2440792 (Qgil) p:Normal>Low [08:27:20] morning joal ! [08:27:40] Hi addshore [08:27:41] :) [08:28:52] ready to give it another shot? :) [08:29:44] addshore: Indeed :) [08:30:00] addshore: I think we have it solved (meaning within a few tries it should be ok :) [08:31:03] great! [08:31:16] addshore: You recall the issue of giving spark the additional jars and files for hive settings [08:31:27] addshore: In oozie, those jars are available by default ! [08:31:28] yup [08:31:40] okay, so I can remove those from the command! [08:31:43] addshore: And the hive file, we have it stored on HDFS [08:31:47] the jar only [08:32:08] addshore: the --files hive is still needed, but the file path we will update [08:34:29] hmm, okay [08:34:57] addshore: the hive-site we will use is: ${refinery_directory}/oozie/util/hive/hive-site.xml [08:35:01] so I am removing the jars stuff from refinery oozie files? [08:35:13] addshore: yes, you can [08:35:23] addshore: You see now why we say oozie is hell [08:35:40] addshore: anything to add/remove is in 3 files at least, written in verbose XML etc [08:35:47] addshore: :( [08:36:06] :D [08:36:26] addshore: For scheduling the thing is great, but coding wise ... [08:38:11] (PS10) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407 (https://phabricator.wikimedia.org/T138500) [08:38:28] okay, so those all look updated! *logs into stat1002* [08:40:42] addshore: You're too fast !
[08:40:47] addshore: There is another thing [08:40:51] :D [08:41:13] addshore: We need to explicitly set parameters for dynamic allocation in the spark opts [08:41:46] addshore: --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true [08:41:58] addshore: And I think with that you're all set and you can test :) [08:42:11] addshore: actually, maybe it's me being slow ;) [08:42:25] so that's in workflow.xml? [08:43:17] (PS11) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407 (https://phabricator.wikimedia.org/T138500) [08:43:55] addshore: yes, in spark-opts [08:44:17] okay, all ready to run again! [08:44:27] addshore: waiting for your coordinator ID :) [08:45:37] *has run the command* also waiting for the id [08:46:08] addshore: CLI has given it to you, no ? [08:46:24] oh yeh! :D 0012350-160630131625562-oozie-oozi-C [08:46:27] :) [08:46:35] thx [08:47:00] java.io.FileNotFoundException: File file:/var/lib/hadoop/data/g/yarn/local/usercache/addshore/appcache/application_1467197794735_26441/container_e25_1467197794735_26441_01_000003/= does not exist [08:47:17] Maaaaaan, I don't get that !!! [08:47:35] Can you give me the command you run in a gist or so ? [08:47:59] https://www.irccloud.com/pastebin/CXmvTQPw/ [08:49:07] addshore: One detail (not the one causing the failure, but interesting for stopping a test): it's not end_time but stop_time (I think I made a mistake yesterday) [08:50:14] ahh yes! [08:50:31] addshore: 0012355-160630131625562-oozie-oozi-C [08:50:53] oooh, orange! [08:51:21] addshore: nope, same error [08:51:27] oh :/ [08:51:29] addshore: wow ... [08:51:37] addshore: That is very, very weird [09:01:11] addshore: I got it ! [09:01:18] addshore: Or at least I think :) [09:03:06] :D [09:04:09] addshore: two things: in the command you pasted, spark_job_jar has spaces between = and values - For bash, that doesn't work [09:04:43] oh balls.... yes!
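Pulling together the pieces joal dictates above, the Oozie Spark action ends up looking roughly like this. This is a sketch from memory of the Oozie spark-action layout, not the actual refinery workflow.xml; the class name and property names are placeholders:

```xml
<!-- Illustrative sketch only: element layout per the Oozie spark action,
     but class/name values here are made up, not the real refinery job. -->
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${job_tracker}</job-tracker>
    <name-node>${name_node}</name-node>
    <master>yarn-cluster</master>
    <name>wikidata-articleplaceholder-metrics</name>
    <class>org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics</class>
    <jar>${spark_job_jar}</jar>
    <!-- The hive-site shipped via --files comes from HDFS, and the
         dynamic allocation confs are the ones discussed above. -->
    <spark-opts>--files ${refinery_directory}/oozie/util/hive/hive-site.xml --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true</spark-opts>
</spark>
```

As noted in the chat, the extra Hive jars that a bare spark-submit would need via --jars are already on the classpath when Oozie launches the action, which is why they could be dropped from the workflow.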
[09:04:53] addshore: Second thing, since the path is in hdfs, it should be specified (actually fully specified): -Dspark_job_jar=hdfs://analytics-hadoop/user/addshore/refinery-job-0.0.32-SNAPSHOT.jar [09:05:24] addshore: 0012364-160630131625562-oozie-oozi-C [09:05:42] it's green! [09:05:48] addshore: That also means the job would have worked in prod ;) [09:06:09] addshore: I don't see it green yet, but this too will come to me ;) [09:06:26] :D [09:06:40] Woooooooo!! [09:07:16] addshore: Congratulations, you have tamed one of our big dragons :) [09:10:18] addshore: I will triple check the CR, then ask ottomata or nuria to have a review as well (not for functional, but for naming etc) [09:10:19] so now to get it all merged and running? ;) [09:10:23] awesome! [09:10:47] addshore: Deploy is as nice to us as oozie coding is (kind of) [09:10:48] Right I have a meeting in 20 mins and will be around for the rest of the day but otherwise see you on monday! [09:11:08] addshore: No problem, I'll ask the guys if they can do it today [09:11:22] addshore: Then deploy, I think it will happen next week [09:11:33] awesome! [09:11:43] early next week (probably Tuesday or Wednesday) [09:12:18] addshore: I'm really happy you took some time to learn and make the things work using our tools :) [09:12:37] yup! and I will likely have to try again sometime soon, so I'll see how that goes! [09:12:44] or do the api one for bd808 ! [09:12:52] addshore: Right :) [09:13:20] addshore: One thing to always keep in mind before even starting to automate is: Will those things fit from a datasize / runtime perspective [09:13:53] hmm, how do you mean? [09:14:03] addshore: Cause if you build jobs that should be running hourly but take 3 hours to run and stall the cluster, that doesn't work :) [09:15:05] ahh, yes, indeed!
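The "spaces between = and values" bug joal spotted is easy to see with a quick tokenization check. Python's shlex splits a string roughly the way a POSIX shell does, which is enough to show why the broken command produced that odd `=` file-not-found error (the jar path is the one from the chat; the demo itself is illustrative):

```python
import shlex

# With no spaces around '=', the property stays one token that
# the oozie CLI / Java can parse as -Dkey=value.
good = shlex.split(
    "oozie job -Dspark_job_jar="
    "hdfs://analytics-hadoop/user/addshore/refinery-job-0.0.32-SNAPSHOT.jar"
)

# With spaces, the shell hands over three unrelated tokens:
# '-Dspark_job_jar', '=', and the path. A stray '=' argument is exactly
# the kind of thing that surfaced as ".../container_.../= does not exist".
bad = shlex.split(
    "oozie job -Dspark_job_jar = "
    "hdfs://analytics-hadoop/user/addshore/refinery-job-0.0.32-SNAPSHOT.jar"
)

print(len(good), len(bad))  # 3 5
```

The second fix, using the fully qualified `hdfs://analytics-hadoop/...` scheme, matters because a bare path would otherwise be resolved against the local `file:` filesystem on the worker.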
[09:15:36] addshore: That was just an example, but I think it's important to keep in mind how much data and computation the things we do involve :) [09:15:47] addshore: Cause hadoop makes it very easy to forget it :) [09:17:53] So, I'll throw this idea at you too! A while back I was thinking of looking into loading a wikidata dump (so lots of bits of json) into hadoop to possibly query it [09:18:07] Analytics, Community-Tech, Pageviews-API, Tool-Labs-tools-Pageviews, I18n: Topviews in the Pageviews labs tool doesn't auto-exclude special pages with localized names - https://phabricator.wikimedia.org/T139725#2440845 (Amire80) [09:18:17] does something like that sound possible? I was struggling to think of partitions though :/ [09:28:04] addshore: loading a dump seems completely feasible [09:28:29] addshore: the complexity comes with actually doing stuff with it :) [09:29:03] addshore: The analytics cluster wouldn't be the best place to test/try/research on a dump [09:29:31] addshore: the Altiscale Internet-Archive cluster would be better [09:29:34] joal: o/ anything against me upgrading aqs100[23]? [09:29:47] elukey: o/ [09:29:53] elukey: please do :) [09:29:56] :) [09:30:23] addshore: Altiscale is a cluster-on-demand company, lending a cluster for research purposes [09:30:39] addshore: The person to talk to is halfak, the cluster manager for us :) [09:33:58] ooooohh [09:34:01] okay! cheers! [09:34:11] have a good day and a good weekend addshore [09:34:28] Going windsurfing later, it will be great! :D [09:34:34] Yay !!! [09:34:43] Take some waves for addshore [09:40:01] aqs upgraded :) [09:40:30] Great :) [09:43:21] Thanks elukey :) [09:45:16] elukey: My monitoring of cassandra makes me kinda unhappy [09:45:52] elukey: While compactions happen, the number of compaction tasks doesn't start to go down :( [09:46:09] elukey: So far, it seems the time we earn at loading, we lose at compacting [09:50:22] joal: maybe the sstables are not streamed in the right way?
I know it is pretty generic :) [09:50:42] elukey: I'll ask urandom [09:50:54] elukey: I think they might be too small, but I'm not sure [10:12:17] Analytics, Operations, Performance-Team, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2440940 (MoritzMuehlenhoff) p:Triage>Normal [11:10:33] * elukey lunch! [11:50:20] is there anybody that already sent the Berlin expence? [11:50:32] *expense report [12:03:03] I didn't get if I need to send the report to Nuria and then CC others or not [12:03:06] mmmm [12:08:56] ah no I've read a section but didn't see the yellow box [12:08:58] nevermind [12:17:15] joal: if you have time https://gerrit.wikimedia.org/r/#/c/297807/ :) [12:20:05] Analytics-Kanban, Patch-For-Review: Upgrade AQS node version to 4.4.6 - https://phabricator.wikimedia.org/T139493#2433953 (elukey) All the aqs nodes have been upgraded to node 4.4.6, last step is to merge Dan's code review to avoid scap failures. [12:20:30] Analytics, Analytics-Cluster, Analytics-Kanban, Deployment-Systems, and 2 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2441200 (elukey) [12:21:20] elukey: I can +2, but I think it's not wise to deploy on Friday, so maybe even +2 next Monday ? [12:22:09] sure :) [12:25:52] elukey: With a core component upgrade, I'd rather be safe and have a quiet weekend :) [12:25:56] elukey: Thanks for that :) [12:38:51] mforns: here? [12:41:13] joal: we had a quiet week, why don't we break something :P [12:41:52] elukey: hmmm, while I like to make you happy, I can't say yes to that one :D [12:42:00] hahaahhaha [12:58:17] joal, back [13:03:54] Analytics, Analytics-Cluster, Analytics-Kanban, Deployment-Systems, and 2 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2441318 (elukey) Asking some trivial questions since I am a bit ignorant about scap :) The analytics refinery (https://phabricator.wi...
[13:10:59] (PS3) Mforns: [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) [13:12:20] (CR) jenkins-bot: [V: -1] [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) (owner: Mforns) [13:14:31] (PS4) Mforns: [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) [13:15:03] Hey mforns, sorry, was away [13:15:08] (CR) jenkins-bot: [V: -1] [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) (owner: Mforns) [13:25:52] mforns: o/ [13:25:59] lol [13:26:08] (re: breaking stuff to make Luca happy) [13:26:11] milimetric: ;) [13:26:38] hey joal, congratulations to your very awesome team [13:27:15] Germany played better 90% of the first half and Les Bleus decided they were gonna win, so fun to watch [13:27:26] joal: I sent a gdrive notification for the Berlin pics [13:27:32] milimetric: Thanks ! Was fun indeed [13:27:37] would you mind downloading it and checking if it works? [13:28:12] milimetric: It's interesting how sometimes, less organised can still make more than very organised :) [13:29:07] elukey: downloading, so far so good (with my poor connection) [13:29:18] gooood! [13:32:24] Analytics: User History: Solve the fixed-point algorithm's long tail problem - https://phabricator.wikimedia.org/T139745#2441389 (mforns) [13:32:51] mforns: Marceeeeeel, please, talk to me ! :) [13:35:10] ottomata: haha, we sent like the same email at the same exact time :) [13:35:28] joal, !!!!! [13:35:29] sorry [13:35:35] huhu mforns :) [13:35:44] no headphones... [13:35:55] hi! what's up?
[13:35:56] no problem mforns, was just making fun of the two of us chasing each other :) [13:36:03] xDDD [13:36:08] just created a task for you [13:36:14] I think I have a generic fixed point :D [13:36:15] give you more work [13:36:20] oooooh [13:36:23] already? [13:36:37] * milimetric sits down around the campfire to hear tales of generic fixed points [13:36:42] xD [13:36:45] * joal is a bit ashamed to have worked on scala instead of oozie [13:37:14] I think basically no human being would ever blame you for choosing beautiful scala over ... oozie :) [13:37:14] * mforns is thinking of something to say about /me [13:37:20] huhu [13:37:46] batcave? [13:37:47] oh haha [13:37:54] mforns, milimetric : Yay, + c9.io [13:41:12] milimetric: coming to batcave? [13:41:45] oh, sure, omw [13:42:02] Analytics-Kanban: User History: Solve the fixed-point algorithm's long tail problem - https://phabricator.wikimedia.org/T139745#2441425 (JAllemandou) [13:42:50] Analytics-Kanban, Patch-For-Review: User History: write scala for user history reconstruction algorithm - https://phabricator.wikimedia.org/T138861#2441429 (mforns) I think the algorithm is in a solid stage now. The generated data makes sense so far (with the analysis I've done), and includes: - User cr... [13:44:44] joal: I'd like to merge https://gerrit.wikimedia.org/r/#/c/297980/2 right in about 30 mins, so all the vk in maps and misc will be restarted [13:45:44] elukey: no prob by me [13:45:49] super [13:46:37] ottomata: analytics1049 is still down because of another disk failure (https://phabricator.wikimedia.org/T137273). I tried to contact Chris during these days but he is probably busy, if you talk with him can you bring the phab task up? [13:46:45] (also o/) [13:48:25] oh man, hahah [13:48:26] ok ja [13:48:53] yeah and I also rebooted it during the last round of security upgrades [13:49:08] it doesn't come up :( [14:39:45] mforns, milimetric: Do we take a few minutes for schemas?
[14:40:02] sure [14:40:06] joal, ok [14:40:06] https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave-2 [14:40:12] Batcave is busy ! [14:40:16] https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave-2 [14:40:17] joal, ok [14:40:24] oh, man, lag [14:40:36] people, varnishkafka in misc and maps is getting the new config to filter out websockets upgrade requests [14:54:09] nice! [14:55:34] HaeB: yt? [14:55:51] hi, what's up? [15:00:02] vk restarts completed! [15:03:01] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2441605 (elukey) Websocket upgrades filtered out from Varnishkafka. The last step is to wait for https://gerrit.wikimedia.org/r/#/c/... [15:08:35] elukey: nice, now we wait.. :) [15:09:04] this should reduce some of the misc webrequest alert emails, ja? [15:09:17] ottomata: sadly we're going to get oozie alerts anyway, because of the release downloads that trigger VSL timeouts [15:09:24] yes definitely it should [15:17:26] query nuria_ [15:17:34] nope / missing :P [15:18:22] sorry on meeting [15:18:35] sure :) [15:28:28] Analytics, Research-and-Data, Research-management: Draft announcement for wikistats transition plan - https://phabricator.wikimedia.org/T128870#2441697 (DarTar) All announcements about the transition have been posted, closing this. [15:46:43] Analytics: User History: Add history of anonymous users? - https://phabricator.wikimedia.org/T139760#2441784 (mforns) [15:47:38] Analytics: User History: Populate the causedByUserId and causedByUserName fields in 'create' states. - https://phabricator.wikimedia.org/T139761#2441799 (mforns) [15:50:14] ottomata, nuria_ : If one of you has some time, it'd be great to have a merge or comments on those two CR (https://gerrit.wikimedia.org/r/#/c/297566/, https://gerrit.wikimedia.org/r/#/c/296407/) [15:50:49] (CR) Joal: [C: 1] "LGTM, Tested."
[analytics/refinery] - https://gerrit.wikimedia.org/r/296407 (https://phabricator.wikimedia.org/T138500) (owner: Addshore) [15:51:03] ottomata, nuria_ : Code has been tested with addshore, I +1 the oozie but prefer not to be the only one to have read it [15:51:14] looking [15:51:24] thanks ottomata :) [15:51:41] Analytics: User History: Create documentation for the user history page - https://phabricator.wikimedia.org/T139763#2441830 (mforns) [15:54:10] joal: this all looks totally fine to me [15:54:16] shall i merge? [15:54:24] ottomata: Please do :) [15:54:45] ottomata: I hope to have the Druid Loading stuff ready early next week, and we'll deploy all that together [15:54:58] (CR) Ottomata: [C: 2 V: 2] Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407 (https://phabricator.wikimedia.org/T138500) (owner: Addshore) [15:57:13] wow right before standup my internet has decided to get all crappy [15:58:05] ottomata: Arf :( [15:58:41] ottomata: sorry to bother, since you merged the oozie one, I'd need the scala one too :) [15:59:01] ottomata: https://gerrit.wikimedia.org/r/#/c/297566/ [16:00:53] (CR) Ottomata: [C: 2] Update WikidataArticlePlaceholderMetrics params [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297566 (owner: Joal) [16:00:57] done [16:00:58] joal: standduppp [16:01:00] Thanks mate [16:01:09] Joining [16:11:08] Analytics-Kanban, Patch-For-Review: Extract edit oriented data from MySQL for small wiki - https://phabricator.wikimedia.org/T134790#2441913 (mforns) [16:11:10] Analytics: User History: Create documentation for the user history page - https://phabricator.wikimedia.org/T139763#2441912 (mforns) [16:11:26] Analytics: User History: Populate the causedByUserId and causedByUserName fields in 'create' states.
- https://phabricator.wikimedia.org/T139761#2441916 (mforns) [16:11:28] Analytics-Kanban, Patch-For-Review: Extract edit oriented data from MySQL for small wiki - https://phabricator.wikimedia.org/T134790#2277147 (mforns) [16:11:39] Analytics: User History: Add history of anonymous users? - https://phabricator.wikimedia.org/T139760#2441918 (mforns) [16:11:41] Analytics-Kanban, Patch-For-Review: Extract edit oriented data from MySQL for small wiki - https://phabricator.wikimedia.org/T134790#2277147 (mforns) [16:18:02] Analytics-Kanban: Retention metric research - https://phabricator.wikimedia.org/T138611#2441940 (Nuria) a:Nuria [16:31:30] nuria_: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#High_Availability [16:36:19] milimetric, ready! [16:36:28] omw [16:36:31] I'm in da caiv [16:42:14] nuria_: about? [16:42:20] joining [17:00:43] milimetric: I am going to start working on the "lowercase project" pageview API issue, Ok? [17:13:13] nuria_: sure, but lemme check out the code, that's not done yet? [17:14:49] nuria_: ok, yeah, we kept forgetting to do that, sorry, here's where you'd put it: https://github.com/wikimedia/analytics-aqs/blob/master/lib/aqsUtil.js#L54 [17:46:31] milimetric: at the cassandra layer, rather than restbase? [17:48:16] oh... uh, I think we already normalize project names before they get inserted into Cassandra [17:48:36] I haven't seen the code myself but I think we do that when we refine [17:48:49] the problem is more people querying for EN.wikipedia.OrG or something right?
[17:56:28] sorry forgot to ping, nuria_ ^ [17:56:45] (I'm gonna go grab some food now, but I'll be right back) [17:56:56] milimetric: right, but I thought the place to normalize queries would be restbase [17:57:02] milimetric: k np [17:58:07] nuria_: yeah, the code I pointed to is restbase, the util function that normalizes the {project} parameter right now [17:58:24] milimetric: ah sorry, ok i get it now [17:58:38] milimetric: my mistake [18:44:02] HaeB: yt? [18:44:20] still am ;) [18:46:34] madhuvishy: wondering if you have ever run into this problem [18:46:44] eventlogging beta on deployment-eventlogging03 [18:46:51] is loading the wrong code path [18:46:54] if i run ipython [18:46:57] and print sys.path [18:46:59] i see [18:47:03] /srv/deployment/eventlogging/eventlogging [18:47:05] in the path [18:47:08] before the correct one [18:47:14] which is /srv/deployment/eventlogging/analytics [18:47:22] and i'm not sure how eventlogging/eventlogging is getting into the path! [18:47:29] aah [18:47:31] i suspect it is something fancy i did a long time ago [18:47:35] but i don't know!
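The "lowercase project" fix being discussed amounts to normalizing the {project} path parameter so case variants hit the same Cassandra rows. The real code lives in JavaScript (lib/aqsUtil.js in restbase/AQS); this is a minimal Python sketch of the idea, with the function name made up for illustration:

```python
def normalize_project(project):
    """Illustrative sketch: lowercase and trim the {project} parameter
    so e.g. 'EN.wikipedia.OrG' and 'en.wikipedia.org' resolve to the
    same stored project key. Not the actual AQS implementation."""
    return project.strip().lower()

# The case from the chat: a user querying with mixed case.
print(normalize_project("EN.wikipedia.OrG"))  # en.wikipedia.org
```

Doing this once at the restbase layer, as nuria and milimetric settle on, means the data loaded into Cassandra (already normalized at refine time) never needs case-insensitive lookups.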
[18:47:36] no i haven't run into that [18:47:38] hm [18:47:43] it's not in /etc/profile.d [18:47:49] it seems to be global [18:47:53] as root it is in my path too [18:48:06] oh interesting [18:48:16] hmmm [18:48:34] i thought maybe it was just me [18:48:42] but even the upstart managed daemons are running from there [18:48:48] they launch from the correct bin/ script [18:48:50] oh [18:48:55] but the libs are loaded from eventlogging/eventlogging [18:49:06] even though the upstart scripts export PYTHONPATH=.../eventlogging/analytics [18:49:27] because something is sticking el/el in the sys.path [18:49:33] right [18:54:45] HaeB: I got some questions about how to [18:55:04] HaeB: compute "overall unique devices" with our data [18:55:13] ottomata: https://github.com/wikimedia/operations-puppet/blob/b34689bda1f5c201f1cbe7467e9207fdc71f43d9/modules/eventlogging/manifests/server.pp#L54 [18:55:31] HaeB: so i shared your approach, with many caveats saying it was WIP blah blah [18:55:57] madhuvishy: ja but, that is just for setting where to run eventlogging services from [18:56:05] and it is overridden in its use in the role [18:56:08] to eventlogging/analytics [18:56:17] and there's nothing in there that i can find that is altering the sys.path [18:56:20] ya right was just going to go look at role [18:57:25] yeah [18:58:33] FOUND IT [18:58:40] /usr/local/lib/python2.7/dist-packages/easy-install.pth [18:58:46] /srv/deployment/eventlogging/eventlogging [18:58:49] i think leftover from a pip install [18:59:00] pip uninstall didn't remove it [18:59:01] dunno why [18:59:52] nuria_: cool (shared with whom?) [19:00:14] saw your comments at https://phabricator.wikimedia.org/T138027 btw...
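The mechanism behind ottomata's find is Python's .pth handling: at interpreter startup, the site module scans the site-packages directories, and every directory listed in a .pth file there (such as a leftover easy-install.pth) gets added to sys.path, regardless of PYTHONPATH. A minimal reproduction of the effect (directory names here are made up; the real culprit was /usr/local/lib/python2.7/dist-packages/easy-install.pth):

```python
import os
import site
import sys
import tempfile

# Simulate a site directory containing a leftover easy-install.pth
# pointing at a stale checkout, like the one left behind by pip.
site_dir = tempfile.mkdtemp()
stale_checkout = os.path.join(site_dir, "eventlogging-stale")  # made-up path
os.makedirs(stale_checkout)

with open(os.path.join(site_dir, "easy-install.pth"), "w") as f:
    f.write(stale_checkout + "\n")

# site.addsitedir processes .pth files the same way startup does for the
# real site-packages dirs -- which is how the old deploy dir kept landing
# on sys.path even though upstart exported a different PYTHONPATH.
site.addsitedir(site_dir)

print(stale_checkout in sys.path)  # True
```

This also explains why `pip uninstall` alone didn't help: the .pth entry (and the old eggs) had to be removed by hand, as ottomata ends up doing.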
much appreciated, i may still weigh in too [19:00:17] aah [19:00:24] HaeB: Denny Vrandečić [19:00:55] HaeB: not sure of irc [19:01:56] it seems he's in this channel ;) [19:02:37] HaeB: i think your approach overall is quite a smart usage of data and he is also a good person to run it by [19:02:53] HaeB: ah yes cc dennyvrandecic [19:03:02] hi denny! [19:03:52] phew, madhuvishy got it, had to manually remove all the old installed eggs and bin files [19:04:00] HaeB: ya, the overall cookie might happen but it is true we want to avoid overproliferation of cookies if possible [19:04:12] ottomata: ah the joy of python eggs [19:07:52] :) [19:15:48] Analytics-Kanban, Patch-For-Review: Event Logging doesn't handle kafka nodes restart cleanly - https://phabricator.wikimedia.org/T133779#2442626 (Ottomata) Deployed and running in beta! [19:32:37] HaeB: hi! [19:36:20] dennyvrandecic: yes, so this is an idea for how to get the best possible lower bound for the (unknown) number of global unique devices from the available data. as nuria may have mentioned, we haven't put the resulting number forward publicly yet [19:37:18] HaeB: understood. the sum(top(project-uniques))-approach? [19:40:42] milimetric: the aqs stuff is not in gerrit right? [19:40:51] milimetric: we do pull requests correct? https://github.com/wikimedia/analytics-aqs [19:40:57] nuria_: no, it's in gerrit now [19:41:03] aoaoaoa [19:41:06] k [19:41:18] they refactored a while back and we decided to keep everything we maintain consistently in gerrit [19:41:28] aqs is actually really easy / fun to work with now, you can forget all the old complications [19:41:45] (including unit testing, etc, all easy) [19:42:09] so nuria_ if something doesn't seem obvious or easy let me know and I'll improve the docs [19:42:26] milimetric: np, will do some work on that now [19:45:11] HaeB: what is the question? :)
and extending the last-access cookie to all of wikipedia.org (https://phabricator.wikimedia.org/T138027 ) would give global numbers for wikipedia, at least (which is probably very close to the number for all wikimedia project) [19:47:40] i don't have questions, but nuria mentioned above she had some and shared it with you for that purpose? [19:48:13] HaeB: yes, we discussed it a bit, and had a few ideas for trying to normalize and sanity check the idea, but nothing world-shaking [19:48:49] cool, would be happy to hear them [19:49:27] HaeB: things like taking the country data and normalize it with "how many people have internet access" country level data, and then see if there are outliers that need further investigation [19:49:53] HaeB: but this are more issues with the sum-top approach, less with the global domain cookie [19:51:57] yes, might be interesting. of course we know that WP has uneven popularity per country, as measured in pageviews per internet user, or indeed pageviews per unique devices... [19:52:01] HaeB: I also expect the global domain cookie approach on wikipedia to be very close to wikimedia overall (although it would be lovely to have a real wikimedia-overall numbers, but x-domain cookies seem rather no-no for several reasons) [19:52:38] ...my colleague anne gomez is in fact looking at that as a possible input to identify underserved (by wikimedia) countries as part of WMF's "New Readers" projects [19:53:00] *nod* [19:53:15] HaeB: sure! but there is the question whether the outliers are real (i.e. underserved) or artefacts of s.th. else [19:53:50] yep, might be interesting to check [19:54:19] HaeB: e.g. countries with a strong language split, e.g. Uzbekistan, Kasachstan, would probably look different from more linguistically homogeneous countries (e.g. 
Japan), and that would not be due to underserving but due to measurement method [19:55:02] I am sure Anne is aware of that, though - the list would be just an invitation for further investigation anyway, not a list of decisions :) [19:55:35] the other point is I am still not totally convinced by the unique-devices metric itself, which is underlying a lot of these metrics [19:55:49] but that wouldn't be specific to either of these aggregation methods [19:55:56] and not sure if we want to go into that now [19:55:59] ...this would probably go back to the reliability of the unique devices metric itself (not mainly the summing approach). it relies on some assumptions that may be more valid in some countries than others https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices/Last_access_solution [19:56:30] agreed [19:57:20] the other thing I would love to see is a projection into the past of these numbers, based on some correlated metric [19:57:38] ("The signature is calculated with a hash of (ip, user_agent, accept_language) per project" - last time i checked t-online in germany was still changing consumers' IPv4s daily, for example) [19:58:40] HaeB: it doesn't matter whether your IP changes daily [19:59:10] HaeB: yeah, also I am unsure with how many user_agents a US-phone user with the twitter app, fb app, google search app, and chrome would be counted, given that they might have different user agents. nuria says that these seem negligible though, which is good [19:59:18] HaeB: once you have a cookie (2nd request on a 30 day period) teh Ip is of no consequence [20:00:33] HaeB: so "daily changes of IP" have no effect on overall count [20:01:02] *the IP [20:01:54] nuria_: user reads every morning on commute, hits three or four wikipedia articles, gets counted. next day, she does the same thing, IP has changed, she is counted again. am i misunderstanding, HaeB?
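The signature HaeB quotes from the wikitech page can be sketched like this. The actual computation happens in Hive on the cluster; the hash function and separator below are illustrative choices, not the production code:

```python
import hashlib

def device_signature(ip, user_agent, accept_language):
    """Sketch of a hash of (ip, user_agent, accept_language), per the
    quoted Last_access_solution doc. sha512 and '|' are assumptions."""
    raw = "|".join((ip, user_agent, accept_language))
    return hashlib.sha512(raw.encode("utf-8")).hexdigest()

# HaeB's concern: a daily IP change (e.g. T-Online customers) yields a
# brand-new signature for the same device if no cookie is present.
day1 = device_signature("203.0.113.7", "Mozilla/5.0", "de-DE")
day2 = device_signature("203.0.113.8", "Mozilla/5.0", "de-DE")
print(day1 != day2)  # True
```

Which is exactly why the cookie matters in nuria's reply: once the WMF-Last-Access cookie is set on the second request, counting no longer depends on the signature at all, so IP churn is harmless for cookie-accepting devices.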
[20:02:14] nuria_: hmm ok we don't need to sort this out right now, but what about if i'm browsing with cookies off and come back with a new IP - won't that be counted as a new device? (" we also count as uniques requests with nocookie=1 whose signature appears only once in a day or month.") [20:02:28] dennyvrandecic: you are misunderstanding, yes [20:02:43] dennyvrandecic: she is only counted once on a monthly period [20:02:49] i haven't thought about this in depth, but if that's not the case, the documentation should be clarified [20:03:02] nuria_: ah right because she has the cookie (facepalm) sorry, my bad [20:03:14] dennyvrandecic: exactly [20:06:46] dennyvrandecic, HaeB : Users browsing with cookies off whose IPs are changing continuously would be overcounted but the more common pattern for "users browsing with cookies off" is users browsing from their desktop browser (in which case their IP is not changing that much). Cookies off is not a common pattern. [20:07:28] Those users would be overreported just like users coming up under the same IP on mobile due to NAT-ing are underreported [20:08:00] I think HaeB's point is that German T-Online users do change their IP often [20:08:00] yeah well i was just saying that home IPs for germany's largest ISPs are in fact (or were until fairly recently) changing every 24 hours [20:08:54] HaeB: as i said, that for the bulk of our userbase in germany is not significant [20:09:21] HaeB: as the majority of users do accept cookies [20:09:43] HaeB: the percentage of cookie-less browsing would vary country by country [20:09:57] HaeB: but I think we better worry about cookies being evicted rather than not used [20:10:14] HaeB: even if the 30 day period has not expired
but the concern might be that just a few of these users could distort the stats disproportionally) [20:14:00] HaeB: cookies off distorting stats disproportionally or mobile NAT-ing distorting stats disproportionally? [20:14:09] again though, we don't need to sort that out right now... perhaps best discussed in form of a thread at the talk page https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices/Last_access_solution (for later references) [20:14:11] cookies off [20:23:14] HaeB: for "overall estimation of users with cookies off " you have to rely in overall internet numbers. If you go with the assumption that users that have cookies off might also have javascript off or "DNT on" you can refer to those numbers and use them as a proxy. we calculated % of request with js off two years ago and it was <5% (of requests not devices). [20:23:21] HaeB: there is a more recent study: https://upload.wikimedia.org/wikipedia/commons/e/e6/Analysis_of_Wikipedia_Portal_Traffic_and_JavaScript_Support.pdf [20:26:01] HaeB: sorry , was miss-rememebering, our study was pretty rough but it came of ~3% without js enabled feb 2015 [20:27:44] HaeB: here is is, the other one i linked to will provide better estimates: https://www.mediawiki.org/wiki/Analytics/Reports/Clients_without_JavaScript [20:28:09] HaeB: there are -again- requests, not users [20:31:20] HaeB: overall the percentage of users browsing with cookies off is likely orders of magnitude lower than users deleting their cookies [20:31:42] HaeB: and that is a problem in any counting method anytime anywhere that you might want to use [20:32:02] ok, interesting data! [20:32:22] HaeB: see cookie deletion and effects on google analytics data: https://www.e-nor.com/blog/google-analytics/cookies-and-cookie-deletion-in-google-analytics [20:32:23] so if we imagine a population of 100 readers with one pageview daily each, and daily changing IPs... [20:32:48] ...or devices instead of readers... [20:32:55] sure [20:33:00] busted! 
[20:33:03] ...and three of them have cookies disabled... [20:33:05] you owe 1$ [20:33:28] that seems far too high [20:33:33] but sure [20:33:35] we would measure 187 instead of 100 devices? [20:34:06] yeah so i took the 3% from your assumption above. and of course this is all a bit hypothetical [20:34:16] 3% of requests [20:34:26] not devices but ok [20:36:06] that scenario will overcount devices, sure [20:36:12] cc HaeB [20:37:01] milimetric: for aqs before you could change backend to be sqlite instead of cassandra [20:37:57] milimetric: can we do that still? [20:37:58] nuria_: I think it's still like that, I don't have cassandra and the unit tests run fine [20:38:11] milimetric: k, might be the default then [20:38:29] yeah, nuria_ run "npm test" and it'll be like: [20:38:31] Cassandra not available. Using SQLite backed for tests [20:38:31] Running with SQLite backend [20:38:59] milimetric: right, but my tests fail, do your succeed? [20:39:28] *yours [20:39:57] milimetric: if they fail too no worries i can try to fix those too, that way i learn a bit more how all this is set up [20:40:30] nuria_: mine succeed, did you rm -rf node_modules and npm install? [20:40:52] milimetric: I did a clean install [20:41:19] milimetric: do you want to do npm update and see if they still succeed with latest master code? [20:42:30] milimetric: no worries, it's some dependency node module that is not updated, will look into it [22:12:52] (PS5) Milimetric: [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) [22:13:28] (CR) jenkins-bot: [V: -1] [WIP] Process Mediawiki page history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295693 (https://phabricator.wikimedia.org/T134790) (owner: Milimetric)
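The 187 in dennyvrandecic and HaeB's exchange falls straight out of the stated assumptions (100 devices, one pageview daily, daily-changing IPs, 3 of them with cookies disabled, a 30-day month):

```python
DEVICES = 100
COOKIELESS = 3
DAYS = 30

# Devices that accept the last-access cookie are counted once per month,
# regardless of IP churn.
counted_with_cookie = DEVICES - COOKIELESS

# A cookieless device with a daily-changing IP produces a fresh
# (ip, user_agent, accept_language) signature every day, so it gets
# counted once per day instead of once per month.
counted_without_cookie = COOKIELESS * DAYS

total = counted_with_cookie + counted_without_cookie
print(total)  # 187
```

As both of them note, the 3% figure was for requests without JavaScript, not for cookieless devices, so this is a deliberately pessimistic hypothetical rather than an estimate of the real overcount.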