[15:19:30] morning guys [15:20:50] morning! [15:20:54] are you still in SF drdee? [15:21:02] no but my bio clock is :D [15:21:23] drdee: morning [15:21:33] drdee: I am getting the 500M bump using Evan's code as well [15:21:43] dear lord, daylight saving, mini jet lag, i was just very tired [15:22:03] average_drifter: morning, and i guess that's good news :D or :( [15:23:52] in a way it's good news [15:24:12] but when I asked Evan "are you getting the 500M bump?" [15:24:44] he said "no, I'm not. Limn does not show this" [15:25:19] [travis-ci] master/256ede7 (#76 by Diederik van Liere): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5412620 [15:28:11] ottomata, wanna deploy filter? [15:29:44] yeah! [15:29:44] ok [15:33:51] should I make the package for webstatscollector? [15:35:07] that might be useful, since we have to puppetize the stuff on locke soon [15:35:22] can you just make it install the few binaries into /usr/local/bin ? [15:35:24] as is? [15:35:57] [travis-ci] master/9f329b2 (#77 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5412884 [15:36:06] ottomata: yes [15:36:09] average_drifter, did you push your fixes? [15:36:21] ottomata: the fixes got merged through gerrit [15:36:27] k [15:36:27] ottomata: drdee merged them [15:36:35] no let's not package this [15:36:38] right now [15:36:44] we should first see if it actually works [15:36:45] ok [15:38:53] ok, deploying new fitler, watching logs [15:39:25] bwerrrp [15:39:25] Pipe restarted: /a/webstats/bin/filter | log2udp -h 127.0.0.1 -p 3815 [15:39:26] Segmentation fault [15:40:05] undeploying [15:41:25] shoudl I be on master branch or time travel branch? [15:41:34] average_drifter, drdee? [15:41:51] time_traverl [15:41:54] time_travel [15:42:02] i thought you guys fixed master and were using that now [15:42:03] that was where the fixes got merged [15:44:01] ok, deplying again [15:45:10] looking better! [15:46:51] :) [15:47:01] morning everyone [15:47:06] * milimetric is super jetlagged [15:47:45] morning milimetric [15:48:05] morning drdee [15:48:07] wait guys [15:48:18] you wait guys! [15:48:22] yeah! [15:49:34] well that was anticlimactic. I waited and nothing happened [15:49:51] sorry [15:49:58] ottomata, I'm told I should try to learn Oozie, so I'm reading your wiki page [15:50:36] oh yeah! [15:50:39] let's def talk [15:50:41] i would love love love [15:50:44] more help in this area [15:50:55] (i don't really want to do analysis :p) [15:51:29] i started to give dschoon a runthrough on friday [15:53:25] can I look on said wikipage as well ? [15:53:42] averag_drifter: yes you should [15:53:56] ottomata: maybe you can teach both milimetric and average_drifter [15:54:08] https://www.mediawiki.org/wiki/Analytics/Kraken/Oozie#An_Example [15:54:16] oops [15:54:16] https://www.mediawiki.org/wiki/Analytics/Kraken/Oozie [15:55:12] average_drifter, oozie page ^ [15:56:30] yup! 
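
For readers following the deploy above: the webstatscollector pipeline pipes the filter binary into log2udp, which forwards each filtered log line as a UDP datagram to the local collector (the pipe that segfaulted and was then redeployed from the right branch). A minimal Python sketch of that forwarding idea, assuming only the host and port quoted in the pipe command; this is an illustration, not the real log2udp C program:

    #!/usr/bin/env python3
    # Minimal sketch of what "filter | log2udp -h 127.0.0.1 -p 3815" does:
    # read filtered log lines from stdin and forward each one as a UDP
    # datagram to the local collector. Illustrative only.
    import socket
    import sys

    HOST, PORT = "127.0.0.1", 3815  # taken from the pipe command quoted above

    def main():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for line in sys.stdin:
            # one datagram per filtered log line
            sock.sendto(line.encode("utf-8"), (HOST, PORT))

    if __name__ == "__main__":
        main()
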
and dschoon [15:56:31] would love to [15:56:57] milimetric: thanks [15:57:28] about webstatscollector git repo [15:57:42] ^demon and I switched time_travel and master [15:57:45] dschoon will be online for standup, maybe we can do it right after [15:57:59] he has to go somewhere [15:58:01] i have to leave a bit early today, i'm flying to mexico this eve [15:58:08] yeah but he said he'd be at standup from home [15:58:13] and then leave :) [15:58:19] so master is now good for development [15:58:39] the old master is called "old_master" and contains all average_drifter's work [15:58:51] and iirc we deleted the time_travel branch [15:58:57] sorry for not having emailed this [15:59:14] bottom line: master matches deployment [15:59:21] no more funky branch business [16:00:47] average_drifter, ottomata; does that make sense? [16:02:25] ottomata: wanna talk about locke & europium? [16:02:30] i'd recommend staying on a topic branch then until the code is stable before merging it to master [16:03:01] but maybe first reading our new and improved Gerrit wiki pages to see how to best do that [16:03:45] milimetric:yes for local branches but this was about remote branches [16:03:51] right [16:04:19] drdee, yeah that's cool [16:04:28] that's what I thought we were doing, that's why I deployed originally from master branch [16:04:34] great! that's right [16:04:37] (originally === today) [16:04:42] drdee, sure [16:04:48] my main q about locke right now [16:05:01] is if we can use the multicast stream, instead of deploying new configs to all frontends [16:05:13] I will respond to that RT ticket with that Q, and ask mark and asher [16:05:24] k [16:12:21] 12 doesn't make sense, we don't have Edit UI [16:12:43] 9 is not really a bug, more of an enhancement [16:13:22] delete 12? [16:13:30] 10 was fixed [16:13:33] but not really [16:13:48] drdee: ok, so master is the main branch for webstatscollector [16:14:01] average_drifter: yes it's back to business [16:14:08] basically if you have a bad graph definition on the dashboard, it makes other graphs do weir stuff still, so I can rephrase that one [16:15:29] oh i see, kraig moved all of these drdee [16:15:37] ok [16:15:46] kraig and i have been organizing them [16:16:03] by theme, we are not yet done [16:16:30] I am looking forward to talking to erosen [16:16:47] ottomata , milimetric , drdee he will be here today right? :) [16:17:03] yes he will be [16:17:09] awesome [16:17:19] milimetric: 104 is done right? [16:17:36] so drdee I wouldn't move those unless they're bothering someone [16:17:36] they're pretty low priority [16:17:48] i am not moving anything :) [16:18:04] just trying to get an accurate picture of what's still relevant [16:18:44] yep drdee, marked it done [16:19:35] can you help me combing through the tasks and close stuff that's either done or no longer relevant? [16:22:17] 95 still an issue? [16:23:09] "Reportcard: Depend on Limn properly" [16:47:45] dschoon, drdee, do you remember the names of the 3 phases of kraken cluster we defined in the arch review? [16:47:58] is MVC what we have now? [16:48:03] MVC, Initial Base Cluster, Prod [16:48:04] yes. [16:48:06] ok cool [16:48:10] danke [16:50:03] and Initial Base Cluster is what we have now, just puppetized? [16:51:36] mostly. [16:51:40] check the meeting notes [16:51:44] it has an outline [16:55:22] Prod is officially called Target Cluster [16:56:48] ah, oh well [16:56:49] just sent [16:56:56] yeah i was looking for meeting notes…did you email thos? 
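
On the locke question above (consuming the existing multicast stream rather than deploying new configs to all frontends): a minimal sketch of what such a multicast consumer looks like. The group address and port below are placeholders, not the production stream settings:

    #!/usr/bin/env python3
    # Sketch of subscribing to a multicast UDP log stream instead of having
    # every frontend send to a new destination. Group/port are placeholders.
    import socket
    import struct

    GROUP, PORT = "239.0.0.1", 8420  # hypothetical values, not the real stream

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Join the multicast group on all interfaces.
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)
        # each datagram carries one (or more) squid-format log lines
        print(data.decode("utf-8", "replace").rstrip())
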
[16:57:36] it's in Mingle actually, hang on lemme try to find the view [16:58:01] https://mingle.corp.wikimedia.org/projects/analytics/cards?favorite_id=722&view=MVC+Value+Path [16:58:06] ottomata, dschoon ^ [16:58:31] so there are like 5 things under Initial Base Cluster [16:59:10] yep, that's right. [17:01:09] ottomata hangout? [17:01:39] loooaooding [17:06:17] there it goes [17:09:18] average_drifter, wanna join the hangout? [17:09:23] we are going to talk about oozie [17:09:29] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [17:09:48] kraigparkinson: https://plus.google.com/hangouts/_/7fecaf1539705fe253a0c0359d83ab820e674e33 [17:18:11] sure [17:18:14] still there ? [17:28:00] * YuviPanda pokes milimetric [17:28:02] hey! [17:28:07] hey YuviPanda [17:28:19] I'm on a call right now [17:28:25] but what's up [17:28:45] milimetric: when will the call be over? [17:28:55] it's sort of ad-hoc, so we don't know [17:29:00] milimetric: so yeah, the reportcard was flat out for a while. [17:29:04] we're doing a team training on oozie [17:29:05] I reverted two of my last changes [17:29:15] and then I deployed [17:29:19] i checked but didn't find any issues [17:29:23] err [17:29:25] any fidxes [17:29:26] right, if you could write a script with the commands that you run [17:29:27] it was still flatlines [17:29:34] I have written one :) [17:29:43] oh, where's that? [17:29:45] milimetric: seems to be a cache problem somewhere. [17:29:58] milimetric: tfinc just saw the mobile reportcard, and saw the data from reportcard.wmflabs.org instead [17:30:02] a hard refresh fixed it [17:30:29] so where's the script to reproduce the problem? [17:30:47] milimetric: it doesn't reproduce the problem [17:30:51] hmm [17:30:51] wait [17:31:02] the problem seems intermittent, rather [17:31:07] one moment, let me see if I can repro it [17:31:31] sure, even if it's intermittent, I would need the steps because I don't know what steps you're taking with limnpy, etc. [17:32:21] yeah [17:32:22] i know [17:32:37] milimetric: http://pastebin.com/r2UfgPE5 is my update script [17:33:02] I run that from localhost [17:33:02] not the most elegant [17:33:15] ok, cool, that helps [17:33:19] i'll take a look in a sec [17:33:35] i'm running it now [17:33:38] to see if the problem recurs [17:33:44] re-occurs? [17:36:59] milimetric: http://mobile-reportcard-dev.wmflabs.org/ is flat again now [17:37:01] I just ran that script [17:37:07] checking [17:37:45] milimetric: I think it was changes introduced in the code in 876ed9191569bb76508b2089e69ff4e7c7d915d0 and corresponding limnpy changes to support that one frome erosen might be causing it? [17:38:00] since when I reverted the data runs after that, and deployed it fixed itself [17:38:06] hmm [17:38:09] yeah, it looks like the graph files are broken erosen / YuviPanda [17:38:10] :) [17:38:16] we'll have to debug limnpy [17:38:32] milimetric: can you tell me more about the error? [17:39:09] erosen: see mobile-reportcard-dev.wmflabs.org [17:39:15] erosen: data is tehre but graphs are empty [17:39:19] * erosen looking now [17:39:30] erosen: so I reverted the data to state before the latest limnpy update and it works [17:39:55] interesting... [17:40:11] so is the current data the broken data as far as you know? 
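
Since part of the flatline confusion above turned out to be browser caching (a hard refresh fixed it), one way to take the browser out of the loop while debugging is to fetch the deployed graph JSON directly with a cache-busting query parameter and eyeball the datasource URLs it points at. A small sketch; the graph path below is a hypothetical id, not a real one:

    #!/usr/bin/env python3
    # Fetch a deployed graph definition with a cache-busting parameter so we
    # see exactly what the server serves now, not what the browser cached.
    # The graph path is a hypothetical example, substitute a real graph id.
    import json
    import time
    from urllib.request import urlopen

    BASE = "http://mobile-reportcard-dev.wmflabs.org"
    GRAPH = "/data/graphs/example-graph.json"   # hypothetical path/id

    url = "%s%s?nocache=%d" % (BASE, GRAPH, int(time.time()))
    with urlopen(url) as resp:
        graph = json.load(resp)
    print(json.dumps(graph, indent=2)[:500])  # inspect the datasource urls
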
it isn't working fo rme [17:40:16] YuviPanda: ^^ [17:40:31] yeah, current data is broken [17:40:38] or rather, current graph files are broken [17:41:19] ya [17:41:45] yep [17:42:00] YuviPanda: this error "Uncaught TypeError: Cannot call method 'toFixed' of undefined " looks familiar [17:42:09] it is usually what happens when the datasource url is wrong [17:42:14] hmm [17:42:24] but the datasource seems to have not changed? [17:42:35] and that was part of the most recent limnpy changes [17:42:38] hmm [17:42:59] yeah, I updated to latest [17:47:50] YuviPanda: problem identified, fixing presently [17:50:20] erosen: sweeet :) [17:55:48] drdee [17:55:53] ottomata [17:55:54] is there a new UDF with the zero pig stuff? [17:55:58] yes [17:56:01] do I need to update .jars? [17:56:04] in hdfs? [17:56:17] yes [17:56:23] erosen: what does the zero pig do? [17:56:56] it does a lookup of the 'cs' key and retrieves as value the carrier name [17:57:31] drdee, can you do me a favor? [17:57:34] when you make new UDFs [17:57:36] always [17:57:38] can you put them in /libs in hdfs? [17:57:40] sure [17:57:41] average_drifter: did that make sense? [17:57:48] danke, that way I don't have to figure out how to build them :) [17:57:50] average_drifter: really drdee is the one to ask [17:57:54] ottomata 1 sec [17:57:56] k [17:58:03] erosen: I think so yes [18:04:52] ottomata, filter was deployed succesfully? [18:15:30] yup [18:18:35] hmm, drdee [18:18:35] org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Could not resolve org.wikimedia.analytics.kraken.pig.GeoIpLookupEvalFunc u [18:18:36] sing imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] [18:18:37] any clues? [18:18:45] this is from device job [18:18:48] not zero [18:18:51] one of the many failed workflows [18:19:04] are the dat files in the user directory on hdfs? [18:19:17] yes [18:19:24] and, many of the workflows succeed [18:19:31] can you put the log file in a gist? [18:19:42] (the pig log file i mean) [18:21:18] it is 121K [18:21:21] i will email it to you [18:22:12] emailed. [18:22:31] thx [18:24:45] which pig script are you using? [18:24:52] oh wiat [18:24:54] webrequest_zero_hour_carrier_country.pig [18:25:18] (previously named webrequest_zero_hourly.pig) [18:25:27] OH [18:25:27] sorry [18:25:28] no [18:25:32] that is the one we are about to try [18:25:34] we are using [18:25:38] webrequest_hour_country_device_vendor.pig [18:26:51] erosen: I need to run the embr_py on stat1 [18:27:02] erosen: how can I install with pip stuff on stat1 on my local user please ? [18:27:10] pip install --user [18:27:13] thanks [18:27:27] optionally with -e . [18:27:39] erosen: like for example if I want to install netaddr [18:27:43] erosen: what would I run ? [18:27:44] aah [18:27:57] i htink you should be able to install squidpy [18:28:10] but for netaddr you should be able to just do `pip install --user netaddr` [18:29:48] erosen: it works, thanks :) [18:29:53] nice [18:31:59] erosen: how can I tell squidrow.py to not geolocate ? commenting line 95 ? [18:32:05] hmm [18:32:08] 95: load_pygeoip_archive() [18:32:09] it shouldn't geolocate by default [18:32:11] oooh [18:32:21] you can just comment that out [18:32:23] ok [18:32:40] it doesn't actually do geolocation, but just loads the maxmind files [18:33:14] erosen: yes, true, but I don't have those in my home on stat1.. and it would take some time to gunzip them or copy them.. [18:33:26] erosen: can we have cidr_ranges.json in our github repo ? 
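
The Wikipedia Zero UDF described above maps the 'cs' key (an MCC-MNC code from the X-CS header) to a carrier name. A minimal sketch of that lookup on the Python side, assuming a JSON mapping of MCC-MNC codes to carrier names (the mcc_mnc.json file that comes up just below); the file's exact structure is an assumption made for illustration:

    #!/usr/bin/env python3
    # Sketch of the carrier lookup the zero UDF performs: take the X-CS value
    # (an "MCC-MNC" code) from a request and resolve it to a carrier name.
    # The assumed structure of the mapping file (code -> {"name": ...}) is a guess.
    import json

    def load_carriers(path="mcc_mnc.json"):
        with open(path) as f:
            return json.load(f)

    def carrier_name(cs_value, carriers):
        """Map an X-CS value such as '502-16' to a carrier name, or None."""
        entry = carriers.get(cs_value)
        return entry.get("name") if isinstance(entry, dict) else entry

    if __name__ == "__main__":
        carriers = load_carriers()
        print(carrier_name("502-16", carriers))  # hypothetical carrier code
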
[18:33:32] it should just fail [18:33:40] we can't have cidir_ranges publicly [18:33:46] because it has carrier ip ranges [18:33:48] I was otld [18:33:50] ok, how about mcc_mnc.json ? [18:33:54] otld = ? [18:33:55] old ? [18:34:14] mcc_mnc.json is fine to have publicly [18:34:22] but there is a script which dynamically generates it [18:34:24] ok, I'll commit mcc_mnc.json then [18:34:28] sounds good [18:42:22] drdee, ok i just ran pig script with all jars in my local dir [18:42:23] Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 [18:42:24] at java.util.ArrayList.RangeCheck(ArrayList.java:547) [18:42:24] at java.util.ArrayList.get(ArrayList.java:322) [18:42:24] at org.wikimedia.analytics.kraken.pig.GeoIpLookupEvalFunc.outputSchema(GeoIpLookupEvalFunc.java:203) [18:43:03] while (geoIpLookup.getNeededGeoFieldNames().iterator().hasNext()) { [18:43:03] GeoIpLookupField field = geoIpLookup.getNeededGeoFieldNames().get(i); [18:52:17] erosen, YuviPanda, did you guys come to a good resolution of the limnpy graph generation problem? [18:52:34] milimetric: just get back to it after the oozie fun [18:52:52] let me know if i can help [18:52:56] :) [18:52:58] or if you're busy i can try [18:53:08] milimetric: II [18:53:21] milimetric: It should be fine, i think i have the bug pinned down [18:59:50] YuviPanda: should be fixed now [19:00:01] erosen: updating limnpy [19:00:11] YuviPanda: if you pull limnpy it fixes at least one of the problems ;) [19:01:52] erosen: which one? [19:01:53] erosen: updating [19:02:09] it fixes the format of the graph files [19:02:16] i see source_col fixes [19:02:18] let me deploy [19:02:21] yeah [19:02:23] k [19:02:50] i just say that it fixes at least one problem to leave the possibilty that there may be others [19:02:54] :D [19:04:02] erosen: well, I deployed but still at http://mobile-reportcard-dev.wmflabs.org/ [19:04:07] nothing :( [19:04:10] hrm [19:04:18] works for me.. [19:04:19] erosen: weird [19:04:24] erosen: works for me in incognito [19:04:27] some caching issue somewhere [19:04:39] hard refreshing from the start [19:04:48] haha [19:04:49] yeah i've often had to fully clear cache in chrome [19:04:50] I see errors [19:04:51] in console [19:05:03] in chrome? [19:05:08] Uncaught Error: Cannot make scale shape 'line'! xScale=function; yScale=object [19:05:13] yeah [19:05:14] four of those [19:05:15] and [19:05:17] yeah that was the old error [19:05:19] Uncaught TypeError: Cannot call method 'toFixed' of undefined [19:05:25] yup [19:05:31] those were the two errors I was seeing before the fix [19:05:36] but now I see niether [19:05:42] erosen: yeah, me neither in incognito [19:05:43] how are you clearing your cache? [19:05:52] erosen: ctrl+shift+r [19:06:05] erosen: in incognito I don't see it. [19:06:09] YuviPanda yeah, i've found that method to be unreliable [19:06:18] ugh [19:06:27] shouldn't that technically work? [19:06:32] it should [19:06:35] but it doesn't sometimes [19:06:42] hmmm [19:06:43] sigh [19:06:47] okay [19:06:54] try going into chrome settings -> advanced -> Privacy -> clear brwosing data [19:07:03] and just check empty cache [19:07:06] erosen: yeah, that should work, but I usually don't want to do that ;) [19:07:08] i think that worked for me [19:07:11] it's fine though. [19:07:15] i know what you mean [19:07:17] i'll just make sure to check in incognito [19:09:31] [travis-ci] master/fc6071a (#78 by Diederik van Liere): The build has errored. 
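
The IndexOutOfBoundsException above comes from the loop quoted at 18:43:03: it builds a fresh iterator on every test, so hasNext() is always true for a non-empty list and the manual index runs past the end (hence "Index: 1, Size: 1"). A small Python sketch of the same pattern and the straightforward fix; the names are placeholders, not the actual kraken classes:

    # Sketch of the loop bug behind the IndexOutOfBoundsException quoted above.
    # Names are placeholders, not the actual kraken classes.

    needed_fields = ["countryCode"]   # pretend one geo field was requested

    def buggy_loop():
        i = 0
        # A fresh iterator is built on every test, so the condition is always
        # true for a non-empty list; i keeps growing and runs past the end,
        # the Python analogue of "Index: 1, Size: 1".
        while next(iter(needed_fields), None) is not None:
            field = needed_fields[i]   # raises IndexError once i == 1
            i += 1

    def fixed_loop():
        # Iterate the collection once; no manual index to run off the end.
        for field in needed_fields:
            print(field)
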
http://travis-ci.org/wikimedia/kraken/builds/5419023 [19:09:44] average_drifter: how did the embr_py run go? [19:13:05] DarTar: given what you were telling adam this morning; I take it I should write & deploy my stats collection thing sooner rather than later? [19:13:35] yes please :) [19:14:24] it has been reviewed by all the relevant folks, I really want to avoid that it be put on hold for the next 3 months [19:15:11] but it'd be good to have this as a use case in a discussion with legal abt data from uniques [19:18:43] erosen: still running [19:20:29] erosen: 21/62 days processed [19:20:40] cool [19:21:16] erosen: https://github.com/wikimedia/metrics/blob/mobile_pageviews/pageviews/embr_py/mobile_report.py [19:21:25] erosen: it takes the files to process from STDIN [19:21:50] erosen: then it filters out only the ones with site()=="M" [19:21:53] cool [19:21:58] sounds right [19:22:05] erosen: rdata is a dict of dicts. first key is time and second key is language [19:23:16] nice. you might find a deafultdict or a counter useful for this sort of thing in the future: http://docs.python.org/2/library/collections.html#collections.defaultdict, http://docs.python.org/2/library/collections.html#collections.Counter [19:23:38] reading [19:27:00] drdee, did you see I switched our meetings up a little bit? [19:27:22] today's meetings [19:27:23] ? [19:29:29] yep [19:32:26] anyway, do you have an hour you can spare for me in the next 90 minutes? :) [19:33:58] always! [19:34:09] let me just grab a cup of coffee and then i am good to go [19:35:06] cool, lmk when you're ready [19:38:00] should be ready in 15m [19:42:08] erosen: thanks :) works fine now :) [19:42:34] YuviPanda great, let me know when anything else comes up that points to limnpy as the source of error [19:42:47] yup, will do :) [19:49:36] kraigparkinson: https://plus.google.com/hangouts/_/a9539c089ff0aff0a4b00bceaf75d1cdb75e1614 [19:59:27] hey guys, for the Amsterdam hackathon, are you all doing May 23 - 27th? [20:04:09] I'll be attending that one [20:04:16] awesome average_drifter [20:04:20] I might as well just get tickets now so I know I'll be there [20:04:24] well I registered and set those dates as arrive / depart [20:04:28] at least I won't have visa-related stuff anymore [20:04:47] milimetric: oh cool, is the date still variable or is it fixed ? [20:04:49] drdee, let me know if you wanted to do different dates than above ^ [20:04:58] no, the hackathon is 24,25,26 [20:05:04] nice [20:05:15] so I figure we have to at least get there before and leave after, otherwise we might miss a day [20:05:35] erosen: will you be coming to the a-dam hackathon? [20:06:22] no, he won't [20:06:25] next year :) [20:06:46] oh, re: amsterdam for drdee, dschoon, and ottomata: I'm staying at the hostel [20:07:52] milimetric: yes i;ll be there for sure around that timeframe [20:08:16] awesome [21:08:03] I'm jealous. :) [21:23:58] kraigparkinson: i am in the same hangout [21:24:19] ah, ok be right there. [21:46:20] erosen: processing done [21:46:29] what's the word? [21:46:37] on the bump, that is [21:47:07] ew [21:47:12] heh, sorry, wrong chat [21:47:20] hehe [21:47:30] erosen: http://stat1.wikimedia.org/spetrea/embr_py_reports/r1-mobile.txt [21:48:21] woah, sort of all over the place... 
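
erosen's pointer to collections above applies directly to the rdata structure mobile_report.py builds (first key is time, second key is language). A minimal sketch of that accumulation with defaultdict and Counter; parse_row() below is a placeholder for the squidrow helpers (site(), language, time bucket), not the real code:

    #!/usr/bin/env python3
    # Sketch of the time -> language -> count accumulation discussed above,
    # using collections as suggested. parse_row() stands in for the squidrow
    # helpers (site() == "M", language, month bucket) used in mobile_report.py.
    import sys
    from collections import defaultdict, Counter

    def parse_row(line):
        """Placeholder: return (month, language, is_mobile) for one log line."""
        fields = line.split(" ")
        month = fields[2][:7] if len(fields) > 2 else "unknown"  # e.g. "2012-12"
        lang = "en"        # real code derives this from the request URL
        is_mobile = True   # real code checks site() == "M"
        return month, lang, is_mobile

    def main():
        rdata = defaultdict(Counter)  # rdata[month][lang] -> pageview count
        for line in sys.stdin:
            month, lang, is_mobile = parse_row(line)
            if is_mobile:
                rdata[month][lang] += 1
        for month in sorted(rdata):
            print(month, dict(rdata[month]))

    if __name__ == "__main__":
        main()
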
[21:48:29] 'en': 27278, [21:48:36] erosen: no no, don't look at that one [21:48:38] 'en': 1380657, [21:48:42] k [21:48:42] erosen: just look at november and december [21:48:45] erosen: english wikipedia november => 1380657 [21:49:03] erosen: english wikipedia december => 2059894 [21:49:04] and then 'en': 2059894, [21:49:05] yeah [21:49:09] yes [21:49:48] well, sorry to have led you astray, I'm a bit confused why this didn't show up in the previous charts [21:50:00] * erosen os finding the link [21:50:35] erosen: actually I think it's good that I was able to look at your code [21:50:38] dschoon: can you take a look at kripke configuration to see why http://gp-dev.wmflabs.org/ is what it is [21:50:45] sure [21:50:56] it should be gp, right? [21:50:56] yeah [21:50:59] erosen: and I could confirm this through your codebase as well [21:51:10] but it points to the kripke webserver default [21:51:17] average_drifter: good point [21:51:42] milimetric: do you happen to get a big bump november => december 2012 ? [21:51:47] milimetric: does it show up in limn ? [21:52:03] http://gp-dev.wmflabs.org/ [21:52:08] apache conf was wrong, erosen [21:52:11] probably my fault [21:52:39] dschoon: no worries, thanks for the quick resolution [21:52:43] np [21:52:56] erosen: could it be that the bump was caused by an ad ? [21:53:11] hrm, it's possible [21:53:56] average_drifter: i'm worried this isn't worth the effort, but have we thought about throwing the requests (without the timestamp) into a classifier and trying to predict whether it is pre / post bump? [21:54:01] and then looking at the highly weighted features? [21:54:33] erosen: that is a very interesting idea [21:54:37] very appealing [21:54:42] but what classifier should we use for this? [21:55:07] when all you've have's a hammer, erosen... [21:55:11] ..... [21:55:17] i would have "classified" by URL prefix [21:55:31] and looked at the top prefixes [21:55:37] sorted by weight [21:55:41] dschoon: we have done all of the simple things first [21:55:46] that's fair :) [21:55:49] I can throw everything in a prefix tree [21:55:55] average_drifter: not a bad idea [21:55:58] yeah, that seems right. [21:56:10] I thought Evan meant a classifier as in Naive Bayes [21:56:11] because traffic patterns are strongly based on URL [21:56:12] or some SVM [21:56:17] he did :) [21:56:17] average_drifter: there is also the stanford classifier which is CLI which uses ngrams on tsvs... [21:56:30] yeah i did … hehe [21:57:17] erosen: is there one ready-made that you trust that I could use ? [21:57:30] there's scikit-learn, scipy, numpy [21:57:33] and some other stuff [21:57:39] and I'm sure those have some classifiers in them [21:57:43] which would you use ? [21:57:59] i'm most familiar with the stanford classifier [21:58:07] and I think it is pretty good at dealing with text [21:58:15] it does all of the preprocessing and feature generation [21:58:31] average_drifter: http://nlp.stanford.edu/software/classifier.shtml [21:58:41] reading [21:59:24] isn't it easier and faster to create some histograms? [21:59:56] drdee: maybe, my concern is that we don't know what to compare [22:00:20] afaik, the suspicious request have mime type '-' [22:00:27] interesting [22:00:30] so let's dig into those first [22:00:41] and most of those are search requests [22:00:53] so are you guys filtering for that? 
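
dschoon's suggestion above, bucket requests by URL prefix and compare the top buckets before and after the bump, needs no classifier at all. A sketch, assuming two plain-text files of request URLs (one pre-bump day, one post-bump day); the filenames and prefix depth are arbitrary choices for illustration:

    #!/usr/bin/env python3
    # Sketch of the "classify by URL prefix" idea above: count requests per
    # URL prefix for a pre-bump day and a post-bump day, then print the
    # prefixes that grew the most. Filenames and prefix depth are arbitrary.
    from collections import Counter
    from urllib.parse import urlsplit

    def prefix(url, depth=2):
        parts = urlsplit(url)
        path_bits = [p for p in parts.path.split("/") if p][:depth]
        return parts.netloc + "/" + "/".join(path_bits)

    def count_prefixes(path):
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[prefix(line.strip())] += 1
        return counts

    pre = count_prefixes("urls-2012-11-30.txt")    # hypothetical input files
    post = count_prefixes("urls-2012-12-01.txt")

    deltas = {k: post[k] - pre.get(k, 0) for k in post}
    for k, d in sorted(deltas.items(), key=lambda kv: -kv[1])[:20]:
        print("%-60s pre=%-8d post=%-8d delta=%d" % (k, pre.get(k, 0), post[k], d))
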
[22:01:02] drdee: as a simpler version, what about just hashing the interesting fields from each log line and then finding buckets with a big difference pre and post bump [22:01:21] drdee: will they still have a /wiki/ url? [22:01:34] i think so yes [22:01:53] or at least /w/ [22:01:55] interesting [22:01:59] can you show my your code [22:02:13] so i can read it with an outsider eye? [22:02:19] for sure [22:02:46] drdee: https://github.com/wikimedia/metrics/blob/master/pageviews/embr_py/count.py [22:03:04] this code uses functions implemented in https://github.com/wikimedia/metrics/blob/master/pageviews/embr_py/squidrow.py [22:03:22] specifically, here: https://github.com/wikimedia/metrics/blob/master/pageviews/embr_py/squidrow.py#L431 [22:03:53] this is a copy of Evan's link above ^^ which was used to generate the mobile reports https://github.com/wikimedia/metrics/blob/mobile_pageviews/pageviews/embr_py/mobile_report.py [22:04:09] I mean a copy of count.py [22:06:48] i think you should also count 302 and 304 as pageviews [22:08:06] ugh, i am feeling sicker and sicker. i'm going to take a nap and see how i feel in 2h or so [22:08:23] i swear, all of you remote people have it made. [22:08:27] brb [22:08:40] erosen: you also include all the banners from December [22:08:42] that's from meta [22:08:52] try discarding those [22:08:59] but we filter on wikidpedia.org [22:09:06] wikipedia [22:09:18] on what line? [22:09:32] (didn't see that ) [22:09:45] aah, good point [22:10:18] older versions had it, but now I see that average_drifter's latest version doesn't do that [22:10:27] drdee: good catch [22:10:48] np [22:10:53] so I should add [22:11:05] average_drifter: I think we should add a check: project() == 'wikipedia.org' [22:11:07] and r.project() == "wikipedia [22:11:08] yes [22:11:09] drop meta counts [22:11:11] and rerun [22:11:13] see if bump is gone [22:11:34] average_drifter: fyi, i think squidpy expects the .org to be on there [22:12:15] erosen: oh yeah, I'm gonna use the condition you wrote [22:12:19] k [22:12:26] drdee: which urls are meta ? [22:12:52] meta.wikipedia.org [22:12:52] meta.wikimedia.org [22:12:54] hehe [22:12:56] meta.wikimedia.org [22:12:58] i mean [22:13:04] ok [22:33:38] drdee: https://github.com/wikimedia/metrics/blob/mobile_pageviews/pageviews/embr_py/mobile_report.py#L36 [22:33:42] erosen: ^^ [22:33:47] * erosen clicks [22:34:20] go for it [22:34:25] average_drifter: i think filtering by meta and wikipedia is unecessary [22:34:28] but it can't hurt [22:34:46] also i don't think meta will ever show up as a language [22:35:02] erosen: since it's on wikimedia.org and not wikipedia.org right ? [22:35:03] because it specifically checks whether the language code is in a list [22:35:10] oh, ok [22:35:29] I'm going to do the prefix tree thing also. I'm going to implement a small trie for that [22:35:35] i actually just discovered a small bug the other day which gets confused about the mobile meta site, but this should really matter [22:35:50] erosen: should/shouldn't ? [22:36:00] shouldn't --whoops! [22:36:02] ok [22:36:35] erosen: the java classifier you showed me [22:36:41] erosen: I need to tell it the features I guess [22:36:45] yeah [22:36:48] erosen: and it will figure out the classes ? 
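
erosen's simpler alternative above (hash the interesting fields of each log line, minus the timestamp, and look for buckets that change the most across the bump) could look like this; which columns count as "interesting", and their positions, are assumptions about the log format made only for illustration:

    #!/usr/bin/env python3
    # Sketch of the "hash the interesting fields and diff the buckets" idea.
    # The chosen column positions are guesses about the log layout.
    import hashlib
    from collections import Counter

    FIELDS = (5, 8, 10)  # e.g. status, url, mime type -- positions are guesses

    def bucket(line):
        cols = line.rstrip("\n").split(" ")
        key = "|".join(cols[i] if i < len(cols) else "" for i in FIELDS)
        return hashlib.md5(key.encode("utf-8")).hexdigest()[:8], key

    def bucket_counts(path):
        counts, samples = Counter(), {}
        with open(path) as f:
            for line in f:
                h, key = bucket(line)
                counts[h] += 1
                samples.setdefault(h, key)
        return counts, samples

    pre, _ = bucket_counts("sampled-2012-11-30.log")        # hypothetical files
    post, post_samples = bucket_counts("sampled-2012-12-01.log")

    grew = sorted(post, key=lambda h: post[h] - pre.get(h, 0), reverse=True)[:20]
    for h in grew:
        print(post[h] - pre.get(h, 0), post_samples[h])
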
[22:36:48] those are set in a prop file [22:37:00] oh it expects the labels to be one of the columns [22:37:05] which is also set in the prop file [22:37:41] like this: goldAnswerColumn=0$ [22:37:47] minus the dollar [22:38:02] goldAnswerColumn=0 [22:38:52] average_drifter: I'm going to do the prefix tree thing also. I'm going to implement a small trie for that [22:39:00] is that really necessary? [22:39:36] i find it hard to believe that we need to go advanced math to detect such a massive mismatch in page view numbers [22:41:26] drdee: ok [22:42:09] drdee: I am unsure if to view the problem as easy, medium or hard at this stage [22:42:19] but I do want to solve it [22:42:32] let's really focus on the basics, filter out the banners, the search requests and the api requests then we should really get something reasonable, [22:42:51] if not then let's do more histograms of pre-post situation [22:43:01] ok [22:43:13] we can even do the histograms based on two days of data [22:43:18] dec 1 and dec 30 [22:43:21] or whatever [22:51:40] ok we'll do histograms [22:51:47] how can I filter out banners ? [22:57:40] drdee: https://github.com/wikimedia/kraken/blob/master/kraken-generic/src/main/java/org/wikimedia/analytics/kraken/pageview/Pageview.java#L197 [22:57:50] if (this.url.getQuery() != null && this.url.getQuery().contains("BannerLoader")) { this.pageviewType = PageviewType.BANNER; [22:57:59] so this is how a banner url looks like [22:58:19] I can add that logic to mobile_report.py [22:58:58] yes just check for BannerLoader in the query string of the url [23:00:31] average_drifter: let me know if you want some help with modifying the python script, there is a good chance I have convenience functions for some of it [23:03:46] erosen: is this done: "rsync files at stat1:/home/erosen/tmp/wep_fall_2011 to dataset2:/../public/global_dev/" [23:04:15] drdee: i'm not aware of this request [23:04:17] so no.. [23:04:18] but I can [23:04:25] what is the story? [23:04:27] drdee ^^ [23:04:28] you created the task [23:04:32] https://app.asana.com/0/1057996312637/1588607345071 [23:04:46] mark as done? [23:04:51] yeah [23:04:52] definitely [23:04:54] cool [23:04:57] that was from way back [23:05:02] sorry for leaving it unattended [23:06:54] np [23:07:42] average_drifter: can I help still? I saw you guys had a problem with gp-dev that dschoon fixed [23:07:58] milimetric: i think everything is fine [23:08:06] ok [23:10:30] erosen: do you remember what this is: community-analytics.wikimedia.org [23:10:43] which codebase does that refer to? [23:11:55] ? [23:12:08] i'm not familiar with community-analytics.wikimedia.org [23:13:40] rfaulkner? [23:13:50] ^^ [23:13:51] drdee [23:13:55] community-analytics.wikimedia.org [23:13:59] does that ring a bell? [23:14:02] ah [23:14:10] or is that metrics api? [23:14:15] but like very old [23:14:19] yeah hat's obsolete let me take a look at what's in there [23:14:27] no not metrics-api [23:14:35] https://app.asana.com/0/1057996312637/1722220877668 [23:15:14] ok i am closing that task [23:16:29] looks like i don't have access to view the task [23:16:44] but community-analytics.wikimedia.org can be removed [23:18:39] ok [23:18:39] done [23:18:39] ty [23:20:48] np
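
The Java check quoted above treats a request as a banner when its query string contains "BannerLoader". The equivalent check on the Python side, as discussed for mobile_report.py, mirroring the null check on the query string:

    #!/usr/bin/env python3
    # Python equivalent of the BannerLoader check quoted above from
    # Pageview.java: a request is a banner request when its query string
    # contains "BannerLoader".
    from urllib.parse import urlsplit

    def is_banner_request(url):
        query = urlsplit(url).query
        return bool(query) and "BannerLoader" in query

    if __name__ == "__main__":
        print(is_banner_request(
            "http://meta.wikimedia.org/w/index.php?title=Special:BannerLoader&banner=X"))  # True
        print(is_banner_request("http://en.wikipedia.org/wiki/Main_Page"))                 # False
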