[01:40:31] average_drifter: around? [01:46:04] drdee: I'm around, working on wikistats to finish and get the reports :| [01:46:27] I will get to the webstatscollector [01:48:14] ok I just got the reports (x1000) [01:48:20] they have astronomic numbers :( [01:55:32] only good news is I see a constant factor of 7 by which the numbers differ [01:55:55] (the WikiReports.pl numbers compared to the new mobile pageviews report) [02:04:32] I don't know where 7 comes from, I can only guess [02:07:20] but how is this possible? you said the numbers would be the same as revision 41 [02:09:10] drdee: they are, the very same ones, but then again, I don't know all the logic in WikiReports.pl [02:09:48] ez just said to pass it a csv with the needed data and format, that's what I did [02:10:35] can you update your network flow diagram so it reflects the current source code? [02:10:52] yes [14:10:28] ooh, quiet channel this morning [14:10:39] morning everyone :) [14:21:28] morning! [14:36:50] mooooorning [14:46:55] hey drdee [14:47:05] hey [14:47:05] I've changed the pageview logic to handle casing consistently [14:47:12] this might affect other things, though all the tests pass [14:47:29] basically, new Pageview(...) makes an object with all fields lowercase [14:48:06] then any checks for any string compare against lowercase strings [14:48:15] ok, but if we ever create a canonicalTitle function then it should keep the original casing [14:48:16] so I made .contains("get") instead of "GET" [14:48:30] yes! [14:48:33] cool [14:48:38] we can keep original casing in separate fields [14:48:44] like originalMethod [14:48:52] ottomata, am looking at your site [14:48:56] if we need to check anything with case sensitivity [14:49:07] cool [14:49:16] Storm King’s Eye <- what was that? [14:49:26] ah, ha [14:49:26] average_drifter: around??
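The casing change described above (lowercase everything at construction, keep the originals in separate fields for case-sensitive checks) could be sketched like this. This is a Python stand-in for the actual Java Pageview class, with illustrative names:

```python
class Pageview:
    """Sketch of the casing approach: normalize fields to lowercase at
    construction, keeping the original value in a separate field in case
    a case-sensitive check is ever needed."""

    def __init__(self, method, title):
        self.original_method = method   # preserve original casing
        self.original_title = title
        self.method = method.lower()    # all comparisons use lowercase
        self.title = title.lower()

    def is_get(self):
        # compare against a lowercase literal: "get" instead of "GET"
        return "get" in self.method

pv = Pageview("GET", "Main_Page")
```

A canonicalTitle-style function would then read from `original_title`, not the lowercased field.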
[14:49:32] an art piece here: [14:49:32] http://www.stormking.org/ [14:50:21] andrew's abstract interpretation of Hurricane Sandy staring down at him [14:50:22] :) [14:52:21] haha [14:55:59] hey ottomata, since last night I've been getting job failed when running pig [14:56:00] Failed to read data from "hdfs:///wmf/raw/webrequest/webrequest-wikipedia-mobile/2013-03-22_16.30*" [14:56:14] and these are paths I used to read from fine before last night [14:56:24] hm [14:57:05] i think that's not your error [14:57:20] it sometimes says that when something else is the source of the error [14:57:43] that may be but my job dies :) [14:57:54] this one? [14:57:54] http://localhost:19888/jobhistory/logs/analytics1020:8041/container_1364239892421_2625_01_000002/attempt_1364239892421_2625_m_000000_0/dandreescu [14:58:06] 2013-03-29 14:55:14,527 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException [14:58:06] at org.wikimedia.analytics.kraken.pageview.Pageview.<init>(Pageview.java:105) [14:58:06] at org.wikimedia.analytics.kraken.pig.PageViewFilterFunc.exec(PageViewFilterFunc.java:94) [14:58:06] at org.wikimedia.analytics.kraken.pig.PageViewFilterFunc.exec(PageViewFilterFunc.java:74) [14:58:21] all of them [14:58:27] oh! [14:58:29] what? [14:58:32] how do i see that... [14:59:16] i gotta start that ktun thing? [14:59:56] yeah [15:00:03] ktunnel jobs [15:00:06] ktunnel history [15:00:09] you'll need those two [15:00:10] and [15:00:23] you'll have to change any urls to 'localhost' [15:00:24] milimetric have you pushed that code? [15:00:24] any domains [15:00:28] it will redirect you a few times [15:00:37] haven't pushed the code, no [15:00:51] ktunnel gives me permission denied for jobs or histories [15:00:59] can you paste (PageViewFilterFunc.java) in a gist?
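The NullPointerException above comes from the Pageview constructor hitting a null field, and the fix discussed later is to guard each field before using it. A rough Python analogue of that guard (the real fix lives in the Java Pageview/PageViewFilterFunc code; the function and field names here are illustrative):

```python
def accepts(url, referer, user_agent):
    """Guard every field before doing string operations: a missing
    referer (the culprit in this case) should make the filter answer
    'no' instead of raising, the way the Java code threw an NPE."""
    if url is None or referer is None or user_agent is None:
        return False
    # placeholder for the real pageview test
    return "/wiki/" in url
```

With this shape, a log line with a null referer is simply filtered out rather than killing the whole Pig job.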
[15:01:04] sure [15:01:07] hm [15:01:20] I see, the clue is tiny in the pig output but it's there: SimplePigStats - ERROR: null [15:01:35] ahhh your use is different, isn't it [15:02:00] oh, hm [15:02:03] ok milimetric [15:02:03] do this [15:02:14] ktunnel jobs dandreescu@analytics1001.wikimedia.org [15:03:24] thanks ottomata, it works to analytics1002 [15:03:56] i must've been issuing the command wrong [15:04:23] but got confused 'cause it was working for "hue" instead of "jobs" [15:04:32] yeah, that is weird [15:04:41] if it works for one it should work for all [15:09:50] the error's simple to fix drdee, I'm just adding checks for if Pageview fields are null (referer was causing the problem in this case) [15:10:06] k [15:27:11] btw, don't know if you guys knew this - if you overwrite the jars that you're *already* registered in pig, it'll use the new jars [15:27:30] so you can just rsync them, and re-issue a dump my_result; and you'll be in business [15:27:49] oh while you are in grunt? [15:49:45] yes, ottomata, while in grunt [15:50:59] cool [16:09:11] dschooooon, morning, who want to help me make a python loopy a bit more elegant? [16:24:19] who's got python fu? [16:25:37] a bit [16:26:08] https://gist.github.com/ottomata/5271910 [16:26:19] i'm sure that function can be waayyy simplified [16:26:19] get_udp2log_ports [16:26:50] cmd will be the udp2log daemon command as an array [16:27:00] ['/usr/bin/udp2log', '--config-file=/etc/udp2log/wmvm', '--daemon', '-p', '8420', '--recv-queue=524288', ''] [16:27:04] i want 8420 [16:27:59] does 8420 always have the same index in that list? [16:28:24] not necessarily, n [16:28:26] I guess not looking at the gist [16:28:27] no [16:29:02] are you always looking for 8420? [16:29:37] join the list and use a regex? 
[16:30:46] no [16:30:48] could be anything [16:30:53] i want the listen port of udp2log processes [16:31:01] running udp2log processes [16:31:53] why not join the list and do a regex for '-p \d\d\d\d' or something like that? [16:32:17] ugh [16:32:26] my head hurts so bad [16:32:43] are you capable of having our meeting now? [16:32:51] sure. [16:32:59] ok [16:33:11] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [16:34:13] brb lunch and I can look at your python a bit ottomata [16:34:45] hm, drdee, not sure if that is better [16:34:53] it would be fine [16:35:04] but, i kinda like the index thing, doesn't use regex, etc. [16:35:14] buuuuuut, what i was hoping for was a list comprehension cool thing or something [16:35:17] which I don't know very well [16:35:30] for example [16:35:38] ori does this to get the pids that match pattern [16:35:38] return [pid for pid in iter_pids() if pattern in get_cmd(pid)[0]] [16:35:44] i'd like to do the same [16:35:51] except return the ports instead [16:36:29] what is this about? [16:36:32] i thought about list comprehension but there is a lot of cleaning up that happens in the function [16:36:35] (brb a moment) [16:36:42] not sure if you can cram that in a list comprehension [16:38:01] average_drifter: where art thou? [16:38:01] dschoon, ori has an awesome python ganglia module to aggregate running udp2log process udp stats [16:38:08] but. it uses /proc/<pid>/fd [16:38:11] which is readable only by root [16:38:15] and ganglia does not have root perms [16:38:30] so, instead of looking up the inodes of the open sockets in /proc/<pid>/fd [16:38:48] and matching those against /proc/net/udp [16:39:46] i'm extracting the port from the running command [16:39:55] and looking for that in /proc/net/udp [16:40:16] i just feel like my function is longer than it needs to be, and was looking for python fu to elegantize it :) [16:43:20] actually if it works then that's good enough, right?
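The "join the list and regex it" suggestion could be sketched like this. It is not the code from ottomata's gist, just one way the idea might look, including the list-comprehension flavor ottomata was asking for:

```python
import re

def get_udp2log_ports(cmds):
    """Pull the '-p PORT' value out of each running udp2log command line.
    Joining the argv list first means the flag's position in the list
    doesn't matter; commands without -p are simply skipped."""
    ports = []
    for cmd in cmds:
        match = re.search(r'-p\s+(\d+)', ' '.join(cmd))
        if match:
            ports.append(int(match.group(1)))
    return ports

def get_udp2log_ports_lc(cmds):
    """The same thing as a list comprehension, in the style of ori's
    pid lookup quoted above."""
    matches = (re.search(r'-p\s+(\d+)', ' '.join(cmd)) for cmd in cmds)
    return [int(m.group(1)) for m in matches if m]

cmds = [['/usr/bin/udp2log', '--config-file=/etc/udp2log/wmvm',
         '--daemon', '-p', '8420', '--recv-queue=524288', '']]
```

Either version returns `[8420]` for the example command above; the found port can then be looked up in /proc/net/udp.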
[16:43:28] drdee: here [16:43:49] status of webstatscollector segfault fix and wikistats report? [16:44:15] in progress [16:44:18] but this is ready https://github.com/wikimedia/metrics/tree/master/pageviews/new_mobile_pageviews_report [16:46:00] yeah it works, it just looks so ugly next to ori-l's code [16:46:02] :p [16:46:45] ottomata: I can write that in one line :D [16:46:45] send me a link [16:46:48] just kidding [16:46:49] i'll take a look [16:46:51] average_drifter: is the new report still running? [16:47:08] drdee: nope, currently there's a factor of x7 which I cannot account for [16:47:23] so I'm blocked again and need to ask Erik what the logic in WikiReports.pl is [16:48:01] did you ask him already? [16:48:16] no, I'm writing him an e-mail now [16:48:26] CC me and Kraig as well [16:48:28] ok [16:48:41] ugh, ok [16:48:45] i am useless [16:49:10] my head is killing me. i have to lay down. i'll try again around lunch [16:49:23] fucking migraines [16:51:24] *dies* [16:51:37] before i pass out, a quick update [16:51:44] my concat script using hadoop streaming worked [16:52:14] i need to check on it, but i launched a new device props job with it, last night [16:52:28] if that worked, huzzah, 61 is done [16:52:35] and i'll show dan how to update his job for 92 [16:52:52] i need to test my script for sessions, but i think it's basically done [16:53:01] they're all checked in, so feel free to poke around [16:53:03] bbl [16:53:05] * dschoon dies [16:54:37] average_drifter: can you please make it to the scrum in 6 minutes? [16:58:14] awesome, feel better dschoon! [17:00:42] scrum my!!!
ottomata, milimetric, average_drifter [17:02:30] ottomata, milimetric, average_drifter ^^ [17:02:40] we're in scrum [17:02:43] we're here [17:02:43] crazy peoples drdee [17:02:48] the usual [17:02:49] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [18:06:53] drdee: ok, solved the x7 factor, but the numbers are still off [18:07:11] drdee: milimetric pointed out I had a bug (x7 lines in the CSV duplicated) [18:07:22] solved the bug [18:07:30] ok let me just upload so you can see it now [18:07:36] yeah, let's take a look [18:10:22] http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r45-wikireport-x7-bug-solved/out_sp/EN/TablesPageViewsMonthlyMobile.htm [18:10:55] compared to http://stats.wikimedia.org/EN/TablesPageViewsMonthlyMobile.htm [18:16:01] nice work milimetric! [18:16:07] ty [18:16:24] i am ready for drafting email if you want to [18:17:18] ok, so average_drifter, drdee, are these differences still cause for concern? [18:17:29] or just to be expected from a more conservative pageview logic? [18:17:45] well the numbers won't match by definition but they should be in the same ballpark [18:20:43] drdee: should I still send the e-mail about the difference ? [18:20:47] I can send it [18:22:23] no, let's first look at your flow diagram [18:22:56] ok [18:30:34] the numbers seem in the same ballpark to me, and if you're not including bots or something like that, it would certainly account for the difference [18:52:53] ottomata: couldn't get your gist to work because I think it has some indexing problems with the cmd variable.
But here's an example of list comprehension that's hopefully similar enough: [18:52:53] https://gist.github.com/milimetric/5272159 [19:13:45] mmm, ok thanks [19:13:49] moving to a cafe [19:13:51] be back on in a bit [19:14:03] (well, leaving in 10) [19:16:00] average_drifter: i updated the flow diagram [19:16:08] make sure your implementation accurately reflects this [19:16:16] then rerun your report [19:16:31] i removed bot filtering to make the numbers match webstatscollector logic more [19:22:20] average_drifter: ^^ [19:23:08] drdee: looking [19:25:35] brb [19:44:54] drdee: http://(wikiproject).m.wiki*.org/w/api.php?action=(mobile)view [19:45:05] drdee: does that mean the "mobile" is optional ? [19:45:09] yes [19:45:17] ok [19:45:18] there are two valid api actions [19:45:20] view and mobileview [19:45:32] ok, makes sense [19:59:38] ottomata: can you merge https://gerrit.wikimedia.org/r/56633 [20:02:49] done! [20:02:51] thx! [20:16:29] drdee: no bot detection ? [20:17:58] no. we must mimic webstatscollector logic; that way the numbers will be much more comparable [20:18:42] whiiiich reminds me [20:18:46] average_drifter :) [20:18:50] how goes that webstatscollector bug? [20:19:40] ottomata: unfortunately have to fix this problem with the reports first [20:19:58] I'll get a 7h window while they run, so I'll fix it then [20:20:06] sorry for the delay [20:20:40] no probs [20:52:43] ok, so [20:52:55] the reason we're getting such low numbers on wmf mobile apps [20:53:00] ottomata and drdee [20:53:31] is that all the /w/api.php URLs that we have with WikipediaMobile/.*Android [20:53:42] have ?.*search in them [20:53:57] and we're not supposed to count search [21:05:28] sorry guys [21:05:32] internet is too flaky for hangout it seems [21:13:16] ottomata [21:13:29] https://gerrit.wikimedia.org/r/#/c/52606/ has been merged [21:13:38] do we need to monitor this in kraken [21:13:43] ?
[21:13:58] uh, monitor, no [21:14:04] but it will break any existing scripts that used X-CS [21:14:08] particularly the zero ones [21:14:24] oh [21:14:27] it's backwards compatible? [21:14:41] that's what i meant by monitoring :) if the scripts use the pig udf that i wrote then it will not break [21:15:04] Ungh, i thought I commented [21:15:08] why are we calling the key zero? [21:15:10] ahhh foo [21:15:16] » » set req.http.X-Analytics = "zero=" + req.http.X-CS; [21:15:36] really? hm, i guess it is the same field [21:15:46] but the field passed to your udf will not just be the X-CS [21:15:50] it will be the key=value pair [21:16:20] oh, i see you responded to my comment [21:16:21] ung [21:16:34] mnc would've been better :/ [21:18:05] i agree the name is dumb. anyway, it's also handled in the zero UDF [21:18:14] so we should watch to make sure it works [21:18:31] but otherwise nothing should change in the results [21:19:03] (naming is hard. this is why i prefer shopping) [21:19:21] wait until you get a baby :D [21:19:25] i like naming [21:21:30] i can't believe my head STILL hurts [21:29:50] brb [21:34:30] here you go ottomata, not sure it's much prettier than yours but it's shorter: https://gist.github.com/milimetric/5272159 [21:34:59] nice! [21:35:01] it's also probably less efficient because it has to find the index of -p twice [21:35:56] and yes, zcat | grep is slow as balls when it's looking at a lot of data [21:35:59] :) [21:48:38] Cooool, check it out duuuudes [21:48:39] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=udp2log+packets+dropped&vl=&x=&n=&hreg%5B%5D=locke%7Cemery%7Coxygen%7Cgadolinium&mreg%5B%5D=drops&glegend=show&aggregate=1 [21:49:05] VERY NICE! [21:49:46] there are a bunch more stats than that oo [21:49:46] too [21:49:54] i'd like to experiment with making a udp2log view [22:02:49] drdee: can I make wiki* more precise ?
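Since Varnish now sends the key=value pair ("zero=" + X-CS) rather than the bare X-CS value, any consumer has to split the X-Analytics field first. A minimal sketch of that parsing (the production handling is in the zero Pig UDF; the semicolon separator for multiple pairs is an assumption based on X-Analytics being a kv-field):

```python
def parse_x_analytics(header):
    """Split an X-Analytics value like 'zero=410-01' (or several
    semicolon-separated key=value pairs) into a dict, so the old
    raw X-CS value is reachable as parsed.get('zero')."""
    pairs = (kv.split('=', 1) for kv in header.split(';') if '=' in kv)
    return {k.strip(): v.strip() for k, v in pairs}
```

A script that used to read X-CS directly would call `parse_x_analytics(header).get('zero')` instead, which is the backwards-compatibility concern raised above.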
[22:02:57] I mean what you wrote in the pageview definition [22:03:02] can I just make an enumeration like [22:03:03] wikibooks [22:03:05] wikinews [22:03:05] etc [22:03:06] ? [22:03:13] sure have a look at [22:03:15] 1 sec [22:03:29] https://raw.github.com/wikimedia/metrics/master/pageviews/kraken/pageview_base.png [22:03:34] looking [22:03:37] and look at the big purple box in the middle [22:03:41] that contains a list of all the wikis [22:04:03] oh cool [22:04:06] I'll use those [22:40:50] question [22:40:53] drdee: I have a question [22:41:15] hey guys [22:41:19] it's not 100% that action=(mobile)view will be the first key/value pair right ? [22:41:22] drdee: ^^ [22:41:38] not always [22:43:07] was talking with mako about adding some log fields for squids, etc. by adding a header in php. for the kind of log entries that are used by http://stats.grok.se/ [22:44:19] in particular I was thinking we could have the page id that was requested (i.e. before resolving redirects) and the page id that was actually served (possibly after redirect resolution) [22:44:57] I think he was saying that some of the most popular pages are less popular @ canonical names than through redirects [22:45:18] jeremyb_: ok, IMHO, please keep in mind that adding fields to the squid logs will mean lots of updating for multiple codebases [22:45:21] just saying [22:45:27] i know [22:45:39] haven't really looked into it much but i have read some of asher's mails about log formats [22:45:39] ok, just wanted to put that out there [22:46:04] i was thinking something like base64 encode the curids and maybe leave one out if the two are identical to each other [22:46:45] anyway, would be nice to get some comments/thoughts on that :) [22:47:13] (i.e. identical means not a redirect) [22:47:43] also, i wonder if the stats include 404s? [22:47:47] my thinking at the moment is that if we do that, we'll have to update like 7-8 code bases, and if we're not careful that can result in bugs.
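Making wiki* precise by enumerating projects, as floated above, might look like the following sketch. The canonical list is the big purple box in pageview_base.png; the set below is an assumption listing the common Wikimedia project names:

```python
# Assumed project list; the canonical set lives in the pageview_base diagram.
WIKI_PROJECTS = {
    'wikipedia', 'wiktionary', 'wikibooks', 'wikinews', 'wikiquote',
    'wikisource', 'wikiversity', 'wikivoyage', 'wikimedia',
}

def project_of(host):
    """Return the project a hostname like 'en.m.wikibooks.org' belongs to,
    or None if no enumerated project matches any hostname component."""
    for part in host.lower().split('.'):
        if part in WIKI_PROJECTS:
            return part
    return None
```

Compared with a wiki* glob, the enumeration can't accidentally match unrelated hosts that merely start with "wiki".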
I would not do that now because we have a lot to deliver and quite tight on schedule [22:48:17] well is there some public info about what you're working on now/what the schedule is? [22:48:39] i was assuming these fields would be added on the end (after all current columns). to not break so much stuff [22:49:05] how many of those 7-8 codebases are puppetized? how many are tested in labs? [22:49:07] jeremyb_: yes, and we have a minimum fields constraint, and if you add a field than we'd have to readjust that and so forth [22:49:12] how many are documented? :) [22:49:46] s/than/then/ [22:50:09] i forget, where are you? canada? [22:50:21] I don't think my location matters :) [22:50:55] ottomata, can we close https://rt.wikimedia.org/Ticket/Display.html?id=4730 ? [22:51:01] heh [22:52:09] jeremyb_: about what you wrote above. AFAIK some of our codebases are puppetized, most are documented, some are in the process of being documented [22:52:22] jeremyb_: as we speak :) [22:52:31] drdee: yes think so, but there is probably work to do to make it prettier [22:52:32] but ja close [22:52:47] oh, i dunno about the python env stuff he needs [22:53:01] 00:47 < jeremyb_> also, i wonder if the stats include 404s? [22:53:39] yes they do [22:53:56] jeremyb_: is there a reason to count 404 ? we did that, but that was for diagnostics of a nasty problem we had [22:54:01] jeremyb_: http://stat1.wikimedia.org/spetrea/embr_py_mobile_pageviews/r3-resized-charts/chart_status_code=404.png [22:54:14] jeremyb_: ok, I think drdee knows more about this [22:56:15] average_drifter: i'm not saying we should. i'm wondering if we do [22:56:16] :) [22:57:35] whoa [23:07:40] jeremyb_: so this is a feature request :) [23:08:31] jeremyb_: maybe it's been discussed on the mailing list or on github issues, or in mingle ? [23:08:39] which? 
[23:09:00] jeremyb_: your request to add fields [23:09:19] jeremyb_: and the question about counting 404s [23:09:20] i guess it may not have been discussed beyond grendel's [23:09:40] 404s is not a feature request. just a question [23:24:25] average_drifter: which github issues did you have in mind? [23:24:45] is there a central list of all 7-8 codebases? [23:31:23] jeremyb_: what exactly is your goal? [23:31:27] counting status codes? [23:32:55] jeremyb_: https://github.com/wikimedia/kraken ; https://github.com/embr/squidpy ; https://github.com/wikimedia/analytics-wikistats ; gerrit.wikimedia.org:29418/analytics/webstatscollector ; gerrit.wikimedia.org:29418/analytics/udp-filters.git ; [23:32:57] no [23:33:03] jeremyb_: there are more than these ones above [23:33:07] I don't know all of them [23:33:36] jeremyb_: but if you want fields added, you should inform the analytics mailing list, so if they do get added, everybody gets a chance to update their code [23:33:38] well, average_drifter, we *do* want to consolidate those over time [23:33:39] dschoon: the goal is to be able to aggregate redirects into a grand total sum for a given article. [23:33:43] ah [23:33:46] yeah, interesting [23:34:02] dschoon: sure, I was just saying that everyone should get notified of this change [23:34:06] doesn't that require access to a snapshot of the db at the time of the redirect, due to transclusion? 
[23:34:15] because redirects aren't delivered as a 302 [23:34:18] i don't follow [23:34:21] no they're not [23:34:25] redirects are a 200 [23:34:42] https://www.mediawiki.org/wiki/Analytics/Kraken/Request_Logging [23:34:49] "(Redirected from Analytics/Kraken/Request Logging)" [23:34:55] right [23:34:58] that's a 200 [23:35:01] right [23:35:08] so you can't know from a logline if it's a redirect [23:35:14] that's the point [23:35:15] it requires checking a table in the db [23:35:21] and because the db can change [23:35:24] it requires a static snapshot [23:35:28] again, that's the point [23:35:30] okay. [23:35:38] i'd call that "totally intractable", heh [23:35:48] oh [23:35:49] erm? [23:35:54] you're suggesting a header, aren't you? [23:36:00] (sorry, i didn't read the backchat) [23:36:13] (apologies for being slow. i also had a headache for most of today) [23:36:23] 29 22:43:07 < jeremyb_> was talking with mako about adding some log fields for squids, etc. by adding a header in php. for the kind of log entries that are used by http://stats.grok.se/ [23:36:27] yep. [23:36:27] 29 22:44:19 < jeremyb_> in particular I was thinking we could have the page id that was requested (i.e. before resolving redirects) and the page id that was actually served (possibly after redirect resolution) [23:36:31] np :) [23:36:35] yep. [23:36:38] that would work. [23:36:51] we have been trying to aggregate such information into one header, X-Analytics [23:36:54] which is a kv-field [23:37:09] i think your idea is a good one [23:37:10] right [23:37:24] and i agree it'd require two fields [23:37:25] as i said above i've read some of asher's mails, etc. [23:37:32] coolio. [23:37:48] we've been ... working with ... mingle as of late [23:37:55] anyway, i can't really flesh it out into a full proposal right now [23:37:57] can i interest you in writing up the use case? [23:37:59] okay. [23:38:03] heh [23:38:04] remind me next week?
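With the two proposed fields in place — the requested page id and the served page id — jeremyb_'s grand total per article becomes a simple group-by on the served id. A hypothetical sketch (the field layout is invented for illustration; nothing like this exists in the logs yet):

```python
from collections import Counter

def views_per_article(records):
    """Credit each hit to the page actually served, so views that came
    in through redirects count toward the canonical article.
    records: iterable of (requested_id, served_id) pairs; the two ids
    are equal when the request was not a redirect."""
    totals = Counter()
    for _requested_id, served_id in records:
        totals[served_id] += 1
    return totals
```

For example, two hits on a redirect to page 1 plus one direct hit would all roll up under page id 1, which per-URL-path counting cannot do.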
[23:38:15] i think it's important to capture this stuff [23:38:19] it would go into mingle first? not onlist? [23:38:24] *shrug* [23:38:26] it matters little [23:38:32] mingle is the ultimate destination [23:38:41] it's the archive of stories. [23:38:53] well the use case is just to get a grand total count per article instead of per URL path [23:39:07] 29 22:44:57 < jeremyb_> I think he was saying that some of the most popular pages are less popular @ canonical names than through redirects [23:39:11] right [23:39:17] *nod* [23:39:23] i think that's a great idea [23:39:30] and i think seeing that information on-wiki would be valuable [23:40:29] (i mention that only because software trust means the data couldn't be pulled from a 3rd party tool, like stats.grok.se) [23:40:42] i don't follow [23:41:35] if we wanted to show those stats on an article's special page [23:41:43] we would need a trusted source of data that we control [23:41:48] otherwise it's a security concern [23:42:06] (we've discussed this internally several times, about pulling in data from stats.grok.se, and decided we couldn't risk it) [23:42:45] (even though it would probably positively impact edit rates) [23:48:15] have a good weekend all! [23:48:21] you too, too