[03:59:40] erosen: wow
[03:59:49] erosen: you're runnin some heavy stuff on stat1
[03:59:57] erosen: what's it doing?
[04:00:10] just asking because I'm running some stuff also
[05:19:32] average_drifter: sorry about that
[05:20:05] was making a bunch of api requests
[06:35:41] hey milimetric
[13:27:30] morning everyone
[13:30:15] hello milimetric
[13:30:24] morning Yuvi
[13:31:10] A few people working with the Center for Internet and Society in India doing Wiki work asked me about how the data is gathered for reportcard.wmflabs.org
[13:31:15] I pointed them here, should be here in a while I guess
[13:31:38] milimetric: what is your TZ, btw?
[13:32:03] EST
[13:32:08] (New York / Philadelphia)
[13:32:17] ah
[13:32:23] oh cool, thanks!
[13:32:32] yeah, I basically compile that every month
[13:32:54] milimetric: is it picking data up from our internal servers?
[13:33:00] or is the base data publicly available?
[13:33:16] Erik Zachte compiles the actual data
[13:33:43] I'm not as familiar with how that process works, but drdee can tell you about it in detail
[13:34:02] sorry - I was a bit distracted
[13:34:19] sweet
[13:34:21] so Erik takes data from Comscore and our own wikistats pageview data
[13:34:31] comscore? I wasn't aware of that..
[13:34:34] and then I run it through some processing and put it up on reportcard
[13:34:59] Comscore is a third party that estimates unique users for every major site on the internet
[13:35:10] yeah, but I didn't know we used them.
[13:35:10] they're not very accurate but they're consistent, so everyone uses them
[13:35:29] so they offered us their data for free (it usually costs $)
[13:35:35] ah
[13:35:41] and we're using them until we can set up something better of our own
[13:36:54] ah
[13:36:54] nice
[13:37:04] and the wikistats are the same ones that are available on dumps, I suppose
[14:50:28] mooooorning
[14:59:52] morning!
[15:00:31] mo00ORNING!
[15:07:12] today is REINSTALL day
[15:07:18] :D
[15:07:22] morning
[15:07:39] hi erosen - we have default colors finally :)
[15:07:59] and proper ticks for time spans < 1 month
[15:08:16] yay!
[15:08:19] awesome
[15:08:30] gonna deploy to gp-dev in a sec
[15:08:33] can we redeploy gp-dev without much trouble?
[15:08:34] nice
[15:09:03] fab gp_dev deploy baby :)
[15:09:37] footnote: I spent some years growing up around Detroit so I use the word baby weird
[15:09:49] hehe
[15:10:10] i guess I also spent enough time around detroit for that usage to seem normal, so don't worry
[15:10:22] it is hard to communicate intonation over IRC though
[15:10:34] drdee_ I don't understand what EZ did in that email because he's many levels above me in grep and stuff. But I still don't see any spaces in that file.
[15:10:50] true erosen
[15:11:02] that's funny cause i wanted to use the word baby yesterday in a pm to erosen but i didn't
[15:11:14] but now the air is clear
[15:11:21] so it's BABY TIME
[15:11:26] :D
[15:11:28] hehe
[15:11:37] LOL
[15:11:50] hahah
[15:16:04] erosen: http://gp-dev.wmflabs.org/datasources
[15:16:16] now with colors
[15:17:22] awesome
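An aside on the "proper ticks for time spans < 1 month" change mentioned above: with the d3 v3-era time APIs Limn was built on at the time, switching tick density by visible span looks roughly like this. A minimal sketch; the scale, the one-month threshold, and the formats are illustrative, not Limn's actual code.

```js
// Assumes d3 v3 is loaded. xScale, the threshold, and the formats are
// hypothetical stand-ins for whatever Limn actually uses.
var xScale = d3.time.scale()
    .domain([new Date(2013, 0, 1), new Date(2013, 0, 20)])  // a <1 month span
    .range([0, 600]);

var spanMs = xScale.domain()[1] - xScale.domain()[0];
var xAxis = d3.svg.axis().scale(xScale).orient("bottom");

if (spanMs < 31 * 24 * 3600 * 1000) {            // less than ~a month visible
  xAxis.ticks(d3.time.days, 2)                   // a tick every other day
      .tickFormat(d3.time.format("%b %d"));
} else {
  xAxis.ticks(d3.time.months, 1)                 // monthly ticks otherwise
      .tickFormat(d3.time.format("%b %Y"));
}
```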
[15:18:35] milimetric: now for the last step in making these big datasources useful: do you have any ideas for making it easier to associate a line with the column name?
[15:18:54] as it stands you have to infer the name based on the value, or use the colors
[15:19:12] yeah, totally, maybe when you hover over the legend the line thickens a bit
[15:19:34] or gets some line markers
[15:20:05] you can sort of do it now by toggling metrics on and off
[15:20:11] good point
[15:20:19] but that's a poor excuse
[15:20:37] I'll try a few things and see what looks good
[15:20:48] i think being able to look up in the opposite direction might be useful too
[15:20:55] and add an option to the legend node that's on by default for the Quick Peek thing
[15:20:59] like hovering over the line and the name thickens or hovers
[15:21:17] that seems reasonable
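To make the hover idea just discussed concrete, here is a minimal d3 sketch: hovering a legend entry thickens the matching line, so a series can be tied back to its column name. The selectors and the `name` field are hypothetical; Limn's real node structure may differ.

```js
// Assumes d3 is loaded and each legend entry / line carries a bound datum
// with a `name` field -- an assumption, not Limn's actual data model.
d3.selectAll(".legend-item")
    .on("mouseover", function (entry) {
      d3.selectAll(".metric-line")
          .style("stroke-width", function (line) {
            // thicken only the line whose name matches the hovered entry
            return line.name === entry.name ? 4 : 1.5;
          });
    })
    .on("mouseout", function () {
      // restore the default stroke width when the pointer leaves
      d3.selectAll(".metric-line").style("stroke-width", 1.5);
    });
```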
[17:18:23] milimetric, ping.
[17:18:42] hi geohacker!
[17:18:43] :)
[17:19:25] milimetric, hey hey. Good to catch you here.
[17:19:53] milimetric, hope Yuvi gave you an introduction.
[17:19:55] yeah, I'm on EST so I'm usually here from about 8:00am to 7pm
[17:20:04] yes, he mentioned he was sending people our way :)
[17:20:14] milimetric, ah cool :)
[17:20:32] I'm in the IST, the Indian Stretchable Time, you know :P
[17:20:46] haha, nice
[17:20:49] so how can I help
[17:21:16] alright. so I'm going to take a deep dive into the indic wikipedia projects.
[17:21:22] and see how they are doing.
[17:21:30] 20 of them.
[17:22:01] cool
[17:22:06] milimetric, these: https://meta.wikimedia.org/wiki/India_Access_To_Knowledge/Indic_Languages
[17:22:59] I want to begin with basic parameters like number of articles, views, edits, size of edits etc.
[17:23:13] how and where do I begin to fetch the data?
[17:23:25] is the first question.
[17:23:30] gotcha
[17:23:30] geohacker: I might be able to help a bit as well
[17:23:42] I work with Grantmaking and programs
[17:23:44] yeah, erosen has the most complete and up to date stats along those lines
[17:23:49] doing similar stuff
[17:23:51] erosen, hello! good to know that.
[17:23:54] awesome.
[17:23:59] i'm finding some links right now
[17:24:11] so I was poking around ToolServer and realized that we can run stuff on it.
[17:24:27] but thought I would find existing things rather than writing scripts from scratch.
[17:24:34] yeah
[17:24:37] I mostly work on making the visualization engine behind these charts. We're still working on collecting and making all the data easily searchable
[17:24:39] there are a fair number of data sources
[17:24:52] milimetric, okay.
[17:24:59] erosen, right.
[17:25:05] i've actually never made much progress with toolserver (that is, making an account) and I have access to the internal DBs
[17:25:25] I don't have either.
[17:25:26] but to start you should check out stats.wikimedia.org
[17:25:32] been there.
[17:25:34] k
[17:25:51] check this out: http://gp-dev.wmflabs.org/datasources
[17:25:57] checking.
[17:26:05] once all of the datasources have loaded (takes 15s)
[17:26:12] stats.wikimedia.org has data since 2012
[17:26:15] filter on "indic"
[17:26:15] or late 2011.
[17:26:28] hmm stats.wikimedia should have older data
[17:26:36] perhaps some of the charts are out of date
[17:26:55] right.
[17:27:07] same with most of the data collection scripts on toolserver.
[17:27:23] but the data is "there" in some sense for most questions
[17:27:26] check out: http://gp-dev.wmflabs.org/graphs/indic_language_active_editors
[17:27:42] this comes from stats.wikimedia (through the csv files it creates on a server which I have access to)
[17:28:10] right.
[17:28:16] looks clean.
[17:28:44] that has data since 2001
[17:28:47] yeah
[17:29:04] and you can grab the csvs from that same site
[17:29:07] (thanks to milimetric)
[17:29:21] perfect.
[17:30:00] you can also check out language specific graphs about recent geography based editing patterns
[17:30:36] okay.
[17:30:46] let me share the list of things that I want right now.
[17:30:47] http://gp-dev.wmflabs.org/graphs/hi_top10
[17:30:50] 1 sec.
[17:30:55] ooh - a map of india would be very cool guys
[17:31:05] broken down by states
[17:31:07] we're not quite at the level of a map yet
[17:31:17] but I do have edits by city sitting around in a database
[17:31:18] but the languages map to states pretty much, right?
[17:31:29] well, in theory, yeah
[17:31:42] here: http://openetherpad.org/2T6EAvlqGG
[17:31:47] I actually don't know much about the details of indic languages
[17:31:48] milimetric, that's the plan
[17:31:53] I make maps :D
[17:32:13] milimetric, languages won't map to states in India.
[17:32:14] cool, if you find a projection for India I'd love to take a look
[17:32:15] are you a d3 hacker by chance?
[17:32:17] we use d3
[17:32:20] i was gonna say the same thing
[17:32:28] erosen, been doing a lot of stuff with d3 lately.
[17:32:59] milimetric, figured that. Limn looks really interesting.
[17:33:21] it's getting there :)
[17:33:30] erosen, milimetric: here's the list - http://openetherpad.org/2T6EAvlqGG
[17:33:50] milimetric, I shall jump in after I get this project out of the way.
[17:33:58] geohacker: the etherpad link is hanging. sure it works?
[17:33:59] nvm link works
[17:33:59] ohhh great
[17:34:00] yep, I took a look. You should browse erosen's datasources and fill out the etherpad when you find an answer
[17:34:15] okay cool.
[17:34:28] I managed to find a few from erosen's link.
[17:34:31] we have some d3 map viz stuff working, but we've mostly been using country level choropleths using the ISO country ids
[17:34:33] will link it back
[17:34:43] yeah, you're more than welcome aboard geohacker. The project's written in coco right now but we're hardly using anything fancy from the language
[17:34:47] you know javascript?
[17:34:53] erosen, oh right.
[17:34:53] geohacker: I believe all of the metrics you're interested in exist
[17:34:58] milimetric, yes.
[17:35:05] erosen, perfect.
[17:35:21] so I'll fill in the etherpad from your link
[17:35:30] cool, so let me know what you think about https://github.com/satyr/coco
[17:35:36] where do I look for Article size?
[17:35:49] milimetric, sure. bookmarking.
[17:35:49] It's an open debate whether to keep Limn in coco or move it back to JS
[17:36:03] milimetric, I like JS.
[17:36:16] but that's completely personal.
[17:36:26] have you tried coco?
[17:36:59] milimetric, a bit. didn't get really fond of it though.
[17:37:00] I felt the same way, and after about a week of coco I could definitely see a lot of advantages. Limn's hosted on github so you can take a look for yourself: https://github.com/wikimedia/limn
[17:37:12] I thought it diluted my ideas of JS
[17:37:13] and I'm always glad to do a google hangout or something, walk through the code
[17:37:21] hello geohacker
[17:37:24] milimetric, that would be awesome!
[17:37:33] RagePanda, oh hey
[17:37:34] cool :) just ping me whenever
[17:38:02] milimetric, great.
[17:38:19] erosen, so let me go back to the links and ping you in a bit?
[17:39:42] what is "mean edits"?
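On the map idea from a few minutes earlier: erosen's d3 map work so far is country-level choropleths keyed on ISO ids, while geohacker wants India broken down by states. A hedged sketch of what that could look like with the d3 v3-era geo API; the TopoJSON file, its `states` object, and the `editsByState` lookup are all invented for illustration.

```js
// Assumes d3 v3 and the topojson client library are loaded in a browser page.
// "india_states.json", its `states` object, and editsByState are hypothetical.
var width = 600, height = 600;
var editsByState = { /* e.g. "IN-KA": 4200, ... */ };

var projection = d3.geo.mercator()
    .center([82.8, 22.6])                  // roughly centered on India
    .scale(1000)
    .translate([width / 2, height / 2]);
var path = d3.geo.path().projection(projection);
var color = d3.scale.linear().domain([0, 5000]).range(["#f7fbff", "#08306b"]);

var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);

d3.json("india_states.json", function (error, topo) {
  if (error) throw error;
  svg.selectAll("path")
      .data(topojson.feature(topo, topo.objects.states).features)
    .enter().append("path")
      .attr("d", path)
      // shade each state by its edit count, defaulting to zero
      .style("fill", function (d) { return color(editsByState[d.id] || 0); });
});
```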
[17:43:18] geohacker: sorry for the lapse. the column names are taken directly from the per-language tables at stats.wikimedia.org
[17:43:22] for example http://stats.wikimedia.org/EN/TablesWikipediaEN.htm
[17:43:32] "H = Mean number of revisions per article"
[17:44:17] okay.
[17:44:50] the descriptions are at the bottom of the really big table at the top
[17:45:01] found them. great.
[17:46:26] erosen, milimetric: so here's the revised list http://openetherpad.org/2T6EAvlqGG
[17:46:48] I've moved the ones which I couldn't find to the bottom of the pad.
[17:47:17] cool
[17:47:38] so I think i have that data around but in the script that updates the dashboard, I don't create a graph for those fields
[17:47:40] let me check
[17:48:03] cool.
[17:48:56] also, can we pack these scripts and run them off toolserver?
[17:49:16] so that even after my analytics exercise, people can take the code and give it a spin.
[17:50:51] hmm
[17:51:05] geohacker: it's not clear how that would work quite yet
[17:51:19] basically the pipeline for these metrics starts out with the xml dumps
[17:51:36] okay.
[17:51:41] which then get processed by some perl scripts written by Erik Zachte
[17:52:01] and then i just grab the output of those scripts and aggregate them and put them on dashboards
[17:52:02] right. saw those scripts on stats page.
[17:52:35] erosen, hmm.
[17:52:56] these dashboards are updated monthly, when we get new data from Erik
[17:53:13] what if we tell people where these generated csvs live?
[17:53:16] i also have some scripts to do this myself, using the databases, which could be done on toolserver
[17:53:21] but it is rather intensive
[17:53:24] erosen, so that's manual?
[17:53:33] right. can imagine.
[17:53:40] which part is manual?
[17:53:53] also, regarding new editors: http://gp-dev.wmflabs.org/graphs/indic_language_new_editors
[17:53:58] updating the dashboards every month?
[17:54:12] * geohacker clicks
[17:54:39] it is currently manual, only because for the big languages it can take a variable number of days to finish
[17:54:49] however, I should just make a cron job as a backup
[17:55:01] hmm okay.
[17:55:09] consider it automated, if there is a need
[17:55:20] mornin
[17:55:21] it's a todo
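A toy sketch of the aggregation step erosen describes (grab the per-language outputs of the wikistats perl scripts, sum them, feed a dashboard). The file layout, filenames, and column order are made up for illustration; the real csvs almost certainly differ.

```js
// Node sketch: sum per-language monthly counts into one series.
// Paths and column positions are hypothetical.
var fs = require("fs");

var languages = ["hi", "ta", "bn"];   // a few of the 20 indic wikipedias
var totals = {};                      // month -> summed active editors

languages.forEach(function (lang) {
  var rows = fs.readFileSync("csvs/" + lang + "wiki.csv", "utf8")
      .trim()
      .split("\n")
      .slice(1);                      // drop the header row
  rows.forEach(function (row) {
    var cols = row.split(",");
    var month = cols[0];
    var activeEditors = parseInt(cols[1], 10);
    totals[month] = (totals[month] || 0) + activeEditors;
  });
});

// A dashboard update script would write `totals` back out as a csv.
console.log(totals);
```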
[17:56:06] erosen, cool. so new editors is done. articles, new articles, size?
[17:56:32] i'm a bit puzzled
[17:56:33] how's progress toward the metrics meeting?
[17:56:59] hey dschoon
[17:57:02] erosen, did I confuse you with too many questions? sorry!
[17:57:02] hihi
[17:57:03] I know those numbers are on stats and I've grabbed them before, but I'll need to find which file they wind up in
[17:57:07] I pushed out the latest data
[17:57:09] didn't update the code
[17:57:11] woo
[17:57:14] ok.
[17:57:19] oh crap - forgot to cc you on the email to EM
[17:57:19] seems reasonable.
[17:57:28] geohacker: np, i think I'm following
[17:57:48] erosen, alright. do you think we should take this over email?
[17:58:16] sure, can't hurt
[17:58:33] I've got some other stuff to do in a couple minutes anyway
[17:58:55] also it might be useful for you to connect with jessie wild, whom I also work with and who has embarked on an almost identical project
[17:59:09] i'm not really sure where we stand on a few things
[17:59:09] I don't have the links handy, but I will CC her on e-mail
[17:59:15] erosen, okay. I'll post the list in the email, so we can bust them as and when you find the dumps.
[17:59:18] what was the result of the long thread about x-mf headers?
[17:59:26] great
[17:59:27] i'm still a little fuzzy-headed. apologies
[17:59:30] erosen, perfect. sounds great. thanks!
[17:59:37] np, glad to help
[18:00:25] erosen, will wait for your email and I'll respond with the list.
[18:00:40] geohacker: what is your e-mail?
[18:00:50] erosen, in your DM.
[18:00:59] whoops
[18:01:00] got it
[18:01:11] :)
[18:05:49] erosen, apart from basic parameters and geography of edits, I wanted to make a story out of the edit wars.
[18:06:55] visualizing them and finding traces of puppetry and claims.
[18:09:13] nice
[18:09:23] i'd be happy to help direct you to useful sources
[18:09:41] what exactly is your plan for analyzing this stuff?
[18:09:47] geohacker: ^^
[18:10:04] "server error"
[18:10:42] erosen, we don't have an exact plan of sorts right away, but would love to see the data and see what we can make out of it.
[18:11:04] gotcha
[18:11:05] the indic mailing lists have witnessed a lot of buzz around edit wars
[18:11:13] recently the punjabi wikipedia.
[18:11:39] I wanted to personally see how these are resolved and how people are taking this forward.
[18:11:47] i see
[18:11:49] positively or otherwise.
[18:12:10] yeah, i'm having trouble thinking of a useful automatically generated data source
[18:12:26] I'm convinced somehow that this will give us a better handle on the community.
[18:12:44] hmm.
[18:13:03] geohacker: we can do revert detection
[18:13:47] erosen, hmm that should do. can we also get the size of reverts?
[18:14:10] it is feasible
[18:14:18] but i'm not sure if it already exists
[18:15:06] erosen, okay. how do I go about checking that? do you think I can go hunt it while you are busy?
[18:15:20] i suspect others have done something like it
[18:15:28] but I don't know of any off the top of my head
[18:16:18] erosen, cool, this is not immediately required.
[18:16:38] what we need this week are the basic parameters I shared.
[18:16:45] responding to your email now
[18:16:45] gotcha
[18:21:55] erosen, done.
[18:22:33] milimetric, thank you and I'll keep you posted about how this is going.
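A sketch of the revert-detection idea erosen offered above, using the usual identity-revert heuristic: a revision is a revert if its content hash matches an earlier revision of the same page. The input shape, an ordered array of `{sha1, bytes}` per revision, is an assumption rather than an existing API; the byte delta gives a rough "size of revert".

```js
// Identity-revert detection over one page's revision history, in order.
// Each revision is assumed to carry a content hash and a byte size.
function findReverts(revisions) {
  var firstSeen = {};  // sha1 -> index of the first revision with that content
  var reverts = [];
  revisions.forEach(function (rev, i) {
    if (firstSeen.hasOwnProperty(rev.sha1)) {
      // same content as an earlier revision: this one restored it
      reverts.push({
        index: i,
        revertsTo: firstSeen[rev.sha1],
        // bytes undone relative to the immediately preceding revision
        sizeDelta: rev.bytes - revisions[i - 1].bytes
      });
    } else {
      firstSeen[rev.sha1] = i;
    }
  });
  return reverts;
}

// e.g. findReverts([{sha1: "a", bytes: 100}, {sha1: "b", bytes: 250},
//                   {sha1: "a", bytes: 100}])
//   -> [{ index: 2, revertsTo: 0, sizeDelta: -150 }]
```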
[18:48:09] hey folks – I have to skip the weekly scrum again, can you loop me in on anything important?
[19:02:08] robla, I'm gonna wfh after all. I'm still sniffling and I don't want to get anyone sick.
[19:03:11] k....makes sense
[19:15:14] okay pig zero udf works
[19:43:09] drdee: word on the street is that there is an up to date-ish diffdb for enwiki
[19:43:13] do you know anything about this?
[19:43:56] milimetric: do you have permission to invite me to the JS bootcamp event?
[19:44:05] i'll check
[19:44:41] added
[19:49:59] hey drdee_, erosen, I think you guys said it was easy to get the data behind this: http://en.wikipedia.org/wiki/File:Enwp_retention_vs_active_editors.png
[19:50:06] I'm making a graph of it in Limn
[19:50:22] we have the active editors data
[19:50:27] it's already in reportcard.
[19:50:40] right, the retention is what we need
[19:53:59] heh, that survey is like obsessed with "others": He/She needs the approval of others., He/She works too hard for others’ acceptance.
[19:54:41] erosen - I invited you to the JS Bootcamp, you got it right?
[19:55:58] robla - we have to play a game of chess. I have no idea if you're a gifted strategist
[19:56:07] yeah
[19:56:42] oh, that said, I should play all of you in chess. I feel like it would be cathartic for me to kick all your butts :)
[19:57:47] we need a genderless pronoun
[19:57:48] thon!
[19:57:55] i will play you in chess!
[19:58:16] http://www.qwantz.com/index.php?comic=2079
[19:58:32] drdee_: before i reply to the mobile thread
[19:58:38] we should chat about x-cs
[19:59:10] ok
[19:59:13] but later
[19:59:24] Welp. I want to reply sooner rather than later.
[19:59:33] in an hour or so?
[19:59:36] sure
[20:00:13] milimetric: no invite yet
[20:02:00] cool data explorer tool: http://explorer.datahub.io/#project/dataexplorer-684501
[20:02:47] i see nothing.
[20:03:15] is there a github project?
[20:04:31] thon must add data if thon want see things: http://explorer.datahub.io/#dashboard
[20:04:49] it's reclinejs http://reclinejs.com/demos/
[20:05:01] or a tool built using recline
[20:14:44] haha
[20:14:47] yeah.
[20:14:57] i just added CORS support for /data
[20:14:58] if you pull
[20:15:01] you can use it with limn.
[20:16:08] man, their graphs are butt-ugly.
[20:32:10] finished with survey!
[20:32:12] only took...
[20:32:20] 2 hours
[20:32:29] :D
[20:32:42] nice dschoon! Instead of /x/?
[20:32:47] No.
[20:32:52] They serve different purposes.
[20:32:56] we can't get rid of /x
[20:33:03] (as I said in an asana comment)
[20:33:12] The *responding* server needs to set the CORS header.
[20:33:25] there's no way for the page to say, "shut up and let me load whatever I want"
[20:35:49] ^^ milimetric
[20:35:50] what's this doing in datasource::dataUrl
[20:35:51] return url unless url
[20:36:13] gotcha, i don't know what CORS is, reading
[20:37:22] shouldn't we be doing this client side?
[20:58:12] CORS lets a server say it's okay for clients to abrogate the cross-origin restrictions
[20:58:30] so XHR can make requests to another host
[20:59:17] the `return url` thing was so when url is unset/empty, the datasource isn't valid
[20:59:33] otherwise it would get a cachebuster appended, and the isValid check would pass
[20:59:36] ^^ milimetric
[20:59:48] ps. i'm in a meeting about search at the same time. sorry about the lag.
[21:00:07] np - search is impt
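For readers in milimetric's position ("i don't know what CORS is"): the responding server opts in by sending the Access-Control-Allow-Origin header, exactly as dschoon says; the page itself cannot waive the restriction. A minimal sketch in plain Node; the /data route, port, and wildcard origin are illustrative, not Limn's actual setup.

```js
// Minimal Node http server showing the server-side CORS opt-in: the response
// carries Access-Control-Allow-Origin, so a page on another host can XHR here.
var http = require("http");

http.createServer(function (req, res) {
  if (req.url.indexOf("/data") === 0) {
    res.setHeader("Access-Control-Allow-Origin", "*");  // or one trusted origin
    res.setHeader("Content-Type", "application/json");
    res.end(JSON.stringify({ ok: true }));
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(8000);
```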
[21:47:18] milimetric: dschoon: I'm breaking up the "Add retention vs. active editors graph on reportcard, implement log/linear scale switching" into three tasks.
[21:47:23] it looks like Asana puts these automatically in the current sprint, and I don't think that's necessary
[21:47:49] sounds good.
[21:47:51] also, brb lunch
[21:47:52] oh you can drag them around wherever you'd like. But for analysis, it's a good idea to leave them here so we can see how we did
[21:54:14] can I link/reference one task from the comment of another task?
[21:54:48] the URLs are stable
[21:54:50] so i think so
[21:55:30] (i realized we have a meeting in 5, so lunch after)
[21:59:25] ...or not. :)
[22:00:00] oh, heya
[22:00:19] dschoon: I figured you were gone, and didn't want to wait around
[22:00:20] friday is fine
[22:00:23] better to be f2f
[22:00:28] no worries.
[22:00:33] so long as it's not a pain for you
[22:00:39] nope, it's fine
[22:01:21] I put it on Friday just in case you have the lingering sort of cold/flu/ook
[22:01:30] sounds good.
[22:02:14] then i shall actually brb for food now :)
[22:08:28] Asana converts asana.com/blahblahtaskblah into a proper hyperlink to the title. that's a nice surprise
[22:15:50] back
[22:15:55] nice.
[22:20:52] ottomata: is there much we can do with oozie/pig while hue is down?
[22:21:35] yeah
[22:21:40] it's totally workable
[22:21:43] you just can't use the gui
[22:21:46] cool. wanna do that now?
[22:21:51] but, my head is in getting it back up
[22:21:55] or whenever you're not in the middle of something?
[22:21:59] yeah, totes.
[22:22:02] just ping me when you're free.
[22:22:03] or at least getting a commit in that will be reviewed tomorrow morning
[22:22:09] i'm going to finish this and then peace out for the day
[22:22:09] i have some little things i can do to help out dan
[22:22:12] so let's do that tomorrow
[22:22:16] sounds good.
[22:22:39] i'll knock off some stuff to help evan transition his dashboards.
[22:27:28] actually, dschoon
[22:27:32] ugh, this is going to take longer than I thought
[22:27:34] want a quick overview?
[22:27:36] aiight
[22:27:37] sure.
[22:27:45] hangout?
[22:27:51] or hm
[22:27:56] probably unnecessary
[22:28:03] hangout
[22:28:07] okay
[22:28:08] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc?authuser=1
[22:52:23] ty ottomata
[22:52:27] yup!
[22:52:42] drdee: since the best way to learn is by doing, are there any job tasks we don't have scripts for yet?
[22:52:49] (tasks which we do have data for?)
[22:53:00] not really :)
[22:53:16] what would be useful then?
[22:53:19] we need to talk about cubes, you me and dan
[22:53:25] and rollups
[22:53:27] and where
[22:53:35] da
[22:53:35] and how
[22:53:53] that's important AFAIK
[22:53:57] agreed.
[22:53:59] i mean AFAIC
[22:54:48] let's do that tomorrow after scrum
[22:54:55] since EST is almost done for the day
[22:56:53] sounds good
[22:56:57] after scrum, after the metrics meeting
[22:57:01] ja
[22:57:14] i solved the mystery of the unencoded space in urls
[22:57:27] it's mostly content-type, yes?
[22:57:34] text/html; charset=utf8
[22:57:38] (which is legal)
[22:57:38] no
[22:57:40] in the url
[22:57:43] ooo
[22:57:44] introducing the tab as field delimiter exposes new bugs
[22:57:46] where from?
[22:57:51] read the mail :)
[22:57:56] either nginx or mediawiki
[22:58:05] aiight
[22:58:26] emphasis on exposes, the tab did not introduce new bugs
[22:58:41] ja
[23:00:49] are they all *actually* hits from the googlebot?
[23:01:03] it could well be that the bot is sending invalid requests
[23:02:01] ^^ drdee
[23:02:04] ^^ drdee_
[23:02:20] 1sec
[23:03:15] taking the first url and running it in wget
[23:03:19] wget "https://en.wikivoyage.org/wiki/User talk:Aaron Schulz"
[23:03:19] --2013-02-06 18:02:39-- https://en.wikivoyage.org/wiki/User%20talk:Aaron%20Schulz
[23:03:20] Resolving en.wikivoyage.org... 208.80.154.243, 2620:0:861:ed1a::13
[23:03:22] Connecting to en.wikivoyage.org|208.80.154.243|:443... connected.
[23:03:22] HTTP request sent, awaiting response... 301 Moved Permanently
[23:03:35] so the 301 seems correct
[23:03:50] nginx does not seem to encode the spaces
[23:04:02] but not 100% sure
[23:05:12] as i said, it's between 2k and 5k per day
[23:10:10] hm
[23:11:54] my sample only showed google bot hits, i will poke around a bit more to see if other user agents have the same behavior
[23:20:58] it only seems to happen on the combination of User talk pages on Nginx visited by Google bot, and then not always either
[23:23:36] even more specific: it's the combination of Google bot and a User talk page that gives a 301 response code on Nginx, and then still not always
[23:23:49] dschoon, milimetric ^^
[23:29:27] this is pretty cool/scary: http://blog.krisk.org/2013/02/packets-of-death.html
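A closing sketch of how the unencoded-space bug investigated above could be counted once tab is the field delimiter: any raw space inside the URL field marks a malformed record. The log filename and the URL's field position are hypothetical and depend on the actual log format.

```js
// Node sketch: count requests whose URL field contains a raw space in a
// tab-delimited sampled log. Filename and field index are assumptions.
var fs = require("fs");
var readline = require("readline");

var rl = readline.createInterface({
  input: fs.createReadStream("sampled-1000.tsv")
});
var bad = 0, total = 0;

rl.on("line", function (line) {
  total++;
  var url = line.split("\t")[8];               // URL field position is a guess
  if (url && url.indexOf(" ") !== -1) bad++;   // e.g. ".../User talk:..."
});
rl.on("close", function () {
  console.log(bad + " of " + total + " requests have unencoded spaces");
});
```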