[15:00:28] I just received "Your video call is full" when trying to join for standup :-(
[15:02:10] milimetric, drdee hi
[15:03:42] milimetric: standup
[15:04:00] oh crap
[15:04:00] I lost internet
[15:04:19] ok
[15:04:26] we'll hangout
[15:07:58] morning guys
[15:09:15] hey drdee
[15:18:46] good morning everyone
[15:19:30] qchris: could you +1 that patch again pls
[15:20:16] i really want ops to get on it, and it would be good to have analytics officially on board :)
[15:20:22] will post a link in a sec
[15:20:42] yurik: we are having our scrum meeting now
[15:20:47] yurik: be patient
[15:21:06] yurik: I am in a meeting currently. I'll get back to you later about the code review.
[15:21:35] np, thx
[15:38:55] I'll have a look at the patch after eating.
[15:39:16] Mhmm ... that should not go to the public... :-)
[15:39:29] Well ok ... I confess to being hungry :-)
[16:16:30] (PS17) Stefan.petrea: [not ready for review] A step towards 818 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91207
[16:16:51] (PS18) Stefan.petrea: [ready for review] A step towards 818 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91207
[16:30:11] drdee: Since yur-ik asked for a review again, but I did not yet receive your response by email ... What problem would the additional fields of
[16:30:15] drdee: https://gerrit.wikimedia.org/r/#/c/93006/2
[16:30:19] drdee: solve?
[16:31:49] that we would not have to implement it ourselves
[16:32:18] but talk with yurik about this and see if the two of you can come to a shared agreement
[16:34:00] ?
[16:34:08] Ok.
[16:38:01] qchris: we can talk now if you want
[16:38:09] hangout?
[16:38:24] Sure.
[17:12:28] hey milimetric
[17:12:31] can you join batcave?
[17:12:34] i'm looking at some data
[17:13:31] yes - in 1on1 now
[17:46:33] milimetric: come to me!
[19:59:48] We still have the story grooming entry in our calendar from the old schedule.
[20:00:19] Do we have story grooming today (or on Mondays, as we discussed some days ago)?
[20:04:24] (CR) Milimetric: "(9 comments)" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91207 (owner: Stefan.petrea)
[20:08:54] qchris_away: story grooming is dead
[20:09:00] I'll clean up the calendar
[20:09:15] we've replaced it with the weekly tasking meeting
[20:09:37] that meeting is usually on Mondays, but since this Monday is a holiday, we'll do it on Tuesday next week.
[20:11:28] qchris_away: I don't see any story grooming meeting on the Team Analytics calendar, or on my personal calendar. So you might have some old entry on your personal calendar.
[20:11:35] average: I reviewed your change ^
[20:11:51] I look forward to your new patchset
[21:39:01] DarTar & IH|lunch: I'm looking into deleted pages. I'm wondering if you guys have already done any work with the archive and logging tables with regard to deleted pages.
[21:39:50] halfak: yes, but a while ago. What are you interested in, specifically?
[21:40:32] I'm going to try to build some time series of page creation/deletion rates, and hopefully categorize by deletion reason as well.
[21:40:53] Basically, I want to replicate: http://www-users.cs.umn.edu/~lam/papers/lam_group2009_wikipedia-longer-tail.pdf
[21:41:04] And then focus on newcomers & AfC
[21:41:18] right
[21:41:20] Any advice or resources would be much appreciated.
[21:41:29] let me look up something quickly
[21:42:36] this should be the standard way to capture deletions from the logging table: SELECT DATE(log_timestamp), COUNT(*) FROM logging WHERE log_type = 'delete' AND log_action = 'delete' AND log_namespace IN (0) AND log_timestamp >= 'foo' GROUP BY 1;
[21:44:59] the page curation log is also relevant for software-supported tagging of pages
[21:45:10] Thanks for that. What's the deal with "log_namespace IN (0)"?
[21:46:47] that was to filter ns0 deletions only (I think, I'd need to double check)
[21:47:09] page curation: log_action = 'delete' AND log_type = 'pagetriage-curation'
[21:47:47] I'll get some of this documented.
[21:48:11] I get a little nervous every time I start reverse engineering MediaWiki's database.
[21:48:26] yes, I had a bunch of queries filtering deletions across two or more namespaces, hence the IN clause
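Putting DarTar's two filters side by side, a sketch of what the daily deletion-count queries might look like. The timestamp cutoff is a made-up example standing in for 'foo', and neither query has been verified against a live schema:

    -- Daily count of manual deletions in the main namespace (ns0).
    -- '20130101000000' is a hypothetical cutoff; MW timestamps are
    -- 14-character YYYYMMDDHHMMSS strings.
    SELECT DATE(log_timestamp) AS day, COUNT(*) AS deletions
    FROM logging
    WHERE log_type = 'delete'
      AND log_action = 'delete'
      AND log_namespace IN (0)
      AND log_timestamp >= '20130101000000'
    GROUP BY 1;

    -- The same shape using DarTar's Page Curation filter instead:
    SELECT DATE(log_timestamp) AS day, COUNT(*) AS deletions
    FROM logging
    WHERE log_type = 'pagetriage-curation'
      AND log_action = 'delete'
      AND log_timestamp >= '20130101000000'
    GROUP BY 1;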
[21:48:28] Thanks for your help.
[21:48:48] np, btw this is relevant for the key metrics proposal we were discussing yesterday
[21:48:58] I'm starting a stub on Meta this afternoon
[21:49:01] milimetric: Sorry. My bad. Thanks for your explanation.
[21:49:23] gotta run, bbl
[21:50:24] np
[21:51:11] DarTar & halfak - sounds like cool work, I'd love to see what you come up with
[21:51:41] :) Hopefully I'll have some preliminary stuff tomorrow. MySQL willing. ;)
[21:51:42] and halfak - quick question: how valuable would it be to be able to do this query cross-projects?
[21:51:56] That would be utterly amazing.
[21:52:30] ok, yeah, I'm beyond a doubt convinced that we have to get the MediaWiki DBs into Hadoop asap
[21:52:53] every time I ask the cross-projects question I get responses like that ;)
[21:52:58] But you would win a huge amount of points by just allowing me to run MapReduce on a union of the 'archive' and 'revision' tables.
[21:53:04] lol
[21:53:22] One of the big issues right now is that it's another step to run my enwiki analysis on other wikis.
[21:53:32] That prevents me from doing some explorations that ought to happen.
[21:56:13] DarTar: When you get back, I need to give you a hard time about log_timestamp >= "foo"
[22:16:59] halfak: I've been doing some side-project work on getting Sqoop to load these DBs into Hadoop
[22:17:08] the only big problem is that varbinary is not supported by Sqoop
[22:17:13] and we use varbinary all over the place
[22:17:44] That's surprising, since it's just a string of bytes.
[22:17:57] Can you have Sqoop read from a view?
[22:18:02] Or just from a query?
[22:18:13] it can do direct table reads or queries
[22:18:21] You could convert the type.
[22:18:25] there are a few limitations on what it can do with a view
[22:18:33] but I think it can read it (same as a query really)
[22:18:47] yeah, I tried casting and converting to varchar but it doesn't seem to like that for some reason
[22:19:12] hmm... Maybe like a tinyblob?
[22:19:24] yeah, maybe. I've just gotta spend some time on it
[22:19:52] Yeah. Let me not keep you. But I'm down for some playing around with Hadoop as soon as you have something pulling the data in.
[22:19:56] one more wikimetrics thing to wrap up today and tomorrow, then hopefully I'll get on this more seriously
[22:20:09] :)
[22:20:32] cool, have a good night
[22:20:44] you too
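The view halfak floats here would look something like the sketch below. This is only the shape of the idea, not a tested workaround (milimetric notes above that casting to varchar didn't work for him), and the column list is an illustrative subset of the real logging schema:

    -- Hypothetical view converting varbinary columns to character data,
    -- so that an external loader like Sqoop sees a supported type.
    -- Column list is an illustrative subset, not the full schema.
    CREATE VIEW logging_export AS
    SELECT log_id,
           CONVERT(log_type USING utf8) AS log_type,
           CONVERT(log_action USING utf8) AS log_action,
           CONVERT(log_timestamp USING utf8) AS log_timestamp,
           CONVERT(log_title USING utf8) AS log_title
    FROM logging;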
[22:22:32] halfak: I'm back
[22:23:51] Hey DarTar. I'm still working out the details of how I'm going to work with the data. My plan right now is to create a clone and start building scripts to annotate the tables.
[22:24:18] Oh, I almost forgot. What's the purpose of log_timestamp >= "foo"?
[22:24:37] just FYI I had a great chat with Julie and I have some nuggets of wisdom to share about our hangout this Saturday
[22:25:18] Cool. Do you want to VoIP?
[22:25:23] Or just IRC?
[22:26:57] log_timestamp: MW stores timestamps as binary fields, and wrapping a timestamp in quotes lets you compare them much faster than using integers or date/time operators
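In other words: because MW timestamps are fixed-width 14-character YYYYMMDDHHMMSS strings, lexicographic order is the same as chronological order, so a quoted-string comparison works directly on the binary column. A minimal sketch, with hypothetical cutoffs in place of 'foo':

    -- String comparison on a fixed-width YYYYMMDDHHMMSS timestamp is
    -- also chronological comparison. Cutoff values are examples only.
    SELECT COUNT(*)
    FROM logging
    WHERE log_timestamp >= '20130101000000'
      AND log_timestamp < '20140101000000';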
[22:28:51] Julie gave me some great input, not just on the A/V setup but also on how to stream and potentially capture parts of the hangout
[22:29:07] (like howtos/tutorials)
[22:30:07] Oh excellent. We should definitely sync on that, as I haven't solidified the plan for Saturday morning yet.
[22:30:22] I'm thinking that we'll manage the hangout on air in similar fashion to the metrics meetings.
[22:30:44] Do you want to re-use the hangout for all of the presentations?
[22:30:53] yes, the trouble is: you can't set one up in advance
[22:31:04] they expire when you close your browser
[22:31:54] and the account creating the hangout on air is also the (only?) one who can stream, change settings and download the video
[22:33:10] That's dumb. Thanks, Google.
[22:33:49] and I've also learned how to change settings so your Google Apps account can invite external people to a hangout (seriously)
[22:34:33] Is it in your account settings?
[22:35:06] yes, under "customize new invites"
[22:37:36] Hmmm, not seeing that.
[22:38:18] I click on my face and then go to "accounts"
[22:38:37] sending out a note on the staff list
[22:38:51] kk
[22:59:29] halfak, sorry, just got back
[22:59:36] No worries.
[22:59:39] I've done some work with the archive table, yeah, but most of it is really basic
[22:59:52] actually I had that same stuff on my plate as a hazy, some-time-in-the-future project ;p
[23:00:02] OK. DarTar gave me a lot to work with.
[23:00:30] I'm putting together some datasets around deletions that I'll share.
[23:00:48] I'm erring on the side of high dimensionality at the moment, so it should be useful.
[23:10:55] * DarTar never gives people a lot to work with :p
[23:11:05] it's self-inflicted pain
[23:11:59] (and MW's logging tables are a real goldmine)
[23:13:19] "goldmine"
[23:13:28] Welcome, boys and girls, to MW's logging tables!
[23:13:28] ha.
[23:13:35] GASP! At the historical inconsistencies!
[23:13:42] MOAN! At the legacy artefacts!
[23:14:02] WAIL! At the weird text escaping in comments!
[23:14:15] SHRIEK! At the sheer volume and lack of documentation!
[23:14:24] All these things and more await you, so step right up and don't be shy!
[23:14:25] But imagine where we'd be without it. :)
[23:14:28] btw, Erik's suggestion at the end of my quick review of the AE analysis sounds totally feasible. We can extract the series for the various components and feed them into whatever data viz we want to use (including Limn)
[23:14:50] halfak: point. I wouldn't have learnt R's ReferenceClass system.
[23:14:54] Agreed, DarTar.
[23:15:08] Ewww. Python. Use Python!
[23:15:32] Ironholds: the reality is that nobody ever thought these logs would be used for research when they were designed or bulk populated
[23:15:58] halfak: get me the free time to learn Python, sure.
[23:16:10] DarTar: indeed; have I given you my rant about the 90/100 problem?
[23:16:23] hmm no?
[23:16:25] 90 of 100 software use cases are 'production'
[23:16:35] as a result, the changes that get authorised benefit production
[23:16:41] thus screwing the remaining 10 use cases, which are research
[23:16:43] gah, I need to find my internal provider of ibuprofen soon before it's too late
[23:16:49] cf. the switch of user_groups to a key-pair system.
[23:16:57] internal provider?
[23:17:02] admin
[23:17:14] I'm not supposed to disclose this on a public channel
[23:17:23] Wait. It wasn't a key-pair system before?
[23:17:33] well, if it's non-urgent we have like 500 generic ibuprofen
[23:17:36] but if you have a problem that can be solved with mild drugs they can be of help
[23:17:43] I'll be in tomorrow
[23:17:51] nah, I need it now
[23:17:55] and, no, it had a blob
[23:18:22] Oh.. Wait. What? It's not a varchar --> blob pair anymore?
[23:18:24] * halfak queries
[23:18:24] which would just read sysop,oversight,checkuser,rollback,whatever
[23:18:26] nope
[23:18:37] Oh, varbinary. W/e
[23:18:39] enjoy having to do a billion nested subqueries to reliably exclude someone from a dataset, bucko
[23:18:49] s/Product analyst/Chief MediaWiki Archaeology Officer/
[23:19:00] Hmm... Shouldn't need to
[23:19:01] lol
[23:19:13] DarTar, if you can convince Gayle to let me have that on my business card I'll take it
[23:19:19] she already gave me "more perceptive than the average bear"
[23:19:29] added to my todo
[23:19:31] hah
[23:19:49] Why are you doing subqueries with the user_groups table, Oliver?
[23:19:49] I haven't had any experience with that before
[23:19:56] whoops, Ironholds
[23:20:05] halfak: it's easy to say "gimme all the uIDs that are admins". It's harder to say "gimme all the uIDs that aren't admins"
[23:20:07] Why why why?
[23:20:08] The request must have dropped some packets
[23:20:12] :D
[23:20:16] Nah. LEFT JOIN.
[23:20:22] because it goes "hokay!" and gives you the uID for an admin, but for the admin's 'rollback' entry
[23:20:33] oh, guh.
[23:20:49] wait, okay, you may need to demonstrate this.
[23:20:54] SELECT user.* FROM user LEFT JOIN user_groups ON user_id = ug_user AND ug_group = 'sysop' WHERE ug_user IS NULL;
[23:21:21] I use that trick all of the time.
[23:21:25] * Ironholds headscratches
[23:21:29] SQL should have an explicit set minus.
[23:21:33] it should
[23:22:16] okay, that's a weird and conniving trick I'm stealing. grazie!
[23:26:08] No problemo hombre :D
[23:26:47] we just conversed in four different languages, one of them electronic.
[23:26:50] god bless the internet!
[23:26:59] sorry, sudo bless
[23:27:44] lol
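A footnote on milimetric's anti-join trick: standard SQL does define a set minus (EXCEPT), but MySQL didn't support it at the time, so NOT EXISTS is the other usual spelling of the same idea. A sketch that should be equivalent to the LEFT JOIN ... IS NULL query above:

    -- "All users who are not sysops", written with NOT EXISTS instead
    -- of LEFT JOIN ... IS NULL. Same result set; the optimizer often
    -- treats the two forms the same way.
    SELECT u.*
    FROM user u
    WHERE NOT EXISTS (
        SELECT 1
        FROM user_groups g
        WHERE g.ug_user = u.user_id
          AND g.ug_group = 'sysop'
    );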