[03:20:10] ewulczyn, you made the .lock file change, right? ;p [03:20:19] (not seen anything going wrong, but was looking at top and thought to check) [03:20:28] yes [03:20:51] yay! [03:20:59] strangely enough it led to a race condition and the whole process stopped. [03:21:19] I have not thought through the details, but I switched to kafkatee [03:21:32] * Ironholds nods [03:21:41] so I'm streaming data into my table continuously [03:21:43] weird! Conflicting attempts to create/kill the file? [03:21:45] yeah, makes sense. [03:21:56] Seems to be working on the resourcing front, at least. Hope it's working on the data transfer front! [03:22:10] oh, related; I had an idea for a study you might be interested in participating in/leading on [03:22:20] at what TOD in local time do people donate? [03:22:34] how is that distributed, and how does it vary by donation type (banner response, email response, sidebar link)? [03:22:41] I touched on it a while back but never got any time to dig in. [03:23:37] yeah, right now megan usually does not run tests for more than a few hours, but when she does it looks like mornings and weekends give the highest rate [03:23:51] also, I'm trying to use ua-parser on stat2 [03:24:01] okay? [03:24:08] but i get this error [03:24:09] from ua_parser import user_agent_parser [03:24:09] Traceback (most recent call last): [03:24:09] File "", line 1, in [03:24:09] File "/usr/lib/python2.7/dist-packages/ua_parser/user_agent_parser.py", line 431, in [03:24:09] yamlFile = open(yamlPath) [03:24:10] IOError: [Errno 2] No such file or directory: '/usr/lib/python2.7/dist-packages/ua_parser/regexes.yaml' [03:24:27] oh! yeah. Fun story... [03:24:35] the pypi version doesn't actually ship with the current yaml [03:24:41] (don't worry, we're...changing that. Ohhh yes.)
[03:24:53] you want to grab https://github.com/ua-parser/uap-core/blob/master/regexes.yaml [03:25:13] or better yet, https://github.com/Ironholds/uap-core/blob/master/regexes.yaml which detects the new IE version! [03:26:26] hmm I can't write to /usr/lib/python2.7/dist-packages/ua_parser/ [03:26:54] wait, what? That makes-how did you install it?! [03:27:03] oh, did someone pull the debianised version in? bugger. [03:27:11] it was there [03:27:29] I also ran pip install ... [03:27:34] Okay; grab pypi version, install locally, add yaml to /usr/lib/python2.7/dist-packages/etcetcetc/ [03:27:43] well, pip won't help because stat2 doesn't have an internet connection. [03:28:03] If you want a new library, you need to grab [library] and [library dependencies] and [library dependency dependencies] all the way down. Bah. [03:28:36] (locally - python setup.py install --user) [03:29:30] ok, I'll give it a shot. Also, how do I filter out bots? [03:29:35] Spiders? [03:29:40] oh, the device_family will be "Spider" [03:30:05] oh, I was thinking there were separate wiki bots that the package does not cover [03:30:10] there are! [03:30:26] https://github.com/Ironholds/WMUtils/blob/master/R/wiki_crawler.R ones I've so far identified [03:30:47] there are probably more but my hand-coding hasn't found em yet if so [03:30:52] (or they're too low-frequency to show up) [03:31:59] will these make it into the python package or should i filter these out by hand in my script? [03:32:18] the latter, or I'll bug halfak into adding them to the Python WM-Utilities [03:32:36] we're probably not adding Wikimedia-specific regexes to a generic UA parser; it's just slow for everyone-but-us as a result. [03:33:48] ok, it's simple, but I was hoping there was a way to have the filters update when you make additions [03:34:07] automatically? Nope. I guess, hmn.
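A minimal sketch of the two-stage bot filter discussed above. The "Spider" device family is what ua-parser reports for known crawlers, and the `parsed` dict follows the shape returned by `ua_parser.user_agent_parser.Parse()`; the Wikimedia-specific patterns below are invented placeholders, not the real hand-collected list from WMUtils' wiki_crawler.R.

```python
import re

def is_spider(parsed):
    """True if ua-parser classified the agent's device family as a crawler."""
    return parsed["device"]["family"] == "Spider"

# Placeholder patterns for wiki-specific crawlers a generic UA parser does not
# cover; the actual list lives in wiki_crawler.R. These are invented examples.
WIKI_CRAWLER_PATTERNS = [r"wikiwix", r"WikiProxyBot", r"DumpGrabber"]
_wiki_crawler_re = re.compile("|".join(WIKI_CRAWLER_PATTERNS), re.IGNORECASE)

def is_wiki_crawler(user_agent):
    """True if the raw UA string matches one of the wiki-specific patterns."""
    return bool(_wiki_crawler_re.search(user_agent))
```

Filtering "by hand in my script", as suggested in the log, then amounts to dropping any request where either check fires.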
[03:34:09] * Ironholds thinks [03:34:26] As part of the wider reorg I can reclaim the pypi repo and get people to start actually debianising things again [03:34:31] that'll keep the system-wide version up to day. [03:34:33] *date [03:34:50] but the lack of an internet connection means...well. It's not something !puppet or manual labour can do. [09:14:10] Hi. My name is Lars Roemheld, I'm a graduate student currently doing research on Wikipedia user talk pages. [09:14:22] welcome, roemheld! [09:14:42] i have a specific database question: does anyone know how user-talk pages are identified to the user? [09:15:07] what do you mean by 'identified to the user'? [09:15:20] I did not find any foreign keys -- is the only identification via page_title==user_name ? [09:15:29] thanks, YuviPanda :) [09:15:34] ah [09:15:35] yes, it is [09:15:36] :) [09:16:05] who came up with that? string-valued foreign keys? ;) [09:16:20] thanks -- that helps! [09:16:26] heh, MW's DB design isn't exactly the paragon of normalization :) [09:16:39] wait till you try to find the text of a revision in the database! :) [09:17:01] oh i tried that... [09:17:07] hehe :) [09:17:37] btw... is the text table only restricted in the public database, or is there something fancy going on in production as well? [09:17:52] there's something fancy going on in prdo too [09:17:54] *prod too [09:17:58] we've separate machines for those [09:18:31] roemheld: also, do check out quarry.wmflabs.org, useful for quick exploration, etc of SQL against the dbs :) [09:18:36] ah that makes sense. But it's still only a MariaDB table? [09:18:43] yup [09:18:43] it's [09:18:46] just tuned very differently [09:20:02] awesome. thanks a lot for the quick help, much appreciated! [09:21:01] yw [09:24:34] can't log into quarry for some reason -- is there any major difference between using that and going to the DB directly? [09:24:45] nope [09:24:54] just a web interface for folks who don't want to use the CLI, etc [09:25:50] ah ok.
I configured an SQL client, so that's the same then [09:25:57] back to data wrangling. Thanks again! [09:26:00] :D [09:26:00] ok [09:26:03] yw! [09:26:09] roemheld: feel free to poke me if you've more questions [09:26:14] roemheld: I'm one of the sysadmins for toollabs / labs [09:28:31] oh awesome. I might get back to you on that one :) [09:28:51] :) [09:48:18] out of curiosity, even simple queries seem relatively slow. Is load heavy on the labs machines? Am I underestimating database complexity? [09:48:29] what queries are you running? [09:48:40] there's revision_userindex if you're hitting the revision table and querying by user [09:50:06] revision count of user talk page for all users who registered in 2013 [09:50:26] roemheld: yeah, use revision_userindex? [09:50:35] roemheld: revision by itself doesn't have index on userid nor username [09:50:36] "where year(user_registration) = 2013" is probably part of the problem [09:50:43] oh :) [09:55:32] wait -- what would i need a userid index on revision for? i'm joining USER on PAGE by page_title, and then joining REVISION by page_id. Is that wrong? [09:55:45] roemheld: hmm, no, that's right [09:55:51] roemheld: let me find you our explain tool [09:56:40] oh no worries I can use EXPLAIN :) Was just wondering if the database is supposed to be really fast or not [09:56:57] roemheld: https://tools.wmflabs.org/tools-info/optimizer.php [09:57:03] roemheld: no, you don't have EXPLAIN rights on the db :) [09:57:07] roemheld: it's supposed to be fairly fast, yes. [09:57:09] has ssds, etc [09:59:28] maybe it's bedtime -- the optimizer page won't load here, request times out :D [10:01:19] hehe :) [10:01:20] ok [10:04:35] my teammate cannot load optimizer either -- something's broken there ;) [10:06:05] either way, I'm giving up for today. Appreciate your help. I might be back ;) [10:06:10] Have a good one! [10:06:16] roemheld: night! [15:55:31] morning [16:11:43] Hey Ironholds [16:11:45] :) [16:11:55] hey halfak! How goes? 
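The schema quirks discussed above (user talk pages linked to users only via `page_title == user_name`, and the slow `year(user_registration) = 2013` filter) can be reproduced in a toy example. This uses an in-memory SQLite stand-in with an illustrative, heavily simplified schema, not the real MediaWiki tables; namespace 3 is User_talk. A range predicate on the timestamp string keeps the registration filter index-friendly, unlike wrapping the column in `year()`, which defeats any index.

```python
import sqlite3

# Toy schema mimicking the join under discussion; data is made up.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user (user_id INTEGER, user_name TEXT, user_registration TEXT);
    CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
    CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER);
    INSERT INTO user VALUES (1, 'Alice', '20130415120000'), (2, 'Bob', '20120101000000');
    INSERT INTO page VALUES (10, 3, 'Alice'), (11, 3, 'Bob');
    INSERT INTO revision VALUES (100, 10), (101, 10), (102, 11);
""")

# Revision count of the user talk page for users who registered in 2013.
rows = con.execute("""
    SELECT u.user_name, COUNT(r.rev_id) AS talk_revisions
    FROM user u
    JOIN page p ON p.page_namespace = 3 AND p.page_title = u.user_name
    LEFT JOIN revision r ON r.rev_page = p.page_id
    WHERE u.user_registration BETWEEN '20130101000000' AND '20131231235959'
    GROUP BY u.user_name
""").fetchall()
print(rows)  # [('Alice', 2)] -- Bob registered in 2012 and is filtered out
```

On the Labs replicas the same shape of query would additionally want `revision_userindex` when filtering revisions by user, as YuviPanda notes.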
[16:12:05] Not bad. Feeling much more human today. [16:12:22] Jenny told me that I have some big dark circles under my eyes though. [16:12:26] I wear them with pride. [16:12:34] I have a grad-student beard! [16:12:37] * Ironholds fistbumps [16:12:40] Woo! [16:12:42] THESE THINGS WE DO, WE DO FOR SCIENCE! [16:12:52] (and Movember) [16:13:03] oh yeah, it's November. Eh, coincidence. [16:14:00] :D [16:14:19] Legal asked me to take a look at block logs for Lila [16:14:26] result: I'm looking at code I wrote in January 2013 [16:14:28] * Ironholds sobs gently [16:14:34] Why. Why did I do the things that I did. [16:14:55] This is good. [16:15:04] whyso? [16:15:14] The day that you look at old code and think it is pretty good is the day you stopped learning and getting better [16:15:16] I mean, it's nice to see how much progress I've made, buuut. [16:15:17] heh [16:15:30] there's only one thing I wrote in all of history that I think is *kisses fingers* perfect [16:15:34] Progress is progress. [16:15:43] and that's df <- df[sample(1:nrow(df),size),] [16:15:45] Oh yeah. What's that? [16:15:51] random row-sampling from a data.frame or data.table [16:15:52] nice [16:15:59] (sample() only handles primitives) [16:16:08] you know that you can do sample(df, size), right? [16:16:19] Oh wait. You're right [16:16:22] derp [16:16:32] you'd think someone would just have written sample.data.frame [16:16:42] since Sample itself is just a useMethods(class(x)) call [16:16:44] Indeed. [16:16:47] actually, maybe I should write it. [16:16:57] it seems well-duh inducing, and it's like 4 lines of code. [16:17:20] :) You keeping track of these contributions on your CV? [16:17:40] ua-parser architect, R-core committer? Not yet. I will! [16:17:45] +1 [16:18:12] * halfak is bad at putting code contribs on his CV :( [16:20:04] oh gods, except sample is actually a Proper Function and not a series of methods [16:20:08] this'll require some thinking. [16:22:18] I need to send coheed and cambria money.
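The R one-liner quoted above, `df <- df[sample(1:nrow(df), size), ]`, draws whole rows uniformly at random. A Python analogue, purely for illustration, with the table held as a list of row tuples:

```python
import random

def sample_rows(rows, size, seed=None):
    """Return `size` distinct rows drawn uniformly at random,
    mirroring R's df[sample(1:nrow(df), size), ] idiom."""
    return random.Random(seed).sample(rows, size)

table = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
picked = sample_rows(table, 2, seed=0)
```

Like the R version, this samples without replacement, which is why it needs no bookkeeping beyond `sample()` itself.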
They got me through all that paper writing. [16:22:38] Them and random Super Metroid music remixers on SoundCloud. [16:22:48] wait, you're a Coheed fan?! [16:22:53] how did I not know this? [16:23:21] I'm a big Coheed fan. Like: Once queued for 14 hours for one of their concerts, moderator on Cobalt and Calcium, the Amory Wars sketchbook in my bookcase, kind of big. [16:26:09] :) Yup. I've had a couple cycles so far (cycle = going from "woo" to "meh" and back to "woo") [16:26:18] yup, ditto [17:32:05] Research group meeting! [17:32:16] Ironholds, ^ [17:32:20] Where's everyone else? [17:49:50] halfak, said I couldn't make it! Sorry :( [19:52:50] nuria__: re ContentTranslation [19:53:06] leila: yes [19:53:09] I see one entry in server-side-events for Nov. 12 [19:53:18] have you seen that? [19:53:19] in vanadium? [19:53:33] in /a/eventlogging/archive [19:53:54] leila: ah sorry in stat1002, no i had not seen it [19:54:12] let me check if we have a table for it in EL then [19:54:40] humm, still no table in log [19:56:02] actually, scratch that. it's not schema: "ContentTranslation" but title:"Extension:ContentTranslation" [19:56:38] I'm not clear how we should proceed with this. Is it on our plates or LEng's at this point? [19:56:41] leila: so likely events are not being sent, did they deploy their latest code? [19:57:11] leila: i do not think their latest code is deployed yet [19:57:15] I'm not sure [19:57:16] okay [19:57:41] leila: that is the one they have seen working on vagrant [19:59:09] nuria__: Joel's last email on Analytics says "Please let me know if there is any way I can help out or if there is anything you need from our end." [19:59:23] just want to make sure they're not waiting for us while we're waiting for them [19:59:50] leila: until they deploy their latest code I believe there is nothing for us to do [20:00:09] okay. IO' [20:00:13] I'll clarify this. [20:00:14] leila: other than troubleshooting the setup on beta [20:00:22] are you working on that?
[20:00:22] and for that we are going to need yuvi [20:00:27] okay [20:00:55] leila: I will ask yuvi today to see if he has any ideas but other than that i am not sure what else we can do [20:01:08] got it [20:01:36] okay. I'll wait for you to talk to him. and we can send one email with next steps, or any clarification needed. [20:14:40] thanks for the email, nuria__ [20:14:48] leila: np [20:17:44] hey YuviPanda, can we create a Phabricator project page for Quarry? I want to start logging feature requests there. [20:18:01] happy to create the page myself, if you don't mind [20:18:03] J-Mo: we've been asked by qgil to not create projects for things that already exist in bz [20:18:13] ah, it will migrate? [20:18:22] k [20:19:06] btw YuviPanda, I'm going to be using Quarry on Saturday to teach more starry-eyed young data scientists the wonders of SQL :) [20:19:13] J-Mo: w00t :) [20:19:17] J-Mo: awesome :) [20:19:59] AND… while I'm bugging you, YuviPanda have you seen this? http://data.stackexchange.com/ [20:20:17] the only comparable public SQL service I've found on the web [20:31:32] J-Mo, yay hanging with Nettrom! [20:31:50] oh and yeah I guess science is important too. I GUESS. [20:32:03] meh. [20:32:05] ;) [20:32:32] * J-Mo likes people more than data. [20:32:51] When J-Mo and Morten bump fists, a science gets its hypothesis tested. [20:33:06] lol [20:34:41] J-Mo: yeah, I found it later. surprisingly similar [20:35:47] I haven't tested it out yet, but I prefer the quarry interface. [20:36:02] :D [20:41:13] Ironholds, I just realized you set up a meeting for us to talk to DarTar about session data. [20:41:18] * halfak is very stoked. [20:41:27] yuppp [20:41:34] I'm hoping for a blog post [20:41:37] (like, an official one) [20:41:52] Yes. Oh say. I got the paper up on arxiv. [20:42:03] http://arxiv.org/abs/1411.2878 [20:42:14] I'll forward you the email so that you can "claim" it. [20:42:33] yay! [20:45:32] huh, that's interesting [20:45:57] what's up?
[20:47:47] I've been reworking my blocking research, right? [20:47:50] Legal asked for a version [20:48:06] and after spending a day sobbing at what Oliver in January 2013 thought was good code (DEAR GOD, MAN. DO YOU NOT KNOW HOW SCOPING RULES WORK?!) [20:48:24] I decided to look at how long between user_registration and time_of_first_block, over all time. [20:48:45] there's this weird bump if you log-scale the results, ditto if you smooth it. [20:49:31] in fact, in some ways with smoothing it looks a lot like the session stuff (deep fall, small rise, deep fall, smaller rise...) which isn't really shocking [20:50:00] you either survive or don't, and after that your survival probability increases until you hit the "and now we care about behavioural/community-fitting-in" stage of the wiki-life-cycle. [20:50:41] Oooh. [20:50:49] Temporal rhythms :) [20:50:55] it's kinda cool! [20:50:59] Makes sense. Please plot and share. [20:51:04] shall do! [20:55:26] halfak, sent to research-internal [21:05:12] nuria__: do you think we should pull in Ori? [21:12:07] halfak, leila, oh god this is really creepy [21:12:17] so I did what you suggested and histogrammed it, right? [21:12:28] okay! [21:12:34] the log-scale histogram is, after the initial dropoff from ~0-1 day, basically repeated gaussian distributions [21:12:47] :D [21:12:50] send it along [21:12:59] * Ironholds carves "GAUSS IS STALKING ME" into the wall, rips his copy of the session paper up, and runs out screaming about the numbers talking to him [21:13:30] send both log and no-log versions [21:13:47] if the no-log version is useless because of the spike, remove the first few days' data, and send the rest? [21:14:23] done! [21:15:40] the spike is actually continuous, in the sense that the long tail continues appearing until you trim about a year off the front (just with smaller numbers) [21:20:06] leila: just got back from lunch, pull ori for what? [21:20:18] for trouble-shooting EL? 
[21:20:31] i doubt ori has any time for that [21:20:51] besies i do not think anything is wrong with EL, other events work just fine [21:20:56] *besides [21:21:18] I see. what are next steps? [21:21:26] I think there are several things at play: 1. testing 2. beta 3. sampling rates [21:21:40] 3. is resolved, right? [21:21:42] they're doing 1:1 [21:22:34] leila: did not read my e-mail lemme seee... [21:22:39] k [21:22:50] if it ... ever... loads [21:23:11] Oliver, can you log both axes? [21:23:26] hist(..., log="xy") will work, I believe [21:23:34] Also, can you narrow the buckets? [21:26:31] totally and totally [21:28:54] although not if, as I just discovered, I accidentally kick my power cable [21:28:56] whoops [21:30:05] anyway [21:30:07] nuria__: I'm going to a 30-min meeting. will see you after that [21:30:13] will hack on more later! I wanna get all the session datasets just-so :D [21:30:35] leila: I am not going to do anything in the short term as i have a couple more important things i need to attend to [21:30:40] have filed bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=73388 [21:31:04] nuria__: I know kaldari has used beta for WikiGrok and we see their events [21:31:14] sounds good. thanks! [21:31:22] there are no events in bet asince august [21:31:28] *beta, sorry [21:31:48] leila: so maybe you are thinking prod? [21:36:35] Ironholds, I have the session datasets from the paper pre-sampled and ready to distribute. [21:36:47] halfak, aha, you wanna use those? [21:36:59] I was thinking of just ensuring all the hashing is tight and then distributing the bigger versions [21:37:10] (so, sampled, but less-so, I guess. We need better terms. Sampled sampled versus sampled? ;p) [21:38:58] hmm.. I could be convinced to use the big ones. [21:39:08] I figured the sampled intertime ones are safer. [21:39:16] BUT the big ones are more useful [21:39:33] Either way, I have those packaged up nicely too.
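The log-scale bucketing behind the histograms Ironholds and halfak discuss above (before reaching for R's hist) amounts to grouping values by order of magnitude. A stdlib-only sketch; the wait-time values are made up for illustration, not real registration-to-first-block data.

```python
import math
from collections import Counter

def log_histogram(values, base=10):
    """Map each value v > 0 to floor(log_base(v)) and count per bucket:
    a minimal stand-in for a histogram with a log-scaled x axis."""
    return Counter(math.floor(math.log(v, base)) for v in values if v > 0)

# Illustrative wait times in days between registration and first block.
waits_in_days = [0.5, 0.9, 2, 3, 40, 70, 500]
buckets = log_histogram(waits_in_days)
```

Narrowing the buckets, as halfak suggests, would mean using fractional exponents (e.g. `floor(log(v, base) * k) / k`) rather than whole orders of magnitude.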
:) [21:39:46] Assuming we don't need to re-hash [21:40:00] * halfak considers hashing the hash's hash. [21:40:08] halfak, https://plus.google.com/hangouts/_/wikimedia.org/dtaraborelli-ok?authuser=0 when it starts (unless you wanna come in early and chat through all of this, I guess?) [22:21:05] halfak, are your registration date tables on both stat1002 and 1003? [22:21:28] you mean analytics-store? [22:23:00] yes [22:23:09] (sorry) [22:25:40] * halfak checks [22:26:14] looks like no. Can load that in. [22:26:23] if it's straightforward. [23:44:54] leila, sorry for the delay. I'm loading the table ATM. [23:45:08] no worries. no rush. thanks!
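The "hashing is tight" concern above is about releasing session datasets without exposing raw identifiers. One conventional way to do that is a keyed one-way hash; this sketch assumes an HMAC-SHA-256 approach, and `anonymize` is a hypothetical helper — the log does not say what scheme the session data actually used.

```python
import hashlib
import hmac

def anonymize(identifier, secret_key):
    """Keyed one-way hash: the same identifier always maps to the same token,
    but the mapping can't be reversed or re-linked without the secret key."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

token_a = anonymize("user:12345", b"release-specific-secret")
token_b = anonymize("user:12345", b"release-specific-secret")
```

Keying the hash (rather than plain SHA-256) is what blocks dictionary attacks against guessable identifiers; discarding the key after release makes the mapping unrecoverable, which is one way to read halfak's joke about "hashing the hash's hash".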