[00:03:07] DarTar, we're sure /public-datasets/ is still syncing, right? [00:06:30] aaand today's gonna be another long one. [00:10:42] Ironholds: it should be [00:11:00] but it might be affected by replication lags, i.e. the data is late, not the syncing [00:11:14] gotcha. I created a new directory, yesterday; nada. [00:11:17] (if we’re talking of data from the slaves) [00:11:23] no? [00:11:29] is that on stat1003? [00:11:32] I mean http://datasets.wikimedia.org/public-datasets/ [00:11:37] or stat1002 [00:11:43] 1002 [00:11:48] ha, that’s why ;) [00:11:56] ...stat1002 doesn't have syncing?! [00:12:02] it doesn’t afaik [00:12:09] it has a public-datasets folder, so it should. [00:12:16] *shrug* [00:12:30] I don’t know what happens with that folder, but yes it’s confusing [00:12:43] I'll check with the engineers in the morning and resist the urge to throw things in the meantime. [00:12:53] anyway, the URL above pulls data from a directory sync’ed with stat1003 [00:12:56] Because if it doesn't have syncing, that translates to "all of the stuff apps wants can't go anywhere they can access it" [00:13:07] yeah, I know. It should also be syncing with a directory synced with '2 [00:13:13] *directory on [00:13:17] I wouldn’t be surprised (but I’m sure they can figure out a solution) [00:13:30] stat1003 syncs every 30 mins afaik [00:13:51] okay. That's a week of work that's currently futile and blocked, then. [00:14:02] nah, simply not sync’ed ;) [00:14:04] I'm going to go for a smoke, do what I can on other projects, and bother otto tomorrow. [00:14:32] when I'm being told the project is super-urgent and I can't deliver the results to the customer, having spent a week making sure we can get those results: it's blocked ;p [00:14:41] keep me posted, I’d love to hear if they can set up something magic to teleport data [00:15:01] it’s blocked until we escalate it to the devs [00:15:11] ottomata can make things happen [00:15:19] indeed!
[00:15:27] hence: smoke, work on other things, bother him when he's conscious. [00:15:49] yup and worst case scenario deliver 3Gb of data dumps by email to our customers [00:16:09] no, because they want a daily update. [00:16:17] every day, the code must run, and not break, and update a file. [00:16:23] 3Gb per day [00:16:25] wfm [00:16:30] more like 230b [00:16:46] right [00:17:21] copy me if you ping Andrew (and maybe add Toby too so he can prioritize dev/ops work) [00:17:33] gotta run now, good luck [00:17:41] ttyl [01:23:06] Hey Ironholds. [01:23:16] Want to move some data from stat2 to stat3? [01:23:35] halfak, we're good; looks like it's gonna be resolved at the puppet end :) [01:23:56] OK Cool. Note that you can always use scp. [01:30:51] Or sshfs [01:31:03] (if it is installed) [01:49:32] halfak, totally! [02:04:17] halfak, also, extra points if we can work Seven Minutes to Midnight into the title of the paper. [02:05:26] There's a title's document. You can add it. :) [02:05:31] https://docs.google.com/document/d/1mEMLtv9SiRllirzgggDWZr5ttaNT1h4ENSjlsmBAe2c/edit?usp=drive_web [02:07:26] Putting it into the title is hard. [02:09:04] Ironholds, ^ [02:14:50] halfak, and done! [02:16:51] "sessionization" ... really? [02:17:18] We're changing that when we become the paper that everyone cites. [02:17:41] agreed [02:17:46] and, yes. It makes me sad :( [02:22:34] OK. I am off. G'night folks. [02:23:01] take care dude! [02:40:26] halfak: ping [02:41:06] Hey gwicke! [02:41:09] hey! [02:41:28] I have a quick and sadly not very researchy question [02:41:42] Cool. What's up? [02:41:52] do you know who I should bug about becoming a member of the wikimedia group on github? [02:42:19] saw that you are a member, so figured you might know how that happened ;) [02:42:46] Good Q. I must have an email somewhere about this. 
[02:42:48] * halfak looks [02:42:49] background is that I'd like to set up tests against different node versions for parsoid [02:43:03] which travis makes pretty easy [02:43:29] thx! [02:44:22] maybe subbu will ask you the same question in person tomorrow ;) [02:44:26] Looks like I can invite you. [02:44:42] ah, that'd be nice [02:47:21] Invite sent. [02:47:49] So, gwicke. I was meaning to talk to you about restbase. [02:47:53] Got a minute? [02:48:19] sure [02:48:29] we were about to have dinner [02:48:46] so actually, would it work for you 30 minutes from now? [02:48:46] Meh. get outa here. I'll read up a bit more and bug you later. :D [02:48:58] tomorrow morning is good too [02:49:07] I'll be here :) [02:49:10] * halfak --> [02:49:11] dinner [02:49:15] cool, thanks! [02:49:35] * subbu pops in and sees both of them heading to dinner [13:21:42] Ironholds, https://www.youtube.com/watch?v=03lenUNQAf8 [14:04:38] morning! [14:05:02] halfak, listening. Confused. [14:05:14] wait, I get it! Bahaha. [14:05:19] :D [14:06:08] I had to listen to the song. [14:06:15] Can't understand a word. [14:08:50] hehe [14:09:01] also, I'll have the session dataset queries launched by EOD [14:09:09] just waiting for all the traffic from 30/10 UTC to come through [14:09:16] great! :) [14:09:27] I spent my it's-11pm-and-I'm-waiting-for-data time last night making sure all the sanitisation code works. It does! [14:09:38] I'm planning to take a look at the AOL data and the related work stuff tonight. [14:09:50] we're looking at all events in the last 30 days from 100k unique appIDs, mobile IPs and desktop IPs [14:10:07] unfortunately because of SSL the actual number of people captured will vary widely, but eh. [14:10:17] Heh. 
I spent my 11pm-and-waiting-for-data time on stupid mistakes in my SQL for resolving pageviews across redirects -- and it still failed :( [14:10:38] I'm going to build a little table in the readme of "data class" - "file" - "uniques" - "unique_id_type" - "hashing method" - "event_count" [14:11:41] Ironholds, sounds great. I'm thinking that it would be great to bundle a dataset of inter-activity times with the paper. [14:11:59] word! [14:12:03] get a DOI on that shiz. [14:12:03] What do you think about trying to release a dataset containing a hashed ID and timestamp for pageviews? [14:12:29] would you believe me if I told you I'd already structured the sanitisation code to enable us to do this, with that in mind? [14:12:47] uuid - POSIX_timestamp - type [14:13:10] all written as TSVs, unquoted, so reusers can throw them straight into SQL or similar systems if they see a need. [14:13:20] Woo! :) [14:16:06] along the way I wrote a function to parse app UUIDs out of URLs. Because: ffffuu. So, productive evening. [14:21:04] Ironholds, do you think that hashing those UUIDs with a salt will be sufficient anonymization? [14:21:41] I'm a little worried that might not be enough, but then again, what can you do with just the page access timestamps and no title/url? [14:21:42] halfak, I hope so! So, here's how the hashing implementation I've got so far works [14:21:59] for readers on mobile web/desktop, we hash IP, UA and accept_language variant. [14:22:08] we need to do that to generate a uuid anyway, so cool. [14:22:35] for apps and editors, we're hashing the uuid/userid with the date-time at the start of the hashing run to fuzz it [14:22:49] I'm debating using an actual RNG to make it less predictable. [14:23:59] thoughts? [14:25:52] Ironholds, for a salt, it seems like a random number would be better since the date-time could be guessed. 
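The salting scheme discussed above can be sketched in a few lines. This is a hypothetical illustration, not the actual sanitisation code: the function name, the choice of SHA-256, and the 16-byte salt size are all assumptions. The one load-bearing idea from the conversation is that the salt comes from a random source rather than the run's start time, so it can't be guessed from knowing when the queries were launched.

```python
import hashlib
import secrets

# One random salt per sanitisation run. A fresh random value (rather
# than the run's start date-time) keeps the salt unguessable even if
# an attacker knows roughly when the run happened.
RUN_SALT = secrets.token_bytes(16)

def anonymize_id(raw_id: str, salt: bytes = RUN_SALT) -> str:
    """Return a salted SHA-256 digest of an identifier: an app UUID,
    a user id, or a concatenated IP/UA/accept-language string."""
    return hashlib.sha256(salt + raw_id.encode("utf-8")).hexdigest()
```

The same raw identifier hashes consistently within one run (so sessions still line up), but the digests from different runs can't be joined against each other.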
[14:26:05] yep [14:26:07] and within a narrow range [14:26:24] if you know we only have 31 days of logs, there's only 24*60*60 options for when the queries could've been launched [14:26:51] heh. Fair point, but 86400 isn't that big of a number. [14:29:15] halfak, yeah, I agree! [14:29:27] okay, so I've replaced it with "grab a random value from the random number source" [14:29:36] which is sort of circular, but works pretty well! [14:29:51] AHhh! MySQL's LOWER() silently does nothing for VARBINARY fields. [14:32:35] oh yeah, that's fun [14:32:43] and BETWEEN doesn't work on negative floats [14:33:01] What? Why not? [14:33:07] at least, not directly. You have to do BETWEEN MIN(val1, val2) AND MAX(val1, val2) [14:33:32] something with how it's called means it gets sad if it has to process the float. I had some GREAT goddamn fun with the geotag lat/lon fields for Brent. [14:33:34] Hmm... Looks like it works fine for me. [14:33:40] huh; interesting. [14:33:56] try restricting to a range of lat/lons with negative values [14:33:57] https://gist.github.com/halfak/e2b145c2f10342a7357a [14:34:01] or maybe it's how the data is stored.. [14:34:14] Oh woops. I didn't have my terms as floats. [14:34:52] Yeah. Still works. I just updated the gist. [14:35:28] Oh wait. I see what you're talking about with MIN/MAX(val1, val2) [14:35:46] BETWEEN is ordered, so it expects the first value to be lower and the second to be higher. [14:35:53] yup [14:36:20] and so if you're instead passing it something to be evaluated it gets confused and throws its toys out of the pram [14:36:35] still, The More You Know *rainbow*. And now I know how to get around that. [14:37:38] and now to see if my session code breaks stat1002... [15:13:37] halfak: hola [15:13:44] Hey nuria__ [15:14:12] halfak: what table has the tags for mobile /visual editor split? [15:15:23] change_tag and tag_summary [15:15:35] change_tag has one row per tag.
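The BETWEEN behaviour debugged above (the bounds are ordered, with the first expected to be the lower) is easy to reproduce. Here is a minimal sketch using Python's bundled sqlite3 as a stand-in for MySQL; note that in MySQL itself the scalar equivalents of min/max are LEAST() and GREATEST(), since MIN()/MAX() are aggregate functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE points (lat REAL)")
cur.executemany("INSERT INTO points VALUES (?)", [(-45.0,), (-10.0,), (20.0,)])
conn.commit()

# BETWEEN expects (low, high). With negative bounds it's easy to pass
# them in the wrong order, and the query then silently matches nothing.
lo, hi = 10.0, -30.0  # bounds arriving in an unknown order
wrong = cur.execute(
    "SELECT COUNT(*) FROM points WHERE lat BETWEEN ? AND ?", (lo, hi)
).fetchone()[0]
# Normalizing the bounds (client-side here; LEAST/GREATEST in MySQL)
# recovers the intended match: -10.0 falls in [-30.0, 10.0].
right = cur.execute(
    "SELECT COUNT(*) FROM points WHERE lat BETWEEN ? AND ?",
    (min(lo, hi), max(lo, hi)),
).fetchone()[0]
```

With the bounds swapped the count comes back 0; normalized, it finds the one row in range.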
[15:15:40] halfak: thanks [15:16:00] tag_summary has one row per revision with tags concatenated [16:19:00] Holy crap! You can do *fast* random sampling with unix "shuf" [16:19:48] Randomly sampling 100k rows from a 28 million line file takes about 1 second. [16:21:54] hmm, should look at the source [17:16:09] coreutils is amazing [17:16:36] FSF's definition of "core" is even broader than WMF's definition of "core" for their budget [17:17:38] Core Features maintains the core features of coreutils [18:06:31] Hey Ironholds. I see a beautiful session cutoff in the AOL data :) [18:06:44] It's the best I have seen so far! [18:08:55] halfak, huh! Feitelson dun goofed? [18:09:08] I think so. [18:09:11] Wooo Science [18:09:20] Let us churn and improve our knowledge. [18:10:24] yay! [18:10:37] say, is anyone in the group a dedicated stats nut? [18:10:53] the results are so consistent I'm getting paranoid we're just somehow mathing wrong [18:11:00] Dan Kluver [18:11:27] He's been working with analysis strategy for over a year (on and off) and has not raised a concern. [18:11:33] But yeah, I thought the same thing. [18:11:45] I keep wondering if I just forgot to switch datasets. [18:11:59] But in the project dir I'm working in right now, I only have AOL data. [18:12:17] haha [18:12:18] Also, Dan did his work 100% independent from me for MovieLens. [18:12:21] awesome! [18:12:23] :) [18:12:33] eeehee [18:12:37] this could be big. Like, really big. [18:12:51] like, Priedhorsky et al big. BIGGER THAN THAT. [18:13:50] :D [18:14:01] This is why we must publish before someone else does! [18:21:49] halfak, remember the mathematician's song! [18:22:58] halfak, https://www.youtube.com/watch?v=gXlfXirQF3A [18:30:22] Heh. :) I forgot about this song. [18:33:06] tch. Tom Lehrer is wonderful.
[18:35:33] halfak, new paper title [18:35:42] Threshold Me Closer Tiny Delta [18:35:55] ...I don't know what it would be about, but goddamn it I'll make something up [18:36:04] Lol [18:41:31] Ironholds, https://commons.wikimedia.org/wiki/File:Inter-activity_time.aol_searches.svg [18:46:30] eeheee [18:46:33] * Ironholds dances [18:46:35] this is gonna be so fun [19:11:53] leila, /awesome/ feedback. [19:12:01] (not sarcasm. It's really good feedback!) [19:29:55] halfak: just an FYI, postgres access to all tools on track to be available sometime next week :) [19:30:56] Woot. That'll work for me. [19:31:02] Thanks for the update YuviPanda [19:31:08] halfak: :) [19:43:41] Nettrom, your work with page views has been successfully replicated. :) [19:43:51] It's beautiful :D [19:44:01] https://scontent-a-lga.xx.fbcdn.net/hphotos-xap1/v/t1.0-9/s720x720/1609968_10152851651916255_6652371154395562127_n.jpg?oh=0b0db2d22505d02a8a38d914dd289863&oe=54F5A24C modelling the 2015 R&D office wear [19:44:38] halfak: awesome! [19:44:52] the "run GCC" T-shirt appeases engineers, the respirator has an anecdata filter, and the coat is just because the databases are cold and uncaring and I didn't want frostbite. [19:44:58] you'll be seeing it everywhere at CHI [19:45:59] Why do you have a respirator? [19:46:06] Ironholds, ^ [19:46:28] it has an anecdata filter! [19:46:38] also, I was spraypainting my Halloween costume and I choose life. [19:47:36] Good call [19:47:37] :) [19:47:51] I have a problem with uploading figures to commons. [19:48:03] When I fix a graph, I'd like to re-upload over the old name. [19:48:14] But then my notes on the old graph don't make sense. [19:48:34] can you not link to specific revisions of the image? [19:48:56] I don't know about that feature, if it exists. [19:51:18] hmn. example image?
[19:51:45] Ironholds, https://commons.wikimedia.org/wiki/File:View_rate_density.by_wikiproject_importance.svg [19:56:30] halfak, so https://upload.wikimedia.org/wikipedia/commons/archive/1/14/20141028201330!View_rate_density.by_wikiproject_importance.svg [19:56:30] ? [19:56:41] click on the image in "file history", link to that. [19:56:59] Oh. I'd like that image to show up in my work log though. [20:00:00] halfak, ahh. hmn. [20:00:46] I do not know how to do that. Womp womp :( [20:01:05] Yeah... I don't think it is possible. [20:01:21] This might be a good feature request for Extension:Graph. [20:01:22] probably a feature, actually [20:01:33] imagine if you could embed vandaltastic versions of images by version_number [20:02:03] Hmm... Not sure how that would be any different from regular vandalism. [20:41:23] T minus 3 hours, 20 minutes! [20:41:35] qchris, hive syncs with the brokers every 15 minutes, right? [20:41:52] s/*/imports from the brokers every 15 minutes, right? [20:42:09] Ironholds: IIRC a bit more often. but roughly ... yes. [20:42:23] but that is not hive, but hdfs. [20:42:42] Data is added to hive only once an hour [20:42:59] aha [20:43:02] perfect! Thank you :) [20:43:06] But if you need data sooner, you can create your own external table, [20:43:13] and add partitions manually there. [20:43:27] Then you can access data as it gets imported to hdfs [20:43:37] (Like new data every ~10 minutes) [20:48:43] qchris, we're good! I just wanted to work out the earliest point I could launch some queries [20:48:56] 30-day span, so they've got 24 hours to run (ish). Earlier launches means more time for do-overs. [21:02:25] DarTar, 1:1? [21:04:18] yep, coming [21:04:24] finishing with toby [21:05:30] Ironholds: alright got a room [21:05:36] kkk [21:05:41] ..wow, bad typo [21:06:05] :) [21:09:06] Deskana, can you reply on trello? [21:10:38] Done. [21:11:19] ta! [21:28:17] DarTar, where should we keep documentation of claim analysis? 
[21:28:27] WikiGrok [21:29:45] maybe Ironholds knows the answer to this question. [21:30:10] Ironholds, where do you document your work for WikiGrok? [21:30:50] leila, I don't do work for wikigrok? [21:31:20] hookay. figured that may be the case. thanks! [21:31:48] leila: create a subpage maybe? https://meta.wikimedia.org/wiki/Research:Mobile_microcontributions/WikiGrok or https://meta.wikimedia.org/wiki/Research:Mobile_microcontributions/Missing_Wikidata_claims [21:32:25] okay, cool. thanks, DarTar [21:38:54] Nettrom, can barely move today. [21:39:23] :) Was some good squash. Now let's see if I can heal before next Wed. [21:42:54] halfak, results? ;-) [21:44:02] I got Morten for this set, but you would have never guessed it if you saw me walk (limp?) off the court. [21:44:37] It's been about 4 months. [21:44:58] Luckily, it seems that my shot didn't suffer as much as my fitness :S [21:48:31] halfak: yeah, great games! Heal well, I'll let you know how my tournament schedule lines up with playing next week [21:49:11] Oh yeah. I forgot about that. What level are you entering at? [21:49:15] 4.0 [21:49:28] Awesome! [21:49:37] Is it at UMN? [21:49:41] I played in the "B" (4.0-5.0) level at the Indian Summer tournament, seemed to be a good fit [21:50:00] no, this is at the Lifetime downtown, Minneapolis club, and Lifetime in St. Louis park [21:50:11] the U one is in early Dec [21:50:34] Gotcha. [21:52:46] I think you should keep playing every 4 months, then I might be able to beat you ;) [21:53:59] * halfak flexes squash muscles and then recoils from the stiff pain. [21:59:21] good job, halfak. ;-) [21:59:45] this item should make it to the next standup [22:00:06] lol [22:00:32] For weeks where Morten and I get to play, we can update the group on the results at the Friday standup. [22:00:44] This is good. It legitimizes regular exercise. [22:01:04] :D [22:01:59] halfak: still with Ironholds, running a bit behind [22:02:20] No worries. 
[22:05:53] halfak, leila: :D [22:06:48] I wonder if it'll end up being a rule that you'll have to beat your manager at squash to be able to quit [22:06:53] or something [22:06:58] anyways, I gots to go [22:07:04] night everyone! [22:12:43] For the curious, beating your advisor at squash before defending your thesis is a pseudo-requirement in GroupLens [22:29:22] dammit [22:29:23] is nickserv down? [22:56:42] Ironholds, nope [23:41:51] halfak, you wanna talk briefly about search logs? [23:42:01] Hey! yeah sure. [23:42:08] lemme grab headphones [23:42:26] Call when ready :) [23:42:38] Actually... I need a power cord. BRB