[13:07:33] morning
[13:12:46] Morning :-)
[13:43:19] morning milimetric, drdee, qchris_away
[13:43:25] hey!
[13:43:29] morning
[13:43:56] milimetric: wanna have a chat ?
[13:43:59] sure
[13:44:09] hangout ?
[13:45:24] joining, one sec
[13:45:34] thank you
[14:04:10] hey qchris_away, you wanna task out cards this morning?
[14:04:22] sorry, afternoon for you :)
[14:47:35] @rss-on
[14:47:35] Rss feed has been enabled on channel
[14:47:40] False
[14:48:40] screw you bot
[14:48:56] @rss-off
[14:48:56] Rss feed has been disabled on channel
[14:54:15] qchris: you back? :)
[14:54:23] milimetric: Yes.
[14:54:29] hangout / tasking?
[14:54:38] milimetric: Didn't we say we'll do that on mondays?
[14:54:53] yes, but what do I do until then? :)
[14:54:59] I could task it out myself I guess
[14:55:06] 1 sec.
[14:55:13] but I thought it'd be a nice thing to explore together a bit, then monday is smoother
[14:55:42] I'll join you in the hangout, as I cannot make sense of our current wall. I suck.
[15:29:44] (PS1) Milimetric: abiding by flake8 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91190
[16:51:25] ok just lost 1 hour of work because OpenOffice crashes when I change the font size for TimesNewRoman to 10 and the recovered document is empty -- it's 2013 and i can't change the font size without making it crash??????
[17:01:42] dawwww, save man!
[17:01:45] you gotta save
[17:01:49] ain't no autosave in open office
[17:01:51] hehe
[17:01:58] whoa I just discovered this tool
[17:01:59] http://www.ivarch.com/programs/pv.shtml
[17:02:02] super useful
[17:02:20] shows me #lines consumed and lines per second if I pipe kafka or firehose through it
[17:08:26] pipe viewer FTW
[17:08:42] It can even give you a progress bar :)
[17:09:11] nice
[17:09:38] qchris yt?
[17:10:25] DarTar: Sorry, I am in a meeting right now (probably for the next 45 minutes).
[17:10:35] k np
[17:26:07] (PS1) Stefan.petrea: A step towards 818 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91207
[18:26:15] (PS1) Milimetric: Cleaning up Survival and Threshold [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91214
[18:26:47] average: take a look at that patchset ^
[18:26:47] https://gerrit.wikimedia.org/r/#/c/91214/
[18:27:17] I cleaned up the create_test_cohort method and used it in the threshold tests
[18:27:43] it allows you to very cleanly test time-dependent stuff by specifying the registration date relative to now()
[18:47:16] drdee! drdee!
[18:47:39] YuviPanda! YuviPanda!
[18:48:00] YuviPanda: check this out https://www.mediawiki.org/wiki/Analytics/Hypercube
[18:50:16] drdee: that bug is just to copy the raw data over for people who want to do stuff with it
[18:50:22] right
[18:50:29] we're doing sort of the same thing
[18:50:58] drdee: milimetric so what I propose to do is just have a cron that rsyncs things over every hour or so
[18:51:02] onto NFS
[18:51:06] so would "do stuff with it" be easier in mysql or flat files?
[18:51:29] definitely easier in mysql, but copying it over in flat files is a ~6 line patch :D
[18:51:31] I got it, but it seems in line with our effort, so it's worth talking about it for a bit
[18:51:34] getting that into mysql isn't
[18:51:35] sure!
[18:51:52] so far, we've written scripts that ingest it into hive
[18:52:08] it might be around 10-100 lines to move it from hive into labs
[18:52:15] and then it might be more useful
[18:52:27] people can write APIs on top of it, etc.
[18:52:44] more importantly, we're planning on cleaning this data up by using a different stream than webstatscollector
[18:52:55] so anyone downstream of us would benefit
[18:52:58] right
[18:53:05] the current bug's idea is to just make http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-10/ available
[18:53:07] on NFS
[18:53:24] so just the last month
[18:53:43] from what i know pagecounts-raw is a good bit unreliable
[18:53:43] hm hm hm
[18:54:13] yeah, our work so far diffs what's in hdfs and what's on pagecounts-raw and imports anything missing
[18:54:58] right now you can go into hive and go "select page, sum(views) from pagecounts where year=2013 group by page;" and get what you'd expect
[18:55:04] milimetric: are you starting from pagecounts-raw or are you starting from direct logs?
[18:55:10] pagecounts-raw
[18:55:15] agree with milimetric, let's join forces; if you can help write the patch to get the data from hive into labs then we have the same data but in queryable format
[18:55:29] well, let's not ram this down poor YuviPanda's throat :)
[18:55:36] +1 to milimetric :P
[18:55:42] +1 to milimetric
[18:55:42] but let's do it if it makes sense
[18:55:45] it is literally a 6 line patch
[18:55:53] and I'd *love* to have that data in a more queryable format
[18:55:54] than flat files
[18:55:56] yeah, but so is copying all of google onto labs :)
[18:56:00] question is - does that bring value
[18:56:15] well, my prime target here is toollabs in particular
[18:56:33] so my question here is primarily one of size
[18:56:46] *IF* we put this in mysql, where do you think we'd have to truncate data?
[18:56:51] 100GB? 1TB?
[18:57:16] I know the size limit for a single instance on labs is like 160 GB
[18:57:20] probably 'gradual' loss of data
[18:57:33] milimetric: pagecounts-raw will be on nfs, so there's no size limits :D
[18:57:35] but since you're talking about 3TB, I'm assuming you're not limited by those
[18:57:45] it seems a waste of scarce resources to copy the same data twice to labs
[18:57:47] no, sure, but if we try to put it in mysql
[18:58:28] would there be a limit, theoretical or practical, that you know of YuviPanda? For a Hive -> MySQL dump?
[19:02:52] milimetric: hmm, haven't worked with hive at all yet
[19:03:08] uh, sorry, you can ignore hive
[19:03:21] it's a mysql question in two parts
[19:03:25] right
[19:03:32] 1. could we get a big enough database to dump pagecounts-raw into?
[19:03:38] did you ask Ryan or Coren?
[19:03:42] 2. would that database grind to a halt or not
[19:03:43] Ryan, more
[19:03:46] right, not yet
[19:04:06] and 3T? I bet it would if you dumped it all :D
[19:04:26] right, so we were thinking daily
[19:04:44] so reducing the resolution to daily
[19:04:48] and that's like 2 lines in hive
[19:05:00] right, but that would also be a problem in mysql since you'll have to purge the old data out
[19:05:06] honestly, I don't think mysql is the right thing for this at all
[19:05:32] you were saying above you'd like to have this data queryable
[19:05:44] you think standing up an hdfs cluster on labs would work?
[19:05:56] because then it'd be *really* easy to export hdfs -> hdfs
[19:06:18] yeah, if we have a publicly available hive / hdfs thing, that'll be *great*
[19:06:22] much better than mysql
[19:06:22] and we have the cluster, andrew's built it already
[19:06:34] well, i mean, it's a lot slower, but ok
[19:06:45] i wouldn't use the existing labs cluster for this, but it'd be trivial to fire up a new one
[19:06:56] i don't think having the 3TB of flat files on NFS would do anyone any good, honestly
[19:06:56] the existing one is kinda messy and used for prototyping changes to the production analytics cluster
[19:07:09] ok, that sounds cool
[19:07:18] we are fully aware of the scalability concerns but we just want to get something out there and see what types of queries people are running
[19:07:44] well, i mean, let's maybe start with hive in labs
[19:07:46] see how that goes
[19:07:49] +1
[19:07:53] we can log those queries just the same
[19:08:18] laters all!
[19:08:35] milimetric: i'd really like to make it available from tools, since tools already has a lot of users
[19:08:39] ok YuviPanda, when ottomata comes back, I'll spin up a cluster with him and we'll import what we have there
[19:08:44] ok
[19:08:48] i'll still write the rsync anyway
[19:09:12] far be it from me to get between a man and his rsync :)
[19:09:27] but thanks for the input
[19:09:31] https://gerrit.wikimedia.org/r/#/c/91217/
[19:09:31] my goal:
[19:09:55] ssh tools-login
[19:09:55] hive pagecounts
[19:10:05] bing!
[19:10:07] hive prompt into pagecounts-raw>
[19:11:07] k, I gotta run grab my car from the shop
[19:11:10] ttyl
[19:42:38] milimetric: looking at the patchset https://gerrit.wikimedia.org/r/91214
[19:42:57] cool
[19:43:08] lemme know what you think of create_test_cohort
[19:43:15] I added user_registrations
[19:43:31] and check out how I used it in test_threshold
[19:48:55] so average, I still have to write the documentation for time_to_threshold. Namely, I'm writing the inspiration SQL in the class __docs__ of Threshold
[19:49:06] When I submit that and I'm ready for you to merge this, I'll ping you
[19:52:09] ok
[20:02:43] (PS2) Milimetric: Cleaning up Survival and Threshold [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91214
[20:03:07] ok average, I added the query that should theoretically get time_to_threshold
[20:03:39] here's a link to a test of the trick I used: http://sqlfiddle.com/#!2/b6f502/8
[20:04:43] looking
[20:05:42] wow, sqlfiddle, I didn't know that existed
[20:14:56] milimetric: hey that's a really interesting and smart query, so rank is sort of like a chronological index of the revision
[20:15:10] right
[20:15:17] you count how many revisions are greater than that revision, and you get the rank
[20:16:12] now getting the difference between rank 1 and rank number_of_edits in the time interval would yield the number in hours ?
[20:16:48] I mean it would compute threshold_hours
[20:16:58] milimetric: is that correct ?
[20:17:20] wait, let me rephrase that
[20:17:45] the difference between rev_timestamps of the revision with rank number_of_edits and rank 1 would be threshold_hours
[20:19:12] * average is going to use sqlfiddle to try to add this to the query
[20:31:33] no average
[20:31:56] the difference between user_registration and rev_timestamp* in hours is time_to_threshold
[20:32:09] where rev_timestamp* is the nth revision where n is number_of_edits
[20:32:36] and so I'm doing that subquery to get all the cohorts' users' revisions ordered by timestamp and ranked
[20:33:07] it's a stupid hack really because in sql server I just do row_number() over (partition by rev_user order by rev_timestamp)
[20:33:19] and ^^ that is about 100 million times faster and better
[20:33:28] but meh, we're in crappy mysql world and we have to live with it
[20:34:20] YuviPanda: 14 lines of code, twice as many as you predicted
[20:34:24] https://gerrit.wikimedia.org/r/#/c/91293/
[20:35:32] milimetric: ah alright
[20:35:36] grr, postgresql has row_number / partition
[20:35:43] goddamn it, why does anyone use this mysql crap
[20:35:45] I haven't been able to reproduce that
[20:36:09] milimetric: maybe it's in mariadb ? they diverged a bit since oracle acquired it
[20:36:20] oh no, it's mysql
[20:36:32] their goal in life is to ruin all that's good and holy about ANSI SQL
[20:36:42] mysql and all its stupid children
[20:36:52] seriously, it's a database for people who hate SQL
[20:37:01] * milimetric takes breath, puts down gun
[20:37:17] drdee: we made a new nfs mount point, that shouldn't count! :P
[20:37:28] 6 == 6
[20:37:56] oh, not to mention mysql doesn't have recursive CTEs. UGH!
[20:37:59] * milimetric picks gun back up
[20:40:44] I was actually reading through DuBois' book over the weekend
[20:41:28] milimetric: I saw this on SO => SELECT *, (@row := @row + 1) AS rownum [..]
[20:41:41] yeah, another horrible hack :)
[20:41:42] milimetric: can we use that instead of ROW_NUMBER ?
[20:41:52] no, the hack I used is a little better I think
[20:41:58] for large cohorts
[20:41:58] ah ok
[20:42:04] I *think*
[20:42:14] I have no intuition on hacks as I refuse to acknowledge they exist
[20:42:45] mysql hacks that is, my own hacks are beautiful :)
[20:45:36] * YuviPanda screams Coco then runs away fasssst
[20:47:15] lol
[20:47:21] Coco has nothing to do with me
[20:48:16] I just got used to it and since I wrote 99.9% of the code after David left, I kept it. We'll soon have a new master in the house and they will decide the fate of Coco
[20:48:22] I will look on with indifference :)
[20:49:52] hey, I ran away! :P
[20:50:10] just trolling unproductively, milimetric. Last one, I promise
[20:50:40] I was being very productive with my MySQL rant :)
[20:50:52] considering I probably offended like literally half the internet :)
[20:51:14] what, no
[20:51:16] everyone uses mysql
[20:51:20] and also agrees that it's shit :P
[20:51:22] just like PHP
[20:51:49] it is usually a 'meh, it works for me!' rather than a 'no you are wrong mysql is awesome'
[21:13:55] average: I was going to do time_to_threshold
[21:14:21] are my patchsets ok to merge or did you have comments?
[21:16:31] milimetric: looking one last time at the query
[21:16:45] well, i mean, the rest of it too - tests, docs, etc.
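To make the earlier pagecounts thread concrete: the "2 lines in hive" daily rollup could look roughly like the sketch below. This is a minimal illustration only — the "pagecounts" table with "page" and "views" columns and year/month/day partitions is an assumed layout, not the actual cluster schema.

    -- Hypothetical HiveQL rollup of hourly pagecounts rows to daily
    -- resolution; all table and column names here are assumptions.
    CREATE TABLE pagecounts_daily AS
    SELECT page, year, month, day, SUM(views) AS views
    FROM pagecounts
    GROUP BY page, year, month, day;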
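And the rank trick being discussed here, spelled out: a sketch, not the actual wikimetrics query from the patchset, assuming MediaWiki-style "user" and "revision" tables and an illustrative threshold of 5 edits.

    -- Self-join rank hack (MySQL has no ROW_NUMBER() window function):
    -- a revision's rank is the count of the same user's revisions at or
    -- before it (ignoring timestamp ties). Keep only the rank-5 revision
    -- and measure hours from registration to it; the literal 5 stands in
    -- for number_of_edits.
    SELECT u.user_id,
           (UNIX_TIMESTAMP(r1.rev_timestamp)
             - UNIX_TIMESTAMP(u.user_registration)) / 3600 AS time_to_threshold
      FROM user u
      JOIN revision r1 ON r1.rev_user = u.user_id
     WHERE (SELECT COUNT(*)
              FROM revision r2
             WHERE r2.rev_user = r1.rev_user
               AND r2.rev_timestamp <= r1.rev_timestamp) = 5;

Users who never reach the fifth edit simply drop out of this result; a LEFT JOIN variant would return NULL for them instead, which is the "threshold = time_to_threshold is not null" idea that comes up later in the log. The "@row := @row + 1" session-variable approach average found on SO would compute the same rank in a single pass over the sorted rows.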
[21:17:23] milimetric: I was looking at line 60 of threshold.py
[21:18:04] milimetric: tables "rev" and "revision" aren't in the FROM clauses
[21:19:12] checking
[21:19:58] the tables available in FROM clauses are the following right now: "ordered_revisions", "r1", "r2", "user"
[21:20:47] (PS3) Milimetric: Cleaning up Survival and Threshold [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91214
[21:20:51] updated average ^^
[21:21:01] lookig
[21:21:03] *looking
[21:22:27] oh YuviPanda, I saw that we have a new Cassandra playground
[21:22:33] what about inserting pagecounts data there?
[21:22:48] that's certainly a good way to stress test it, no?
[21:23:22] milimetric: (unix_timestamp(rev_timestamp) - unix_timestamp(rev_timestamp)) / 3600 == 0 ?
[21:23:57] milimetric: in prod?
[21:24:08] milimetric: probably, yeah. poke Gabriel?
[21:24:13] I have to read up on Cassandra too
[21:24:15] so many things to do...
[21:25:11] oh, if it's in prod, it doesn't lend itself to easy public access. We'll just put it in hive for now
[21:25:20] yeah
[21:26:14] average: you mean on line 74?
[21:26:51] we need to compare to threshold_hours
[21:27:04] and when I say hours, I'm talking loosely; it would obviously have to get converted to seconds
[21:27:06] milimetric: line 60
[21:27:31] no, line 60 is just computing the actual time_to_threshold in hours, if the threshold is reached
[21:28:54] it's saying:
[21:29:01] milimetric: yes but you're subtracting the same column from itself, so you'll get 0
[21:29:08] lol
[21:29:09] doh
[21:29:12] :)
[21:29:33] (PS4) Milimetric: Cleaning up Survival and Threshold [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91214
[21:29:35] updated
[21:29:40] ugh
[21:29:41] sorry
[21:29:56] (PS5) Milimetric: Cleaning up Survival and Threshold [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91214
[21:29:59] had it backwards
[21:30:44] looks good now, gimme 2 more minutes, wanna go through the tests again
[21:31:31] cool beans
[21:31:51] do you like the general setUp / now() - relative logic though?
[21:33:56] oh, that one looks really nice
[21:34:10] we can reuse it in more of the tests now
[21:35:51] totally, yeah, especially in the time_to_threshold tests you're writing
[21:36:19] you can almost just copy the tests I wrote and change the asserts to hours / None from True / False
[21:36:40] oh duh - or just add asserts to those tests :)
[21:37:53] huh, crazy
[21:37:59] is there a way to use create_test_cohort to make the users have different registration dates ?
[21:38:00] so actually you know our task etherpad?
[21:38:03] maybe silly question, still reading
[21:38:10] oh no, not silly
[21:38:12] and yes, you can:
[21:38:32] oh yeah, it's an array right ?
[21:38:34] user_registrations=[reg1, reg2, reg3]
[21:38:35] yep
[21:38:40] you turn it into an array if it's not a number
[21:38:48] if it *is*, yes
[21:38:52] yeah sorry
[21:38:59] yep, same as the other arrays
[21:39:13] just in this case it's one-dimensional not two-dimensional
[21:39:46] have only one comment
[21:40:33] in the old tests, the user registration dates are different, you could take them from there and use them in the current tests so they still pass
[21:40:40] so, average, I was thinking: 4 and 6 on our task wall are the same thing
[21:40:40] this metric is basically:
[21:40:40] compute time_to_threshold
[21:40:40] threshold = time_to_threshold is not null
[21:41:13] true
[21:41:13] no, the tests won't pass because threshold now returns 'not implemented'
[21:41:31] yes, so we'll drop the tests and rewrite them
[21:41:44] no, the tests are great
[21:41:53] they *should* pass once we implement it
[21:42:08] you should just add asserts for time_to_threshold and put the expected values
[21:42:57] they won't right now, because of line 24 in test_threshold.py
[21:43:35] they all have the same registration date
[21:43:43] why's that bad?
[21:43:44] No one told me so I was forced to assume which way to do that
[21:43:48] yes, but in the old tests the users had different registration dates
[21:43:59] lemme link the line with that, moment
[21:44:23] the new tests are totally different though
[21:44:34] they're written based on the definition of the metric and the new setUp
[21:44:46] https://github.com/wikimedia/analytics-wikimetrics/blob/master/tests/fixtures.py#L732
[21:45:01] registration_date_dan = format_date(datetime(2013, 1, 1))
[21:45:04] registration_date_evan = format_date(datetime(2013, 1, 2))
[21:45:04] oh right, but I don't inherit from that
[21:45:07] registration_date_andrew = format_date(datetime(2013, 1, 3))
[21:45:13] I'm inheriting from DatabaseTest
[21:45:26] not DatabaseWithSurvivorCohortTest
[21:45:34] I see the disconnect now
[21:45:55] yeah, so that whole class can basically be replaced with create_test_cohort now that it has user_registrations (sorry I didn't do it earlier)
[21:46:04] but you know, I get less stupid every day :)
[21:47:27] so right now, the expected values of the asserts are based on the old registration dates or the new ones ?
[21:47:44] everything's based on the new setUp
[21:47:57] the tests are isolated completely from the old logic
[21:48:12] and should pass once threshold is implemented correctly
[21:48:13] aaah, alright, everything looks good
[21:48:24] I can +2
[21:48:27] cool - then feel free to hit that +2
[21:48:46] and then the question is, can I work on 4&6?
[21:49:03] and you can finish 5
[21:49:23] I agree
[21:49:27] then tomorrow you can review my 4&6, fix it if it needs fixing, and merge
[21:49:32] ok +2 , and I'll work on 5
[21:49:34] and we can work together on tasking 818
[21:49:40] yes
[21:49:42] cool
[21:49:43] that sounds awesome
[21:49:45] :)
[21:50:14] (CR) Stefan.petrea: [C: 2 V: 2] Cleaning up Survival and Threshold [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91214 (owner: Milimetric)
[21:50:23] thanks
[21:50:54] did it get merged? I think I did something wrong
[21:51:02] and only published comments instead of submitting as well
[21:52:20] oh you can do it again and click "publish and submit"
[21:52:25] milimetric: how can I find out why it says "Can Merge: No" ?
[21:52:26] The code is compiling
[21:52:31] oh
[21:52:35] uh... no idea
[21:52:41] maybe you're not allowed to merge? :(
[21:53:00] you can ask drdee, and maybe I'll merge it for now
[21:53:20] oh!
[21:53:20] no
[21:53:38] it might be because it depends on this: https://gerrit.wikimedia.org/r/#/c/91190/
[21:53:45] average, maybe try to merge that ^^
[21:53:47] (CR) Stefan.petrea: [C: 2 V: 2] Cleaning up Survival and Threshold [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91214 (owner: Milimetric)
[21:54:00] milimetric: aaa
[21:54:03] ok looking
[21:54:30] that was just drdee using common sense spacing. Silly Doctor, that's not allowed on wikimetrics
[21:54:35] here, flake8 is king!
[21:54:45] welp
[21:54:49] that's not pep8 at all
[21:54:53] it is
[21:55:04] flake8 is ALL standards
[21:55:13] no spaces around equal signs when passing parameters to functions
[21:56:03] ALL == pyflakes + pep8
[21:56:22] just run flake8 before submitting, it's easy
[21:56:23] * average is having a look at 91190
[22:04:12] milimetric: 91190 looks good, I'll +2 it
[22:04:59] 91190 contains some spacing adjustments for flake8 to be happy, that's my understanding
[22:05:29] (CR) Stefan.petrea: [C: 2 V: 2] abiding by flake8 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91190 (owner: Milimetric)
[22:06:15] ok, change got merged
[22:06:19] sweet
[22:06:22] :)
[22:06:48] I'm working on 4&6, I think it'll be pretty fast
[22:07:09] but I have to run out in a bit to meet my wife at the station
[22:07:23] dude, it's like 01:00 for you, you should sleep!
[22:07:23] :)
[22:07:27] true
[22:08:15] the dude abides
[22:08:49] ok, I should be off, and I'll be working on 5 tomorrow
[22:08:57] laterz
[22:10:12] cool, have a good night
[22:12:40] you too
[22:12:41] ttyl
[23:04:09] hey tnegrin
[23:04:22] are you in the office or joining remotely?
[23:04:42] (PS1) Milimetric: not yet ready to merge, just checkpoint [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/91309
[23:05:00] nite everyone