[00:31:12] halfak: https://meta.wikimedia.org/w/index.php?diff=10765273 [00:34:28] Helder: https://meta.wikimedia.org/w/index.php?diff=prev&oldid=10765287 [00:36:35] thanks halfak! [00:37:31] BTW: doesn't saving this on-wiki make it CC-BY-SA? (instead of a proper license for code) [00:39:38] Meh. it's a snippet of pseudo-code [00:39:43] I wouldn't worry about it. [00:40:06] I'm off for the evening. Have a good one. o/ [00:42:30] you too [15:16:52] halfak: good morning! [15:17:02] Hey! [15:17:08] good morning :) [15:17:52] how'd the streaming experiments go? [15:18:03] I'm struggling to find a way to measure CPU usage. [15:18:13] I thought I could do it with hue, but I can't find anything about it. [15:18:27] Also, it looks like I'm only using one reducer. That might be a problem. [15:19:20] with json too? [15:19:38] hm, halfak, do you have an application_id of one you ran? [15:19:50] 1415917009743_41120 [15:20:03] This one was uncompressed avro [15:21:12] halfak, try job history server: [15:21:16] ssh -N bast1001.wikimedia.org -L 19888:analytics1010.eqiad.wmnet:19888 [15:21:18] then [15:21:20] http://localhost:19888/jobhistory/job/job_1415917009743_41120 [15:21:43] hmm, still not quite what we are looking for [15:22:34] I'm thinking something like "time spent processing" [15:22:37] yeah [15:22:41] CPU seconds? [15:22:43] cumulative cpu [15:22:45] is how hive reports it [15:22:48] cool [15:23:29] Say, what's the right way to shut down an ssh tunnel? [15:23:31] ^C? [15:23:58] ja that's fine [15:24:10] Hmm... I seem to not be able to start new tunnels. [15:25:18] Yup. Can't even tunnel to hue anymore. [15:25:53] that's strange [15:27:02] Indeed. I was having issues with this before :( [15:27:08] Could be something in my ssh is borked. [15:27:22] Are you finding cumulative CPU in job history? [15:28:34] ah! 
[15:28:34] found it [15:28:41] mapred job -status job_1415917009743_41120 [15:28:51] on stat1002 [15:29:25] CPU time spent (ms)=3062440 [15:29:33] Wooo! [15:29:35] Cool. [15:29:38] Thanks! [15:29:40] yup [15:29:48] Will run more tests :) [18:07:51] Ironholds: healthcheck? [18:08:34] I'm in the batcave as requested [19:45:07] lzia, do you have any time today for stats? :D [19:45:22] DarTar, what's your thinking on the research showcase? [19:46:55] Ironholds: let me forward a note I got from reid [19:47:05] kk [20:00:34] DarTar: How do you move cards from the trello backlog board to the main board? [20:01:27] Ah, I see nm [20:07:35] Ironholds: ok, I’ll wait to hear from Reid and confirm then. Hope we can stick to tomorrow, 11 [20:07:42] kk [20:07:58] as said, I just need an ellery or a leila and I'm good [20:08:52] Ironholds: ping them, they’re here [20:09:22] shall do, just setting a query to run first [20:11:29] it works! MY CODE WORKS. [20:12:21] Wooo! [20:12:42] Are we having another staff meeting today? [20:12:47] yup [20:12:55] I have had maybe two hours meeting-free, since 10am [20:13:02] holy CRAP [20:13:03] Wait... but we had one. [20:13:15] this UDF is /fast/ [20:13:17] oh yeah, point [20:13:17] You got me beat. I'm at 30 minutes free. [20:13:17] huh [20:13:21] aw :( [20:13:39] s'ok. I'm hadoopin' now :) [20:13:57] so, the UDF andrew and I wrote? [20:14:10] identifies 1m pageviews in the time between "as said, I just need an ellery.." and now [20:14:12] this thing is FAST. [20:17:22] Turns out that mapreduce is way faster when you do it in parallel. [20:17:31] :S [20:17:53] heh [20:17:55] The default reducer count for my job was 1 [20:17:56] :( [20:18:00] I would phrase that as: the same speed but takes less realtime to complete :D [20:18:07] 146 seconds [20:19:12] :P [20:20:37] holy crap [20:20:46] Java: Like C++ But It Doesn't Make You Hate Yourself. [20:22:04] Yeah... Also, not much faster than python. 
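The CPU counter found earlier with `mapred job -status` is reported in milliseconds. A minimal sketch of pulling that number out of the command's text output, assuming the counter line looks exactly like the one pasted above (`CPU time spent (ms)=3062440`); the function name `cpu_millis` is made up for illustration and the sketch does not run the command itself.

```python
import re
from typing import Optional

# Matches the counter line exactly as it was pasted in the channel.
CPU_LINE = re.compile(r"CPU time spent \(ms\)=(\d+)")


def cpu_millis(status_output: str) -> Optional[int]:
    """Return cumulative CPU milliseconds from job-status text, or None if absent."""
    match = CPU_LINE.search(status_output)
    return int(match.group(1)) if match else None


# The only line we rely on; real output contains many other counters.
sample = "CPU time spent (ms)=3062440"
millis = cpu_millis(sample)
if millis is not None:
    print(millis / 1000.0, "CPU seconds")
```

For the job above that works out to about 3062 CPU seconds, roughly 51 minutes of cumulative CPU, which is a different quantity from the wall-clock time a parallel job takes.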
[20:23:41] halfak: Java is usually much faster [20:23:52] at least on the hotspot VM [20:23:54] than Python, that is [20:23:58] Compared to pypy? [20:26:04] Bah! The internet doesn't know! [20:30:31] halfak: yeah, even compared to pypy, I would say [20:30:43] hotspot vm, at least (that’s the default one) [20:30:46] Personal experience or empiricism there? [20:31:42] I'd totally believe empiricism. Lots of large shops spend time and energy making JVM fast [20:31:58] empiricism [20:32:04] Python is more for the scientific computing and web dev community. We, historically, don't contribute much beyond libraries. [20:32:05] plus benchmarks I looked at a few months ago :) [20:32:10] gotcha. [20:32:15] What order are we looking at? [20:32:35] Twice as fast? 10x? 100x? [20:33:41] halfak: http://benchmarksgame.alioth.debian.org/u32/python.php is python vs Java, and lots of 40x, 60x [20:33:45] that’s cpython [20:34:15] but pypy isn’t a 40x *general* improvement [20:35:42] "Lots of 40x and 60x" isn't quite accurate [20:36:05] Range is 0-60% and pretty linear for the algorithms applied. [20:36:20] I guess the median is 35x [20:36:25] But the next lower is 16 [20:36:29] and 4 [20:36:33] and then ~0 [20:36:45] halfak: very much depends on what you’re throwing at it [20:36:57] It seems. Java is generally faster though. [20:37:09] * halfak should look into jython [20:37:19] I have no idea what sort of libs I can and can't take with me. [20:37:40] If I can take most of the utilities I have built, there's not much reason to not play around with the JVM. [20:38:08] halfak: i just finished something, thinking about coming back to experiments for the day [20:38:10] how goes? [20:38:23] Running the full set right now :) [20:38:29] I have the table half-updated. [20:38:32] oh awesome [20:38:34] I've been playing with reducers. [20:38:34] looking [20:38:38] ja? [20:38:42] Definitely getting far fewer splits out of the bz2 [20:38:50] I need to save. Just a sec. 
[20:39:08] ah k [20:40:08] Saved [20:40:17] Working on the last few. [20:42:02] whoa, you could read the avro uncompressed? [20:42:10] i thought that file was messed up [20:43:12] heh. Seems to have worked. I'll check the output. [20:45:26] Ack! It looks like I accidentally compressed the avro-uncompressed output. [20:45:54] Hmm... Looks like snappy compression happens by default. [20:46:01] ottomata, do you know how to turn that off? [20:46:54] when you save? [20:47:05] when you write output? [20:47:12] Oh... derp. compression=false [20:47:19] -Dmapreduce.output.fileoutputformat.compress=false [20:47:20] yup [20:47:22] I thought I needed to set an UncompressedCodec or something [20:47:52] I'll run those again. [20:55:32] Ironholds, sorry I missed your earlier message. If the stats questions are about your presentation tomorrow or something urgent, let's chat today. Otherwise, Friday? [20:55:43] they are indeed about tomorrow [20:55:49] wanna talk in about 35m? [20:55:51] if you're free [20:56:02] hooki. do you wanna talk now? [20:57:37] I am knee-deep in pageviews stuff :() [20:57:45] halfak, would you count edit attempts as pageviews? I would not, personally. [20:57:57] I don't think so. [20:58:15] good-o [20:58:19] * Ironholds gives Mobile the evil eye [20:58:20] You get a separate request for edit-success, right? [20:59:02] oh, that SHOULD be excluded by MIME type. should. [20:59:46] Dear MediaWiki Developers: [20:59:55] DarTar: halfak so, after my current biggish project ends (in a week or so), I would have time to do either some Quarry upgrades or RCStream+ [20:59:59] which would be more useful? [21:00:12] Please Just Goddamn Get Together And Agree On If Special Pages Should Exist As Index.php Parameters Or Not [21:00:26] * YuviPanda pats Ironholds [21:00:48] I'm having to seriously hack at this Java to get this to work [21:00:48] all because some lazy bloody sod couldn't be bothered to specify MIME types usefully. 
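The two job settings that came up in the experiments (a reducer count above the default of 1, and the `-Dmapreduce.output.fileoutputformat.compress=false` flag) can be kept together in one place. A minimal sketch that only builds the argument list for a Hadoop Streaming job and does not run anything. Assumptions: `streaming_args`, the jar name, and every path and script name below are placeholders, and `mapreduce.job.reduces` is the standard Hadoop 2 property for the reducer count as far as I know; the log itself only confirms the compression flag.

```python
from typing import List


def streaming_args(jar: str, input_path: str, output_path: str,
                   mapper: str, reducer: str,
                   reducers: int, compress_output: bool) -> List[str]:
    """Build an argv list for a Hadoop Streaming job (nothing is executed)."""
    compress = "true" if compress_output else "false"
    return [
        "hadoop", "jar", jar,
        # Generic -D options have to come before the streaming options.
        f"-Dmapreduce.job.reduces={reducers}",  # assumed property name
        f"-Dmapreduce.output.fileoutputformat.compress={compress}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# Placeholder jar, paths and scripts, purely for illustration.
args = streaming_args("hadoop-streaming.jar", "/tmp/in", "/tmp/out",
                      "mapper.py", "reducer.py",
                      reducers=8, compress_output=False)
print(" ".join(args))
```

Keeping the flags in a helper like this makes it harder to rerun a benchmark with the default single reducer or with snappy output switched back on by accident.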
[21:00:50] YuviPanda, does RCStream+ include rewind? [21:01:06] halfak: up to a certain point (last hour / day?) [21:01:57] Seems like a good place to start. Personally, I want to be able to go all of the way back to the beginning of RecentChanges. [21:02:14] you mean 30 days? [21:02:15] Then again, I could read all of the recentchanges and then ask rcstream+ to pick up from there. [21:02:18] or however long we keep rc around? [21:02:25] yeah. 30 [21:02:39] well, depends on what architecture I end up coming up with, who knows :) [21:02:42] but probably not [21:02:47] But really, if I could come to rcstream and say, "Start giving me events after rcid " [21:03:01] And have it say, "Cool. here you go" [21:03:02] halfak: the main use of ‘resume’ is ‘oh I restarted my service now, can you resume from here?' [21:03:15] halfak: yeah, I think that’ll have to be built as an abstraction on top of labsdb + rcstream+ [21:03:19] or "I don't have rcid , so you're going to have to get them from the API/DB" [21:03:44] Really, I need to be able to have guarantees about what I do and do not get from rcstream. [21:04:07] I guess I could buffer from rcstream while I read the API/DB. [21:04:11] indeed, that would be a defined parameter, probably along the lines of ‘last 5000 events’ or ‘last 50000 events’ depending on how I end up building it [21:04:13] But that's painful. [21:04:26] halfak: yeah, so we should write an abstraction over it :D [21:04:45] Indeed. Preferably an abstraction that a user will not see. [21:05:13] halfak: probably in the form of a client library, that knows what to hit... [21:05:24] we could even call it mw-events or something... [21:05:24] :P [21:05:34] Agreed. [21:05:40] But what about java devs? [21:05:44] And perl devs [21:05:49] screw them? :P [21:06:00] Ha! [21:06:02] :P [21:06:09] halfak: perl devs, *definitely* :P [21:06:16] halfak: we could write an abstraction for this too, in rust ;) [21:06:34] Been meaning to pick up rust. 
Also, have found zero hours for that. [21:08:04] halfak: yup. it’s good time - libraries don’t exist. I’m slowly fixing up a redis library... [21:08:21] halfak: but anyway, an alternative is to do what you said - put up a network service that recounts from whenever. [21:08:33] halfak: but, if you build mw-events, you can just as well set up another service that’s just mw-events... [21:08:41] yup. [21:08:54] Right now, mwevents works just fine, but doesn't take advantage of RCStream [21:09:23] Also, it would be great if it could use the DB. [21:09:29] indeed, it should... [21:10:02] Polling the DB... Hey DB got anything new? How about now? And now? And now? ... [21:10:20] halfak: sean might hate you :P [21:10:20] or not [21:10:34] If someone can help me stand up a better solution.. [21:10:46] halfak: well, we could fix rcstream to be not lossy :) [21:10:58] which would mean using lpush / rpop instead of pubsub [21:11:23] Now we're talking. [21:12:31] halfak: :D it’s the great pubsub vs lpush/rpop schism :) [21:12:43] latter is not lossy at all, but can blow up if not done carefully [21:12:48] plus also isn’t as fast as previous [21:13:03] which is lossy (if client disconnects) but also super fast [21:13:32] halfak: however, *assuming* that the rcstream service at wikimedia itself isn’t lossy, it’s trivial to set up a lossless version. [21:13:34] like, really trivial. [21:13:58] YuviPanda, is this also true if a client comes in with a notion of state -- when the connection was lost? [21:14:57] That way, a queue wouldn't need to build up after the client goes offline. [21:14:58] halfak: yeah, with varying degrees. if we expose the plain redis protocol, then yes, for sure. if it is over websocket, we’ll need some ‘ack’ mechanism. [21:16:02] halfak: ok, I don’t fully understand. want to hit batcave to tell me? :) [21:16:32] sure. :) [21:20:50] do we have the staff meeting now or not? [21:20:51] or, in 10m, rather [21:26:17] leila, wanna meet? 
:0 [21:26:18] *:) [21:28:13] YuviPanda: ahem, kafka for RCStream? only because it is my darling...:) [21:28:35] everything you are mentioning sounds like kafka [21:31:01] DarTar, for WikiGrok, version (a), should I expect any response other than 0/1 in the response table? In other words, should "not sure"s show up there? [21:31:17] nope [21:31:24] we only capture 0 or 1 [21:31:29] and for version b, we get 1 and NA? [21:31:41] yes [21:31:59] got it. in order to find out not sures for version a, we have to construct them using the other table right? [21:47:41] Ironholds, I can chat in 40 min. would that work? [21:47:51] sure [21:48:01] I'll let you know if I come out of the next meeting earlier [21:49:02] kk! [21:57:22] halfak: is this the phab instance we can use for testing? https://phab-01.wmflabs.org/ [21:57:40] Yes [21:57:58] kk [21:59:53] ottomata, table updated. [21:59:59] I'm digging into wikihadoop though. [22:00:07] oh ja? [22:00:09] for why? [22:00:22] I didn't get the raw XML test working yet. [22:00:44] ah ja [22:00:45] cool ok [22:00:48] that would be cool [22:01:21] How come there is no json-uncompressed? [22:01:23] hm, i should get some json snappy compressed data in there [22:01:29] there is json uncompressed, right? [22:01:30] Oh yeah. [22:01:31] just no snappy compressed [22:01:32] yeah [22:01:33] That one [22:01:46] you gotta create it ? you created the json ones, ja? :p [22:02:03] I suppose. I can do that. :) [22:02:08] cool, danke [22:02:19] halfak: ottomata that was nice :) I’ve been fairly, uh, ‘meeting starved’ of late :D [22:02:31] haha, that is an enviable condition! [22:02:38] Indeed. was nice to say hello face-to-face [22:02:46] yay for batcave! [22:02:53] milimetric, ^ [22:02:54] ottomata: indeed, have nothing other than the monday night ops meeting, where I maybe speak 3-4 sentences each time :) [22:02:55] actually, yeah, i like those types of meetings [22:03:04] yeah, these ones are nice :) [22:03:09] ottomata: kafka is scala, right? 
[22:03:12] yes [22:03:25] YuviPanda: I gave a kafka talk a monthish ago, if you are interested [22:03:41] ottomata: sure! [22:03:50] halfak: I’ve been thinking of writing RCStream+ in something that’s not python. [22:03:50] https://www.hakkalabs.co/articles/apache-kafka-wikimedia [22:03:54] sound is kinda bad [22:04:12] halfak: don’t know if I really want to write something that’s properly multithreaded in python [22:04:23] That makes sense. [22:04:49] halfak: might write in Scala, although my heart says C# [22:05:15] YuviPanda: SCALA! [22:05:26] but… but… there are no good IDEs! [22:05:29] I would very much like to learn it, every once in a while I start [22:05:39] IDEshmaideeEE [22:05:46] Boo to IDEs [22:05:59] halfak: well, if you’re writing something java based, you *need* an IDE [22:06:18] plus, C#’s parallelism stuff is *really good* [22:06:29] Oh. Fair point. I don't think I've written java without an IDE. [22:06:50] halfak: heh, in most colleges here people are taught to write Java in… notepad.exe. [22:07:14] end result being getting it to compile without any syntax errors gives you an A+, and most people don’t manage it after 2 years of classes... [22:07:21] also, no indenting. [22:07:50] halfak: ottomata but we already have scala in our infrastructure, and a fair amount of JVM stuff... [22:07:52] no indenting.... [22:07:53] and no C# stuff [22:08:05] halfak: yes, all the code I saw people write throughout school / college - zero indents [22:08:14] WHY [22:08:24] YuviPanda: I don't use an IDE! [22:08:26] for java! [22:08:46] halfak: well, nobody told them to, and A+ is getting it to compile, so A is getting it down to one or two errors... [22:08:49] syntax errors [22:08:55] halfak: and then these same people become professors... [22:08:57] cycle continues [22:09:08] ottomata: woah. so much typing. 
[22:09:12] ewww [22:09:29] sublime does a little bit of autocomplete [22:09:53] halfak: I’m sure I mentioned my networking ‘professor’ who proudly told everyone that ‘webmasters’ hate ‘caching’ because it reduces their ‘ hit counters’ and thus reducing ad revenue... [22:10:27] ottomata: hmm, so package names, etc? do you even remember those? [22:10:44] ottomata: I’ve become used to just typing a class name and IntelliJ automatically adding the import for me [22:12:07] nope, sometimes I copy/paste from online java docs [22:12:14] i also only do explicit imports [22:12:18] no wildcards! [22:12:35] yeah, I’ve flip flopped between ‘omg all the wildcards’ to ‘no wildcards' [22:12:38] currently at ‘no wildcards' [22:13:18] ottomata: halfak hah, just remembered/realized there’s no UDP involved at all in rcstream (only in the IRC thing). the MW servers directly put JSON in redis. [22:13:24] so no need to worry about that at all [22:14:24] halfak: also, rcstream hasn’t been ‘officially launched’ yet [22:15:32] redis: cool, no launch: does that mean we can't get this in front of people soon? [22:16:12] halfak: well, it essentially means if it breaks, there’s going to be nobody paged. [22:16:17] no priority support [22:16:29] Gotcha [22:16:38] That makes sense for the time being. [22:16:43] halfak: yeah [22:16:51] halfak: but of course, if we start using it… :D [22:16:57] :D [22:18:14] ottomata, I think you missed an excellent exchange in the dev sync-up earlier from Christian and I [22:18:19] halfak: anyway, I’ll probably end up using Scala for this. Will keep you updated. [22:18:26] Sounds good. [22:18:31] "so what are you using to write Java?" "Oh, just Sublime Text and the terminal" "AAAAH, you've been listening to Andrew!" 
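The "start giving me events after rcid N" abstraction from the RCStream discussion, where a client library backfills from the API/DB and then continues from events buffered off the stream, can be sketched roughly as follows. Everything here is hypothetical: `resume_from`, the event shape, and the `backfill` callable stand in for whatever mw-events or RCStream+ would really expose, and the sketch assumes rcids only ever increase.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, int]  # hypothetical event shape: {"rcid": 123}


def resume_from(last_rcid: int,
                backfill: Callable[[int], Iterable[Event]],
                live_buffer: Iterable[Event]) -> Iterator[Event]:
    """Yield every event with rcid > last_rcid exactly once, in rcid order.

    backfill(last_rcid) stands in for an API/DB query; live_buffer stands in
    for events buffered from the stream while that query was running.
    """
    seen = last_rcid
    for event in sorted(backfill(last_rcid), key=lambda e: e["rcid"]):
        if event["rcid"] > seen:
            seen = event["rcid"]
            yield event
    for event in sorted(live_buffer, key=lambda e: e["rcid"]):
        if event["rcid"] > seen:  # drop overlap already served by the backfill
            seen = event["rcid"]
            yield event


def fake_backfill(after: int) -> List[Event]:
    """Pretend API/DB lookup holding rcids 101 to 103."""
    return [{"rcid": r} for r in (101, 102, 103) if r > after]


buffered = [{"rcid": 103}, {"rcid": 104}]  # overlaps the backfill on 103
print([e["rcid"] for e in resume_from(100, fake_backfill, buffered)])
```

The overlap check is what gives the guarantee asked for in the channel: the caller sees no gaps and no duplicates, and never needs to know which source an event came from.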
[22:18:31] halfak: \o/ [22:18:39] re IDEs [22:18:48] haha [22:18:49] * YuviPanda wonders if qchris is an IntelliJ person [22:18:51] aaah [22:18:53] he just left :P [22:18:53] you should probably listen to christian [22:18:57] i could be converted i'm sure [22:19:01] to an IDE [22:19:05] ottomata: try IntelliJ! [22:19:06] noo, he wanted me to use Eclipse! [22:19:07] if someone showed me how to set it up. TOO MANY BUTTONS [22:19:09] ottomata: Eclipse is a piece of shit. [22:19:09] i've tried them all [22:19:11] TOO MANY BUTTONS [22:19:26] I sudo apt-get installed eclipse, looked at it, and immediately sudo apt-get removed [22:19:29] ottomata: mine has 0 buttons, and has an embedded vim in it [22:19:45] YuviPanda, congrats [22:19:55] you answered the question "how do you make intelliJ worse" successfully [22:20:00] "embed vim in it" [22:20:07] modal editing best editing [22:20:09] "now, even the interface you go to to get away from the shitty GUI sucks" [22:20:15] ha, YuviPanda, i bet you could convert me then [22:20:45] ottomata: indeed! :D It’s fairly trivial to set up as well - just hide all the goddamn toolbars, install the IdeaVim plugin, and DONE. [22:22:02] ottomata: I shall try the scala plugin too [22:23:02] halfak, you know the other advantage of hive? [22:23:07] for pageviews? [22:23:10] ok i gotta run, try to convert me tomorrow [22:23:17] o/ ottomata [22:23:26] we can go "WHERE wsc_pageviews() != prototype_pageviews()" and also SELECT the outcomes of that run [22:23:28] it's glorious [22:23:41] also, writing unit tests for hive is terribly duplicative but great fun. I love writing tests. I actively enjoy it. [22:23:48] How do those UDFs work without parameters? [22:23:53] Do they get to see the whole row? 
[22:23:57] "okay, let's think of all the ways some idiot like me could break software written by some idiot like me [22:23:59] oh, we'll have to include params, it was a silly example [22:24:15] but the point is we get actual pseudo-natural language elements. I love that stuff. [22:24:15] gotcha [22:24:16] :) [22:24:40] I think the new definition needs a name like WSC [22:24:49] iron_pageviews() [22:27:11] but it was a team effort! [22:27:13] ooh [22:27:18] can we call it rebel_alliance_pageviews [22:27:28] lol [22:27:51] I think we know who was the leader here :) [22:28:17] * YuviPanda murmurs about star wars references [22:28:19] not enough trek ones [22:28:30] some day, someone will put me in charge of server naming. [22:28:32] some day... [22:29:31] I like star wars. [22:29:34] I like it a lot. [22:29:53] halfak, okay, if I was a leader I get to name it ;p [22:32:20] Ironholds, I'm ready [22:32:43] Hangout? [22:33:35] totally [22:48:35] * halfak waits ages for wikihadoop to get on with it. [22:48:52] I could have converted the whole dump to json and processed it in this amount of time. [23:03:45] And then of course it fails! AHH. [23:15:32] halfak, leila and I dug into the distribution for that dataset [23:15:34] looks binomial! [23:15:47] so I'm going to use a Chi2 test given the sample size (100k for each group) [23:15:51] That, we already knew. :P [23:16:39] we did? [23:16:42] I thought you said biMODAL! [23:16:53] Oh! I misread [23:16:54] I've spent 48 hours tearing my hair out trying to work out WTF the two distributions are! :D [23:16:57] Binomial? [23:17:15] that's what Kolmogorov-Smirnov sez [23:17:30] Wut [23:17:36] That doesn't make any sense. [23:19:24] wait, hangon [23:19:26] you're right [23:19:40] urgh. I do not know enough to interpret this test. 
[23:19:54] leila, as a test I compared a normal distribution and a uniform distribution under the KS test [23:20:09] it looks like the "alternative hypothesis" is "these are not the same distribution" [23:20:15] bollocks. [23:20:55] * Ironholds cries [23:20:57] I give up on maths. [23:21:07] I'm going to be a Jedi Dev, like my father before me. [23:24:48] Ironholds, how did you do a KS? you should make a fit1 which is the distribution of the real data (fitdistr does that, for example), and fit2 is the control. [23:27:49] oh, I see what I did [23:27:55] I got the arguments the wrong way around. Fail whale. [23:29:10] okay, I give up on this, for this evening. I have a headache and cannot think. [23:33:00] Ironholds, have nap. Feel better. :) [23:33:33] a nap would mess up my sleep cycle, but I appreciate the thought.
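For the Chi2 test mentioned for the two 100k groups, a minimal sketch of a Pearson chi-squared test on a 2x2 table of group by outcome, using only the standard library. The counts in the example call are invented for illustration and are not from the dataset discussed; `chi2_2x2` is a made-up name, no continuity correction is applied, and the code assumes every expected cell count is nonzero.

```python
import math
from typing import Tuple


def chi2_2x2(a_yes: int, a_no: int, b_yes: int, b_no: int) -> Tuple[float, float]:
    """Pearson chi-squared statistic and p-value for a 2x2 table.

    One degree of freedom, no continuity correction; assumes no row or
    column total is zero.
    """
    observed = [[a_yes, a_no], [b_yes, b_no]]
    row_totals = [a_yes + a_no, b_yes + b_no]
    col_totals = [a_yes + b_yes, a_no + b_no]
    total = row_totals[0] + row_totals[1]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the chi-squared tail probability is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Invented counts: 100k per group, purely illustrative.
stat, p = chi2_2x2(5200, 94800, 5000, 95000)
print(f"chi2={stat:.3f}, p={p:.4f}")
```

As with the KS test above, the null hypothesis here is "no difference between the groups", so a small p-value is evidence that the groups differ, not that they match.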