[00:15:27] mako, Just got back. What's up? [00:25:35] leila, FYI moiz was looking for you. [00:26:16] thanks, halfak1 [00:26:20] !* [00:41:56] halfak: so I had an idea. [00:42:22] what if instead of a C++ application, we just adapt the hive queries? Create views for each def, and then creatively JOIN them to grab the rows that appear in one but not the other. [00:42:33] It's a lot simpler and it means that when we find something that needs tweaking, we only need to update one thing. [00:42:48] +1 [00:42:59] awesome :). That's all I wanted to bug you about! [00:43:04] Sounds good. I thought the join might be intractable at the scale we are working with. [00:43:15] Also, I thought that working with a sample might be faster. [00:43:19] BUT you can sample in hive. [00:43:24] :) [00:44:51] yup! [00:45:00] also, we can just look at a particular period [00:45:20] "oh, there's a weird disparity in [hour]. Okay, CREATE VIEW... CREATE VIEW... SELECT * FROM... LIMIT 1000000;" [00:47:59] halfak, also, you'll love this: [00:48:06] some of my library tests failed when I uploaded to winbuilder [00:48:15] because they're at UTC+1 [00:48:26] I guess one other reason I wanted the utility is because I want to stream process views, but I can work that out from the HIVE query. [00:48:39] UTC+1? [00:48:43] Summer time? [00:48:56] naw, some of the unit tests involve converting timestamps to a numeric format [00:49:06] and, well....POSIXlt assumes local time. whoops. [00:49:18] example use case for stream processing? [00:49:57] Your local time isn't UTC+1 though. [00:50:10] Ironholds, generating session view triples [00:50:31] Actually, generating sessions at all. [00:53:01] oh! Gotcha [00:53:20] and yeah, mine isn't, and the test is set to check that as.numeric(ts) equals (numeric_value) [00:53:29] numeric_value is localised. Their locality != my locality. [00:53:37] the problem is the test is stupid. will rewrite. [00:53:56] Oh! [16:28:21] hey YuviPanda! 
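The "rows that appear in one but not the other" plan above is an anti-join. A toy Python sketch of the logic follows; the rows and key names are made up, and in HiveQL the same thing would be a LEFT OUTER JOIN between the two definition views filtered to NULL matches:

```python
# Toy illustration of the anti-join idea from the discussion above: find rows
# matching one definition of an event but not the other. In HiveQL, roughly:
#   SELECT a.* FROM def_a a LEFT JOIN def_b b
#     ON (a.user = b.user AND a.ts = b.ts)
#   WHERE b.user IS NULL;
# The (user, ts) rows below are hypothetical sample data.

def_a = {("alice", 1), ("bob", 2), ("carol", 3)}   # rows under definition A
def_b = {("bob", 2), ("dave", 4)}                  # rows under definition B

only_a = def_a - def_b   # caught by A, missed by B
only_b = def_b - def_a   # caught by B, missed by A

print(sorted(only_a))
print(sorted(only_b))
```

Because each definition lives in its own view, tweaking one definition only means updating that view and re-running the join, which is the maintenance win described above.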
I have a question re labs capacity [16:30:21] DarTar, should we start a Hangout to go over QA? [16:30:40] hey [16:31:41] leila: we can do hangout, I’m not ready yet though – schools are closed and the girls are at home, so everything is taking 3x longer [16:31:45] halfak: hey [16:31:47] halfak: sure [16:32:05] DarTar, np. I'll start some more testing then. Just ping me when you're ready. [16:32:24] leila: will do (I saw your notes in the pad) [16:32:36] * DarTar waves at halfak [16:32:44] + YuviPanda [16:32:53] * YuviPanda waves frantically at everyone too [16:33:02] I’m WFH this morning but will probably come in in the afternoon [16:33:17] * YuviPanda has been doing an audit of our databases, so… gnarly [16:36:22] YuviPanda, so, I have someone who is working on a service that needs some hefty CPU. [16:37:24] Labs is the right place for it. [16:37:33] So I'm not sure how to advise [16:37:46] halfak: define ‘hefty' [16:38:58] FaFlo, ^ [16:41:59] YuviPanda, my take is that this would be a single machine taking full advantage of 12 cores. [16:42:31] we have 8 core machines available, although not right now (next week, hopefully? new hardware is in DC, needs to be racked and set up) [16:42:37] so you can use a couple of those perhaps [16:42:50] quarry runs on 3 such machines, for example :) [16:43:07] hi, yeah, so I'm planning to set up a service that would need to do parsing of the full history dump of all articles and should do so on-demand for specific articles [16:43:21] that would probably be quite "hefty" :) [16:43:29] hmm, are you going to be storing it in an intermediate form? [16:43:39] Oh yeah. That bit too. [16:43:42] yes [16:44:01] so the intermediate results would also take some space [16:44:08] do you know what you’re going to be storing it on? [16:44:10] just disk? [16:44:16] for disk you have both NFS and local instance storage [16:44:21] latter caps at 160G [16:44:38] Wikipedia down? 
PHP fatal error :-/ [16:44:40] former’s pretty huge, but slower [16:44:46] well, I would probably need more than 160G [16:44:54] so that is that decision :) [16:45:08] There's also the postgres instance, right? [16:45:29] ah, yep. we’ve a postgres instance you can use for things as well. it’s running on raw hardware, so has more space / will be faster [16:53:12] halfak, are you guys safe and sound? [16:53:16] where are you working from? [16:53:22] Yup. At the office. [16:53:28] There's a little bit of rain and wind. [16:53:31] good. that's probably the safest place to be [16:53:33] No big deal. [16:53:35] :) [16:53:43] cool! happy to hear it. :-) [16:53:57] tch [16:53:59] pansies ;p [16:54:09] haha! Ironholds. [16:54:15] * halfak offers brofist to Ironholds [16:54:36] BTW, roan just stopped by to talk weather in the netherlands. :) [16:54:48] Apparently, they are hard core. [16:55:10] oh yeah, the dutch are CRAZY [16:55:17] halfak, you know how they resisted invasion for almost a century? [16:55:32] if anyone invaded they opened their dams. [16:55:34] By being the invaders [16:55:48] oh [16:55:51] There's a famous exchange between the King of Prussia and the Queen of the Netherlands in which Prussia boasts that his soldiers are seven feet high [16:56:02] and she points out that if he invades, they'll need to add an extra two feet to breathe. [16:56:08] heh [16:56:20] So, they threaten to flood their own land. [16:57:01] yep [16:57:10] On the subject of which, I just received the weirdest request from a dutch person [17:01:25] * halfak waits for Ironholds to continue [17:02:29] halfak: greetings [17:02:42] Hey mako! 
[17:02:48] Sorry to miss you yesterday [17:03:00] halfak, oh, sorry [17:03:15] In exchange for help with the R code their PhD is powered in, they're offering to let me stay in their house over Christmas [17:03:21] halfak: it's fine, i had a faculty meeting (hiring decision, went very long) and then a research group meeting [17:03:33] Ironholds: that sounds like ‘hey can you help me with this over christmas?' [17:03:35] halfak: the latter was at a brewery and happily went long :) [17:03:40] Ironholds, is it a cool house that's near where you want to be? [17:03:54] mako, makes sense :) [17:04:03] halfak, it's in the Netherlands [17:04:23] I dunno. Maybe you want to go hang out in the Netherlands over the holiday. :P [17:04:52] true! [17:04:58] can you get a visa that quic….oh wait [17:09:47] YuviPanda, yes :D [17:13:52] DarTar, do you know what's the lag between an event occurring and the corresponding event actions showing up in MobileWebWikiGrok_10352247? [17:14:17] leila: there should be no lag, if replag is under control. Are you on s1 or big box? [17:14:29] replag is usually shorter on big box [17:14:31] s1 [17:14:42] brb [17:15:22] actually, sorry, I'm in analytics-store.eqiad.wmnet, not s1 [17:16:56] (you can check lag here: https://tendril.wikimedia.org/tree analytics-store is dbstore1002 [17:16:56] ) [17:26:48] hey YuviPanda, https://tools.wmflabs.org/xtools/articleinfo/index.php (linked from all page history pages) appears to be down [17:27:48] DarTar: restarted it [17:28:00] YuviPanda: thx [17:28:51] leila, halfak: I’ll be a few mins late for RG, getting the girls ready for the day [17:30:25] np DarTar [17:30:33] kk I'm running late too. Lost track of time. [18:38:00] DarTar, I'm in the Hangout [18:38:13] leila: k, give me 1 min [18:38:19] sure. [19:33:14] hey — was there a bug that impacted article counts fixed recently?
[19:33:23] halfak: was I talking to you about this? [20:05:59] tnegrin, sorry to miss your message. I was out to lunch. [20:06:06] I think that Oliver was digging into that. [20:06:10] *Ironholds [20:11:37] IH|away: are you around? ;-) [20:12:14] J-Mo, do you have Mac or Linux? [20:13:39] Mac [20:22:13] halfak: no worries — not urgent [20:22:44] we got a note from a Norwegian community member about a drop in articles — wondering if they were related [20:26:58] tnegrin, seems likely. [20:27:13] halfak: hiya! [20:27:19] where is that ironholds??? he’s not allowed to leave his computer [20:27:59] hey ottomata. [20:28:11] I actually have a job running right now on some simplewiki JSON. [20:28:13] so ja, hm, so sometimes revisions are out of order in the xml! [20:28:14] sheesh! [20:28:16] I think we should use that to test. [20:28:19] hmmm [20:28:26] simplewiki json? oh revisions from that? [20:28:32] Yup [20:28:41] how big? I would like it if this test was on a lot of data [20:28:44] at least a few gigs [20:29:22] e.g. if you point me to an xml dump file(s) i can start from those and convert them [20:29:26] It's a few gigs compressed. [20:29:30] perfect [20:29:34] We don't want to do much more. Diffs take a long time. [20:29:37] yeah, that is plenty [20:29:39] * halfak is getting path [20:31:21] /user/halfak/hadoopin/diffengine/simplewiki-20141122-pages-meta-history.streamed.json.bz2 [20:31:37] Looks like it is just 1.2GB [20:31:49] Still. I have been running on it for more than 24h to make diffs. [20:32:18] 1.2G is fine [20:32:21] compressed [20:32:42] ok, i am going to convert that to avro snappy and avro bz2 [20:32:50] kk. [20:32:51] then we will have some base data to compare jobs on [20:32:57] any changes you want me to make to avro format before I do? [20:33:09] So for speed tests, how will we deal with cluster load? [20:33:12] https://gerrit.wikimedia.org/r/#/c/171056/4/refinery-core/src/main/avro/mediawiki/RevisionDocument.avsc [20:33:16] yes.
page.page_namespace_id [20:33:23] is silly [20:33:26] ha, halfak good q, it will make our stuff non scientific, for sure [20:34:10] this is just a rough idea for now [20:34:11] oo, also good: since you are already running this job on this specific file, we already have one of the tests down [20:34:30] do you also have that file converted to json? [20:35:09] halfak: shall we just call it namespace then? [20:35:27] That file is JSON. [20:35:41] and yeah. I think we should just call it "namespace" [20:35:42] oh! [20:35:47] If we flatten, "page_namespace" [20:35:48] oh sorry, didn't read filepath [20:35:53] ok [20:36:19] um, do you know the original .xml.bz2 file(s) it came from? [20:36:25] Otherwise, schema looks good. [20:36:30] Yes. Sec. [20:36:47] I'll put it right next to the json in a few minutes [20:37:38] ok awesome [20:37:41] danke [20:39:04] /user/halfak/hadoopin/diffengine/simplewiki-20141122-pages-meta-history.xml.bz2 [20:39:14] danke [20:39:26] be back in a few minutes. [20:46:37] k, running job to convert to avro.snappy [20:46:43] after that will create avro.bz2 [20:47:24] i'm saving the output into that directory under similar names: e.g. /user/halfak/hadoopin/diffengine/simplewiki-20141122-pages-meta-history.avro.snappy [20:47:29] (that will be a directory though) [20:55:26] kk [21:00:38] hm, maybe you are right, the cluster is too busy to do experiments. I think I am loading it down with my enwiki xmldump conversion [21:00:42] maybe I should just stop that for now? [21:00:45] halfak: ^ [21:00:53] especially since we are still changing formats, etc.? [21:03:12] ha, already about 190G of snappy avro data done :) [21:04:02] leila: yt? I’m on the call with legal [21:04:15] yes, DarTar. you wanna add me? [21:04:28] I don't have an event for it [21:05:20] ottomata, running enwiki through wikihadoop made the cluster unusable the last time I tried. [21:06:26] DarTar, are you adding me?
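The flattening convention agreed on above (nested fields like `page.namespace` collapsing into a prefixed top-level `page_namespace`) can be sketched in a few lines; the field names below are illustrative, not the actual refinery schema:

```python
# Illustrative sketch of the schema-flattening convention from the discussion
# above: nested record fields collapse into underscore-prefixed top-level
# keys, so page.namespace becomes "page_namespace". Field names here are
# hypothetical examples, not the real RevisionDocument schema.

def flatten(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

doc = {"page": {"namespace": 0, "title": "Earth"}, "id": 12345}
print(flatten(doc))  # {'page_namespace': 0, 'page_title': 'Earth', 'id': 12345}
```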
[21:06:27] ha, i am doing it now :) [21:06:37] but [21:06:40] halfak, i'm doing it all at once [21:06:48] leila: done [21:06:53] i'm looping through each .bz2 file at a time [21:06:57] doing one job for each [21:06:58] thanks! [21:07:13] i've actually never been able to launch a wikihadoop job on a full enwiki dump all at once [21:07:16] ottomata, any way we can see how much CPU time was used for finished jobs? [21:07:18] it spends days planning splits [21:07:24] aye [21:07:29] ottomata, last time I tried, it took about 24h to do the splits. [21:07:37] i've seen it take longer, but ja [21:07:42] that is silly, i think [21:07:55] i suspect that is a bz2 problem [21:08:10] since bz2 is so intensive, and I don't know how it calculates splits in bz2, but it must do something with it [21:08:24] bz2 splits natively in hadoop [21:08:34] yes, but i dunno how hadoop calculates it [21:08:38] gotcha. [21:08:41] * halfak shrugs [21:08:44] the splits* [21:08:44] yeah [21:08:51] it shouldn't take 24 hours just to PLAN a job [21:09:23] heh. [21:09:51] hadoop: "I'ma scan everything 3 times and still duplicate revisions sometimes :D" [21:10:06] haha [21:10:54] ja ok, i'm going to stop my enwiki conversion [21:12:25] It would be great if we could figure out a better way to convert xml dumps. [21:12:35] It takes way too long to do it on a single machine. [21:13:13] well i mean, that's what this avro thingee is for :) [21:13:28] it does use wikihadoop [21:13:30] but meh? [21:13:31] meh. [21:25:18] ottomata, should I assume that the avro input will be trivially convertible to a json doc? [21:27:52] halfak, if you use hadoop streaming, it will come to use as json [21:27:58] to you* [21:28:00] :) kk [21:28:10] the AvroAsTextInputFormat does that [21:28:25] it'll look exactly like I pasted in that email [21:28:37] that was a line i took from hadoop streaming running on avro data [21:28:45] python file that just did print(line) [21:28:53] great.
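A minimal sketch of the streaming-mapper pattern described above: with AvroAsTextInputFormat, each record arrives on stdin as one JSON line, so the mapper just strips and parses line by line. The sample records are made up; in a real job the lines would come from `sys.stdin`.

```python
# Minimal Hadoop Streaming mapper sketch for the setup discussed above: with
# AvroAsTextInputFormat, each input line is one record serialized as JSON.
# Stripping the trailing newline before parsing (and before any print(),
# which appends its own) avoids emitting spurious blank lines.
import json

def read_records(lines):
    """Yield one parsed record per non-blank JSON line."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines rather than crash on them
            continue
        yield json.loads(line)

# Stand-in sample input; a real job would pass sys.stdin instead.
sample = ['{"id": 1, "text": "foo"}\n', '\n', '{"id": 2, "text": "bar"}\n']
records = list(read_records(sample))
print(records)
```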
[21:29:07] You said there were blank lines too. [21:29:11] Any idea what's up with that? [21:29:19] yeah, not sure what that was about, maybe just a bad way I was doing it? not sure [21:29:25] i just saw that when I cat the file out from hdfs [21:29:36] dunno if you will actually get blank records in your mapper [21:29:43] don't think you should [21:29:48] It could be that when you do "for line in sys.stdin" you get a "\n" at the end. [21:29:59] hm, yeah, that is probably it [21:30:01] i am not stripping [21:30:05] And print() adds its own [21:30:07] it was a naive little print() :) [21:30:27] i just wanted to verify that it was data that would make you happy :) [21:30:29] heheh [21:45:23] arg. ottomata, our schemas have that redirect_title difference. [21:45:27] Stupid XML format [21:45:35] What an annoying little detail! [21:45:55] ah we flattened [21:45:59] i can change it back if you want....:) [21:46:22] prob not going to run conversion tests today, so doesn't hurt to rerun the conversion again before I leave [21:46:23] Na. I think this is better. It means that I'll be writing a script to rename one field and re-write the data though :( [21:46:36] in json? [21:46:41] can't you just reconvert again? [21:46:43] dump2json? [21:46:45] that one file? [21:46:55] I have 99% of the enwiki dump converted. [21:46:58] OH [21:47:00] :/ [21:47:08] So, I'll just include this in my stream job. [21:47:15] aye, checking for the field? [21:47:19] if this or that? [21:47:32] e.g. "dumb2awesome | json2diffs" [21:48:09] dumb2awesome will look for a weird "redirect" object and convert it [21:48:18] And I'll just use that until we convert a new dump. [21:51:03] * halfak feels better [21:52:28] hoookay [22:57:52] Deskana, do you use Chrome or Firefox? [22:57:59] leila: Chrome. [22:58:10] in my Chrome it works, too. I'll reply [22:58:58] leila: Ah, beat you to it. [22:59:51] :D [23:00:22] why is your Adblock so old Deskana? ;-) [23:00:34] I don't update things I guess.
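The "dumb2awesome" shim mentioned above could look something like this: a filter that spots the old nested "redirect" object in already-converted JSON docs and rewrites it as the flattened "redirect_title" field of the new schema. The exact shape of the old field is an assumption here; only the field names "redirect" and "redirect_title" come from the discussion.

```python
# Sketch of a "dumb2awesome"-style stream filter as discussed above: rename
# the old nested "redirect" object in converted JSON revision docs to the
# flattened "redirect_title" field. The nested shape {"redirect": {"title":
# ...}} is a hypothetical guess at the old format, not a confirmed schema.
import json

def dumb2awesome(doc):
    redirect = doc.pop("redirect", None)
    if isinstance(redirect, dict):
        doc["redirect_title"] = redirect.get("title")
    elif redirect is not None:
        doc["redirect_title"] = redirect  # already a bare title string
    return doc

# Stand-in sample; in the pipeline this would read sys.stdin line by line,
# e.g. as the first stage of "dumb2awesome | json2diffs".
old_doc = {"id": 42, "redirect": {"title": "Main Page"}}
print(json.dumps(dumb2awesome(old_doc)))
```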
:P [23:00:48] why would you, when things work I guess. ;-) [23:03:39] Deskana, did you see Nemo_bis email? [23:03:55] I didn't know there are varieties to it. [23:06:59] bleeh [23:09:05] blooh [23:12:57] leila: I did, but I naively expected it wouldn't make a difference. [23:13:19] it seems it doesn't, Deskana, on Chrome. [23:13:27] Ah, well, not so naive then. ;) [23:13:41] Deskana, how bad is it if I send screenshots to analytics@ [23:13:49] I have a feeling folks will hate me. ;-) [23:13:51] Upload them to imgur to be safe [23:13:53] That's what I do [23:13:56] +1 [23:13:58] for imgur [23:14:27] hokay. something to learn. ;-)