[12:59:46] yooooooo
[13:06:48] (PS1) Erik Zachte: new backup file for non-private squid log data, for now to stat1 [analytics/wikistats] - https://gerrit.wikimedia.org/r/79779
[13:10:26] is ottomata IN DA HOUSE???
[13:10:39] is qchris in THAAAA HOUSSSSSS?
[13:11:21] is average in THAHATATATATA HOUSE HOUSE???
[13:11:30] is MILIMETRIC in TTTTTTHHHEEEEEEE HHHH--O-U-SSSSSSSSSSSSSS!?!?!?!?
[13:11:45] hey qchris!
[13:11:52] (CR) Erik Zachte: [C: 2 V: 2] Add details for language code tuv [analytics/wikistats] - https://gerrit.wikimedia.org/r/78826 (owner: Erik Zachte)
[13:11:54] nice job with detecting the drop
[13:12:10] there is a larger question here (i believe)
[13:12:15] (CR) Erik Zachte: [C: 2 V: 2] exclude dates with faulty data [analytics/wikistats] - https://gerrit.wikimedia.org/r/78828 (owner: Erik Zachte)
[13:12:31] (CR) Erik Zachte: [C: 2 V: 2] added rsync to web site [analytics/wikistats] - https://gerrit.wikimedia.org/r/78829 (owner: Erik Zachte)
[13:12:34] Both Mobile and Ops tweak the varnish configuration quite frequently
[13:12:47] We are being held accountable for reliable data
[13:12:51] (CR) Erik Zachte: [C: 2 V: 2] new backup file for non-private squid log data, for now to stat1 [analytics/wikistats] - https://gerrit.wikimedia.org/r/79779 (owner: Erik Zachte)
[13:12:53] Yes :-)
[13:12:59] but we are not part of these changes
[13:13:03] nor planning
[13:13:10] and we are the ones that detect issues
[13:13:15] but cannot resolve them either
[13:13:16] MOORning
[13:13:20] in da haouuuuse
[13:13:28] but very soon moving to DA CAFEEEE
[13:13:51] drdee: Yes. That's true.
[13:14:01] aiiigght I AM passing the MIC to Ottomata, coz he's got the yayayayyaamaataaa
[13:14:52] qchris: it's more of an organizational question but something to think about (IMHO)
[13:15:40] drdee: Yes. I fully agree. Remember when I pressed hard to get a response on "What's the authoritative source" around Wikipedia Zero.
[13:15:51] drdee: That was due to the same motivation.
[13:16:03] drdee: We cannot fix those problems.
[13:16:05] yup, well that's varnish at the end of the day
[13:16:15] sure but we can start a conversation around these problems
[13:16:25] and see if we can make things better (tm)
[13:16:28] drdee: But I am pleased if we at least detect the problems before we lose a month's worth of data.
[13:16:40] yahhh that's a nice improvement ;)
[13:16:40] drdee: Yes, let's start this conversation.
[13:17:16] drdee: I am all for defining responsibilities of teams.
[13:17:42] k, let me draft an email about this and i will give it to you for feedback
[13:17:47] ok?
[13:17:49] Sure.
[13:19:31] so what was the problem?
[13:19:40] varnish configs had changed?
[13:20:03] The mobile redirect was broken or something like that
[13:20:06] hm
[13:20:10] Let me grab the change.
[13:20:19] https://gerrit.wikimedia.org/r/#/c/79563/
[13:20:42] But I am too fresh on the team to understand such things :-)
[13:21:07] So feel free to correct me on card #1072.
[13:28:41] oh seeing email thread
[13:28:42] reading
[13:29:47] yeah, very nice qchris
[13:37:22] that was great work by qchris wasn't it?
[13:41:43] ja totally
[13:41:55] ok, emails conquered, o
[13:41:59] i'm off to a cafe, back in a bit
[13:53:52] (CR) Milimetric: "(1 comment)" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/79457 (owner: Stefan.petrea)
[13:56:41] (PS1) Milimetric: making the report results backwards compatible [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/79787
[13:57:32] (CR) Milimetric: [C: 2 V: 2] making the report results backwards compatible [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/79787 (owner: Milimetric)
[15:05:05] hi !
[15:21:57] hey average
[15:23:40] hi drdee
[15:24:00] milimetric reviewed your code, i think it's a small change that's required
[15:24:06] any luck with the oauth bug?
[15:24:48] just started today, will be looking at the oauth bug
[15:25:02] saw milimetric's review, will send a new patchset
[15:25:46] cool, I'm just working on unit tests
[15:26:01] k
[15:51:42] average, did you ask qchris about the openddr tree file and where he got it from?
[15:55:46] ottomata, what's the status with reinstalling the hadoop worker nodes?
[15:57:25] been heads down in Camus, so haven't looked at it
[15:57:42] need to schedule with lesliecarr anyway, and she was still traveling a bit last week
[15:57:47] we have ops meeting today, so i'll ask about it then
[16:00:54] drdee, average: The 1.13 dtree file? I got it from drdee when I joined the analytics team.
[16:01:27] ottomata, need to schedule reinstalling hadoop worker nodes with Leslie?
[16:01:29] why?
[16:01:42] qchris: mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
[16:02:37] qchris, average: then that file must have come from the initial debian dclass package
[16:02:49] i don't know how else i would have gotten it
[16:03:35] oh well, wait
[16:03:38] we are talking about repaving right?
[16:03:46] we put the whole hadoop worker node reinstall thing on hold
[16:03:51] i just stopped working on that since we are going to repave
[16:03:53] right?
[16:04:59] what's the difference between repaving and reinstalling?
[16:06:30] repaving means removing all data
[16:06:45] reinstalling is a looong process of one data node at a time, and rebalancing the data
[16:07:07] also repaving means kafka 0.8
[16:07:12] but we will keep the historic aggregated data?
[16:07:29] we could load it back in, I guess, but i wasn't planning on it
[16:07:34] maybe keep it on stat1002 like it is now
[16:07:40] k
[16:11:16] drdee: yeah, but the file is nowhere to be found in the repo
[16:11:41] now I was looking over openddr_client.c , I was expecting to find an easy way of converting OpenDDR => dtree
[16:11:44] no such luck so far
[16:12:00] I was also expecting to see some xml parsing code, no trace of that
[16:12:09] maybe it came from https://github.com/TheWeatherChannel/dClass/tree/master/dtrees
[16:12:10] seems like upstream left it out for some reason
[16:12:46] see https://github.com/TheWeatherChannel/dClass/commit/055300091256d99436c0b955b99522613bb19fa9
[16:15:35] drdee: different checksums
[16:16:03] did you check all dtrees from TheWeatherChannel? it must have come from there
[16:16:09] everything
[16:16:13] it's not from that repo
[16:17:58] i find that hard to believe, where else would we have gotten the dtree file from?
[16:18:52] there don't seem to be any other sources
[16:18:55] I don't know
[16:20:11] user@user-K56CM:~/wikistats/where-dtree-comes-from/dtrees$ md5sum *
[16:20:12] f494a49efe13e443a92b0fd1e9624337 openddr.dtree.commit.message.claims.1.13
[16:20:15] bdc444bc5df67afd946cebab1cb5628a openddr.dtree.from.qchris.on.email
[16:20:17] user@user-K56CM:~/wikistats/where-dtree-comes-from/dtrees$ diff openddr.dtree.commit.message.claims.1.13 openddr.dtree.from.qchris.on.email | wc -l
[16:20:21] 15693
[16:22:33] we took the https://github.com/TheWeatherChannel/dClass and you built the JNI around it, i don't remember that we ever looked for any other dtree, i am pretty sure that we just took the one that was shipped with dclass
[16:34:30] apparently dclass_client has the ability to turn OpenDDR => dtree
[16:34:32] git rev-list --all | perl -ne 'BEGIN {`rm -f /tmp/o/*`;}; chomp; `git checkout $_; cp resources/*.xml /tmp/o; ../dClass/src/dclass_client -d "/tmp/o" -o /tmp/o/openddr.dtree; cp /tmp/o/openddr.dtree ../all-dtrees-from-openddr-official/openddr.dtree.$_ `; '
[16:34:43] this oneliner gets all revisions from the OpenDDR official repo
[16:34:47] and turns them into dtrees
[16:34:57] appending the commit version to each and every one of them
[16:35:05] now it's time for the md5sums
[16:36:09] no matches
[16:36:32] the dtree we're using is bigger than all the official ones, and doesn't MD5-match any of the official ones
[16:37:18] the dtree we're using is 2.3mb, the official dtrees are all <= 2.2mb
[16:39:02] did you create the dtree yourself manually?
[16:39:25] I did not, I took the one that was there
[16:44:17] was where?
[16:47:21] it was in my local clone of the official dClass repo
[16:51:23] so then we know where the file is coming from, right? or am i missing something?
[16:57:13] (CR) Stefan.petrea: "It's the result of a compilation process. The dclass_client binary is able to convert OpenDDR xmls into dtree file format." [analytics/dclass] (wikimedia) - https://gerrit.wikimedia.org/r/79453 (owner: Stefan.petrea)
[16:58:28] drdee: yes, I added more details for Faidon in the gerrit review
[17:00:31] average: scrum
[17:00:37] joining
[18:04:02] (CR) Stefan.petrea: "It is Wikimedia-specific in that we were using it before and there are reports built using it, so we still need it in the new packages." [analytics/dclass] (wikimedia) - https://gerrit.wikimedia.org/r/79453 (owner: Stefan.petrea)
[18:48:16] I have access to stat1 and wish to query the NavigationTiming_5336845 table; how do I get access to the MySQL server?
[18:50:58] mwalker: check https://office.wikimedia.org/wiki/Data_access
[18:52:37] drdee: right; but I need the password for the research user
[18:52:52] 1 sec
[18:53:00] that was not the original question ;)
[18:53:23] hehe; sorry -- I didn't know if I needed my own account on the servers; or if everyone just used research
[18:53:27] so I was trying to be generic
[18:54:14] pm'ed you
[18:55:07] shiney; in and off to the races
[18:55:10] thanks drdee, i was talking to him via pm too, wasn't sure of the proper policy for handing that out
[18:55:33] i think that's how we do it
[19:42:37] (PS1) Stefan.petrea: Adding older dtree file [analytics/dclass] (debian) - https://gerrit.wikimedia.org/r/79911
[20:02:10] (Abandoned) Stefan.petrea: Adding older dtree file [analytics/dclass] (debian) - https://gerrit.wikimedia.org/r/79911 (owner: Stefan.petrea)
[20:05:33] (PS2) Stefan.petrea: Adding older dtree file [analytics/dclass] (wikimedia) - https://gerrit.wikimedia.org/r/79453
[20:20:16] As ottomata just quit, can anyone provide me with the Oozie URL?
[20:20:20] (analytics1027 responds to "oozie jobs" with "Oozie URL is not available neither in command option or in the environment")
[20:30:36] milimetric: Do you happen to know the Oozie URL? ^
[20:31:17] hm, it used to be that we used an ssh tunnel
[20:31:26] and used localhost:8888
[20:31:36] or whatever depending on the service we were accessing
[20:31:43] but I think now oozie is somewhere else
[20:31:51] Oh. Ok. Thanks.
[20:32:01] I guess I'll have to wait for ottomata then.
[20:32:03] yeah, sorry, I don't remember
[20:32:11] np
[20:32:24] but you can look at oozie through hue
[20:32:27] (web interface)
[20:32:33] Good to know I am not the only one who cannot connect to oozie.
[20:32:40] Oh :-)
[20:37:00] ottomata, I'd need to check if a job in kraken's oozie directory is/was in oozie, but "oozie jobs" on analytics1027 gives
[20:37:12] "Oozie URL is not available neither in command option or in the environment", and Hue's connector to oozie
[20:37:23] only shows two default jobs.
[20:38:02] ottomata: Is there an easy way to see what jobs are/were in oozie?
[20:38:14] ozie jobs -oozie http://analytics1027.eqiad.wmnet:11000/oozie
[20:38:18] oozie jobs -oozie http://analytics1027.eqiad.wmnet:11000/oozie
[20:38:35] cool.
[20:39:02] That URL worked. Thanks ottomata
[20:49:20] yup!
[20:49:29] in the new cluster, after repaving and using new puppet modules
[20:49:34] that url will be set for you
[20:49:42] so you don't have to type it in every time
[21:07:18] average, it looked like you were busy with dclass but did you get a chance to finish up that test?
[21:07:21] or do you need any help?
[21:07:25] I'm wrapping up for the day
[21:28:44] re kafka compression: are you guys dead-set on snappy, or will gzip do (initially)?
[21:30:38] the snappy-c library in ubuntu is a bag of crisps. There is a better one by andi kleen but it's not debianized from what I can find
[21:30:49] not dead set, but snappy seems to be what people usually use
[21:30:51] that or LZO?
[21:31:01] hm
[21:31:07] you only get to pick what's in kafka, otto
[21:31:32] ha, eh?
[21:31:39] i mean, so
[21:31:47] we are definitely going to have to inspect the timestamp of the content
[21:31:48] so
[21:31:54] ohoh
[21:31:57] you mean what kafka supports
[21:31:58] right :)
[21:32:00] heh
[21:32:07] !
[21:32:15] so, yeah, we're going to have to look at the timestamp of the content as we consume into hdfs
[21:32:32] so we have to uncompress anyway
[21:32:46] we can recompress back into hdfs in whatever we want
[21:32:46] i guess
[21:32:58] unless… we figure out that kafka key timestamp thing
[21:33:03] that could be really awesome
[21:33:14] *if* we can do that, then that would mean that gzip is probably not good
[21:33:20] i think, i'm not entirely sure
[21:33:34] i've read a bit that gzip + json + hdfs is a bit unseemly
[21:33:39] since gzip isn't easily splittable
[21:33:41] unless you use the kafka message key for timestamp (and partitioning, but that's what you want anyway, yeah?)
[21:33:50] right, yeah
[21:33:55] we should talk about that more
[21:33:56] to see if that works
[21:34:02] okay okay
[21:34:04] i only have a loose understanding of how that would work
[21:34:14] i think it will though
[21:34:25] right now, what do you set as the key?
[21:34:28] anything?
[21:35:25] nothing, not using it.
[21:35:27] it is optional.
[21:35:33] right
[21:35:34] hm
[21:35:36] so conceptually it's like this: the application calls produce(topic, optional key, message payload). the key is used for partitioning and is retained with the message all through kafka. the consumer application is handed message tuples of { key, message payload }
[21:35:57] aye, so, if I have the key right there with the timestamp in it
[21:36:01] i can bucket it to whatever I want
[21:36:07] now, compression in kafka is performed on messagesets, a bunch (1 or more) of messages.
[21:36:08] without uncompressing or parsing the payload
[21:36:18] ohoh
[21:36:21] ok, continue
[21:36:29] and the consumer library will uncompress the messages before handing them to the application
[21:36:58] so if we let kafka do the compression you will get the uncompression even if you don't want it.
[21:37:31] and there is no point in compressing just a single message.
[21:38:23] What is your main goal with compression? network load or storage?
[21:38:24] hmmmm right totally
[21:38:25] ok
[21:38:39] not storage, we can deal with that
[21:38:44] i think it's network load, but you'd have to ask mark to be sure
[21:39:14] aye ok, so the compression then doesn't matter for us, because we get the values uncompressed no matter what
[21:39:18] ok
[21:39:24] the advantage then of the keys
[21:39:28] is just so we don't have to parse the json
[21:39:34] yah
[21:39:45] it would be weird to set the timestamp as is in the key though, right?
[21:39:58] that would be basically an unlimited number of different keys
[21:40:07] which is not really what that is meant for, right?
[21:40:09] or does it matter?
[21:40:13] I'd say so, unless you want to actually partition on the timestamp, which I think you mentioned?
[21:40:27] keys have no meaning to kafka past partitioning, so it being unique or not is not an issue.
[21:40:34] perhaps, but in that case we'd probably make our own partitioner, right?
[21:40:43] num partitions % by the timestamp
[21:40:47] or whatever?
[21:40:50] yeah
[21:41:00] i guess if we use the random partitioner it doesn't matter anyway?
[21:41:26] exactly, it will have the same effect, over time.
[21:41:32] ja ok
[21:41:33] hm
[21:41:42] timestamp partitioning would tend to be quite bursty per partition and not very good for performance I'd imagine
[21:41:49] true, ja
[21:41:57] all messages going to one broker for 1 second, next second another broker.
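
(Editor's aside: the burstiness just described is easiest to see in code. Below is a purely hypothetical librdkafka partitioner callback implementing the "num partitions % timestamp" idea from the conversation; nothing like it was deployed, and the function and buffer names are made up. Every message produced within the same second maps to the same partition, which is the hot-spot behaviour being pointed out.)

    /* Hypothetical timestamp partitioner -- illustrates "num partitions % timestamp"
     * and why it is bursty.  Sketch only, not the partitioner that was used. */
    #include <stdlib.h>
    #include <string.h>
    #include <librdkafka/rdkafka.h>

    static int32_t timestamp_partitioner(const rd_kafka_topic_t *rkt,
                                         const void *key, size_t keylen,
                                         int32_t partition_cnt,
                                         void *rkt_opaque, void *msg_opaque) {
        char buf[32] = {0};
        (void)rkt_opaque;
        (void)msg_opaque;

        if (!key || keylen == 0 || partition_cnt <= 0)
            return RD_KAFKA_PARTITION_UA;   /* signal an unassignable message */

        /* Assume the key is a unix-epoch timestamp rendered as ASCII digits. */
        if (keylen >= sizeof(buf))
            keylen = sizeof(buf) - 1;
        memcpy(buf, key, keylen);

        /* All messages from the same second hash to the same partition, so one
         * broker takes the whole firehose for that second -- the burstiness
         * described above. */
        int32_t partition = (int32_t)(strtol(buf, NULL, 10) % partition_cnt);

        /* Only return partitions that are currently available. */
        if (!rd_kafka_topic_partition_available(rkt, partition))
            partition = (partition + 1) % partition_cnt;

        return partition;
    }

    /* Registered on the topic configuration before rd_kafka_topic_new(): */
    static void use_timestamp_partitioner(rd_kafka_topic_conf_t *tconf) {
        rd_kafka_topic_conf_set_partitioner_cb(tconf, timestamp_partitioner);
    }

The random partitioner sidesteps this hot spot while still delivering the key, untouched, to consumers, which is where the conversation lands next.
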
[21:42:01] random is all we need, we don't need any semantic partitioning [21:42:32] you are still allowed to use the key though, you just make sure to use a partitioner that doesnt care about the key, thats all, like the random partitioner. [21:42:37] so, having the timestamp in the key doesn't hurt anything if we use random, but would allow us to bucket in hdfs wihtout having to parse the payload [21:42:51] yep :) [21:43:09] i know you love making things configurable……..how hard is it to make the key configurable, similar to the way the format is? [21:43:29] like, maybe using the same format string that you've already got there [21:43:35] :D [21:45:12] I could add "format.key = madness %t stuff", supporting the same stuff as "format" does. [21:45:40] that would be pretty awesooome, especially if one day we need something else for the key [21:45:50] that way we don't have to call you every time we change our minds :) [21:46:42] ooh, the peace.. [21:46:43] ;) [21:47:02] they key wont be used by any partitioner though [21:47:06] right [21:47:07] (that can be added later, of course) [21:47:15] it will if we configure brokers to do so, right? [21:47:20] or is that a producer thing [21:47:24] its on the producer [21:47:26] ohhhh [21:47:28] really? [21:47:29] hm [21:47:59] oh weird [21:47:59] hm [21:48:00] the producer must know which broker is handling what broker+partition combo, and send the message to that broker. [21:48:10] Snaps: debianizing andy kleen's library should not be unsurmountable objection, is the library itself good? [21:48:14] how's that supposed to work in other clients then? [21:48:21] like, C? [21:48:27] is there an API for partitioners in C? [21:48:33] did you have to write that in librdkafka? [21:48:55] ottomata: the client queries any broker for a full list of brokers and topics and partitions and then keeps that list updated. [21:49:09] librdkafka provides an API to let the application define its own partitioners, yeah. [21:49:40] hmm ok [21:49:42] aye [21:49:42] makes sense [21:50:34] drdee: I will give it a go and see if it fits the bill, but Ive good hopes since he's some kind of kernel hat [21:52:16] yeah, also, i think paravoid is on vacation for a bit [21:52:27] which will make our debian questions be open ended maybe? [21:52:29] for a little while [22:01:37] (PS1) Milimetric: added 10 tests, increased coverage, fixed issue with demo controller [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/79935
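
(Editor's aside: putting the Kafka discussion above together, a producer along the lines agreed on would look roughly like this with librdkafka: message sets compressed by the library, the random partitioner ignoring the key, and the timestamp carried as the message key so a consumer can bucket by time without parsing the JSON payload. This is an illustrative sketch, not varnishkafka itself: the broker address, topic name, key format and the "compression.codec" property value are placeholder assumptions, the latter being how later librdkafka releases expose the codec choice.)

    /* Sketch of a producer matching the scheme discussed above:
     * compressed message sets, random partitioner, timestamp as message key.
     * Broker address, topic name and key format are placeholder assumptions. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <librdkafka/rdkafka.h>

    int main(void) {
        char errstr[512];

        /* Producer configuration: let the library compress whole message sets.
         * gzip here per the conversation; snappy would be configured the same way. */
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        if (rd_kafka_conf_set(conf, "compression.codec", "gzip",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
            fprintf(stderr, "conf: %s\n", errstr);
            return 1;
        }

        /* Topic configuration: random partitioner, so the key is retained with
         * the message but never consulted for partitioning. */
        rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();
        rd_kafka_topic_conf_set_partitioner_cb(tconf, rd_kafka_msg_partitioner_random);

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
            fprintf(stderr, "producer: %s\n", errstr);
            return 1;
        }
        rd_kafka_brokers_add(rk, "localhost:9092");                          /* placeholder broker */
        rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "webrequest", tconf); /* placeholder topic  */

        /* Key: request timestamp; payload: the JSON log line.  A consumer can
         * bucket by time using only the key, without parsing the payload. */
        char key[32];
        time_t now = time(NULL);
        strftime(key, sizeof(key), "%Y-%m-%dT%H:%M:%SZ", gmtime(&now));
        const char *payload = "{\"hostname\":\"cp1001\",\"uri\":\"/wiki/Main_Page\"}";

        if (rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_COPY,
                             (void *)payload, strlen(payload),
                             key, strlen(key), NULL) == -1)
            fprintf(stderr, "produce failed\n");

        /* Serve delivery callbacks and wait for the send queue to drain. */
        while (rd_kafka_outq_len(rk) > 0)
            rd_kafka_poll(rk, 100);

        rd_kafka_topic_destroy(rkt);
        rd_kafka_destroy(rk);
        return 0;
    }

On the consuming side, librdkafka hands back the key alongside the already-decompressed payload (the key/key_len and payload/len fields of rd_kafka_message_t), which is what makes time bucketing possible without touching the JSON.
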