[00:03:25] no, def not.
[00:03:59] the pig UDF is currently doing that -- freeing and reloading on every record.
[00:04:01] ^^ average_drifter
[00:04:11] i came up with a clever hack for the unloading problem
[00:04:24] my static reference deallocates itself via finalize :)
[13:28:33] ottomata: hi
[13:28:34] ottomata: when was tab-separation introduced?
[13:28:42] somewhere in february?
[13:28:56] january 25
[13:28:57] ?
[13:29:06] according to this e-mail
[13:29:07] i think around feb 1
[13:29:15] [Analytics] RFC: Tab as field delimiter in logging format of cache servers
[13:29:47] jan 31
[13:29:47] https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log#January_31
[13:30:58] morning
[13:35:49] drdee: morning
[13:36:09] morning average_drifter: how is the report generation moving along?
[13:36:31] drdee: smoothly, will roll out november => february very soon
[13:36:35] drdee: in the meantime http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r38-kraken-logic/pageviews.html
[13:36:49] drdee: which is 08 => 12 2012
[13:36:56] bump doesn't show
[13:37:15] values for english are close to wikistats ( http://stats.wikimedia.org/EN/TablesPageViewsMonthlyMobile.htm )
[13:37:57] k
[13:38:35] after that, I'll have to figure out a way to export the data for wikistats and limn
[13:41:15] just wikistats
[13:41:16] not limn
[13:41:18] (for now)
[13:47:12] ok
[13:55:22] hmm, we should maybe have wm-bot put out gerrit changes to analytics/* here
[13:55:45] shall I ask ^demon for it?
[13:55:58] (unless you guys consider that spam)
[13:57:35] i like it
[13:57:44] ottomata, milimetric, what ya think ^^
[13:58:57] sure
[14:04:44] mmm... kinda spam for me, but i'm ok with it if you guys are
[14:06:57] i think it's good to improve visibility and reduce manual copy/paste
[14:07:34] average_drifter, do you know if you and dschoon figured out the pig memory issue?
[14:08:00] i think they figured out the cause and that dschoon is working on a fix
[14:08:24] k
[14:08:29] yeah, that was nasty
[14:08:47] did dschoon stop his job for now? i'd like to run mine :)
[14:09:14] check job server :)
[14:09:33] I'm curious to see his fix because I'm sure we're not the only ones facing this
[14:11:07] there are plans from the original dclass authors to reimplement it in java
[14:12:57] https://gerrit.wikimedia.org/r/#/c/55057/
[14:13:07] cool, he suspended them
[14:13:09] well, as far as I can tell, that wouldn't help drdee
[14:13:30] i am pretty sure it would :)
[14:13:33] the problem is how to get pig to share a single instance of the object that does the work
[14:13:53] whether that instance is a wrapped C object or plain Java, it doesn't matter
[14:14:14] what matters is that it's loading 2MB every time it's created
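What the singleton fix under discussion looks like in practice: a Pig EvalFunc that keeps one shared, lazily-loaded instance per JVM instead of reloading the ~2MB dClass device tree for every record. This is a minimal sketch, not the actual Kraken UDF -- `DeviceClassifier`, its load path, and `classify` are hypothetical stand-ins for the real dClass wrapper:

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ClassifyUdf extends EvalFunc<String> {

    /** Hypothetical stand-in for the real dClass wrapper. */
    static class DeviceClassifier {
        static DeviceClassifier load(String path) throws IOException {
            // the real version would parse the ~2MB device tree here
            return new DeviceClassifier();
        }
        String classify(String userAgent) { return "unknown"; }
    }

    // One classifier per JVM, shared by every UDF instance in the task,
    // instead of a fresh (2MB) load on every record.
    private static DeviceClassifier classifier;

    private static synchronized DeviceClassifier get() throws IOException {
        if (classifier == null) {
            classifier = DeviceClassifier.load("devicemap.dtree"); // hypothetical path
        }
        return classifier;
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return get().classify((String) input.get(0));
    }
}
```

Because the reference is static, it makes no difference whether the instance wraps a C object or is plain Java, exactly as said above: there is one load per task JVM. The "deallocates itself via finalize" hack from 00:04:24 would additionally free native memory when the JVM tears the class down.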
[14:18:51] drdee, at the end of that nginx udp2log ticket
[14:19:16] my last question was if we should add X-Forwarded-Proto as a log field
[14:19:24] to be able to detect ssl requests
[14:19:36] in the meantime, I think we should stop logging from the nginx servers altogether
[14:19:47] we can choose to turn on logging of X-Forwarded-Proto when we need it
[14:19:56] whatcha think?
[14:20:13] let me think about it while i am getting my coffee and bread :)
[14:20:34] dschoon: what is the status of the datasource metadata blurb feature?
[14:23:38] but not logging ssl sounds like the wrong solution; more and more traffic is routed through ssl
[14:25:09] but it's all duplicate anyway
[14:25:15] ssl is just a proxy
[14:47:40] drdee, i know i've asked you this before
[14:47:43] http://localhost:19888/jobhistory/logs/analytics1015:8041/container_1363811768346_0283_01_000002/attempt_1363811768346_0283_m_000000_0/stats
[14:47:45] why is my job dying!?
[14:47:48] it works when run as pig
[14:49:15] hmm, same memory one?
[14:49:15] http://localhost:19888/jobhistory/logs/analytics1015:8041/container_1363811768346_0284_01_000002/attempt_1363811768346_0284_m_000000_0/stats/stderr/?start=0
[14:49:16] hmm
[14:49:17] i dunno
[14:56:29] which user are you using to run this job?
[14:58:10] ottomata: i do see 2013-03-21 14:43:50,309 INFO [Low Memory Detector] org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call- Usage threshold init = 139853824(136576K) used = 104857616(102400K) committed = 139853824(136576K) max = 139853824(136576K)
[14:58:19] AH rats
[14:58:19] so i think it's the same old memory issue
[14:58:22] HM
[14:58:26] but i'm not using dclass
[14:58:33] this is the webrequest loss script
[14:58:41] i'm trying to revive it
[14:58:45] right so that should be fixable by tuning the jvm
[14:58:53] but this job runs in pig just fine
[14:58:55] just not in oozie
[14:59:11] where did you see that mem error?
[15:00:16] http://localhost:19888/jobhistory/logs/analytics1017:8041/container_1363811768346_0284_01_000022/attempt_1363811768346_0284_m_000020_0/stats
[15:12:46] ottomata, about the nginx servers: how about sending their traffic to a separate udp2log instance so it does not pollute the packet loss metric but we still have the data in case of troubleshooting
[15:35:46] hm
[15:35:49] that'd be fine I think
[15:36:33] I'd like to say I love that this got into mingle: https://mingle.corp.wikimedia.org/projects/analytics/cards/113
[15:37:48] ottomata: yeah, i did.
[15:38:00] the static singleton fixed it
[15:38:16] i suspended all the device jobs last night
[15:38:28] i need to migrate them all to 0.0.2
[15:38:32] hm, i'm still having my same error :/
[15:38:45] i think the concat script eats memory
[15:38:49] try running without it
[15:40:28] you're talking about webreq hourly loss, right?
[15:40:32] oh hm, yeah the pig script ran without it fine
[15:40:34] that must be it
[15:40:43] wow, nice
[15:40:43] hmmmm, but it said my loss .pig action was the one that failed
[15:40:44] lemme try
[15:40:57] dschoon: so no more low memory errors now?
[15:41:06] that's because it reuses the jvm, i think
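For context on the SpillableMemoryManager line quoted at 14:58:10: Pig registers a listener for JVM usage-threshold notifications and spills its in-memory bags when one fires, and with a pool as small as the one shown there (max around 133 MB) the handler trips almost immediately -- which is why "tuning the jvm" (a bigger task heap) is the suggested fix. A minimal sketch of the underlying JVM mechanism; the class name and the 0.7 fraction are illustrative, not Pig's actual values:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import javax.management.Notification;
import javax.management.NotificationEmitter;

public class LowMemoryWatcher {
    public static void install() {
        // Arm a usage threshold on each heap pool that supports one
        // (0.7 is an illustrative fraction, not Pig's tuning).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                pool.setUsageThreshold((long) (pool.getUsage().getMax() * 0.7));
            }
        }
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((Notification n, Object handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
                // This is the point where Pig spills in-memory bags to disk.
                System.err.println("memory handler call: " + n.getMessage());
            }
        }, null, null);
    }
}
```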
[15:41:37] this actually sounds like a great time to try out ambrose
[15:41:47] ottomata: https://github.com/twitter/ambrose
[15:41:56] i don't know if we can launch oozie jobs with it
[15:42:02] but it should be possible.
[15:42:05] it's just a pig wrapper
[15:42:45] ah!
[15:42:49] milimetric: I'd like to say I love that this got into mingle: https://mingle.corp.wikimedia.org/projects/analytics/cards/113
[15:42:52] scheduled for release in pig 0.11.0
[15:42:53] hot
[15:43:03] i think we should tackle this in amsterdam, IMHO
[15:43:06] yeah, i've had that on my todo for a long time
[15:43:12] launch jobs?
[15:43:13] that'd be awesome drdee
[15:43:16] i thought it was just a pretty resource monitor
[15:43:23] we already promised limn - mediawiki goodness though
[15:43:27] it shows the stages
[15:43:51] so maybe try launching JUST the concat script with ambrose?
[15:43:58] for amsterdam?
[15:43:59] yeah, milimetric
[15:44:06] we ought to start prep for that, heh
[15:44:10] yes drdee
[15:44:10] it's in may
[15:44:33] mmmm, i don't remember that at all
[15:45:01] our Limn MW Ext sprint got accepted.
[15:45:02] i'm not sure if it was made public, that's what we applied for travel funding with
[15:45:09] dschoon, naw job still failed
[15:45:11] same reason
[15:45:13] even without concat
[15:45:14] WAIT
[15:45:16] ?
[15:45:18] really??
[15:45:19] :)
[15:45:21] he said wait
[15:45:22] Pig version 0.10.0-cdh4.1.2
[15:45:26] !
[15:45:26] >>>>....
[15:45:28] seeeee
[15:45:29] okay
[15:45:38] hadoop version 2.0.0-cdh4.2.0
[15:45:38] I totes poked you about this
[15:45:40] the reason is
[15:45:45] about pig version?
[15:46:03] oh, i think i opened a thing in asana to remember to do something with it
[15:46:08] look in /user/oozie/libs
[15:46:16] you'll find a bunch of jars from CDH 4.1.2
[15:46:20] OH
[15:46:24] they need to be updated, but i have no idea with what
[15:46:26] i remember you and drdee talking about that
[15:46:27] OHHHHH
[15:46:27] right
[15:46:28] oh
[15:46:29] i know
[15:46:51] we should make a checklist for upgrading CDH*
[15:46:52] https://ccp.cloudera.com/display/CDH4DOC/Oozie+Installation#OozieInstallation-ConfiguringOozieafterUpgradingfromanEarlierCDH4Release
[15:47:00] this is what we get for doing an unplanned upgrade
[15:47:07] cdh4 has pretty good docs on it
[15:47:14] i just didn't follow them because half were already done
[15:47:22] i should go through them and make sure i got everything
[15:48:03] ahh
[15:48:03] nice.
[15:48:09] that'd be great.
[15:48:18] hooray for stable environments.
[15:48:47] gm
[15:48:55] howdy
[15:52:03] Just a note: I updated the WIP features view to include items that were in shipping for last sprint, so we don't forget to ship what we showcased.
[15:52:53] you know, details.
[15:54:01] kraigparkinson: i see limn puppetization in WIP now, should I work on that?
[15:54:33] ottomata, is there anything left to do for #68, #154, or #244 in Shipping? :)
[15:55:29] ooh, kraigparkinson, and don't think the general cleanup of the tabs and "favorites" has gone unnoticed :)
[15:55:29] haha
[15:55:30] "If you are upgrading a cluster that is part of a production system, be sure to plan ahead."
[15:56:03] but kraigparkinson, I don't see 68, 154, and 244 in shipping...
[15:56:25] milimetric, I've been trying to make the names more consistent. hope it's useful. keep the feedback coming.
[15:56:33] milimetric, refresh the page...
[15:56:53] :) I'm a web developer. I always refresh the page and clear the cache like 3 times before saying anything
[15:56:56] milimetric, I added our stuff from last sprint to the view so we don't forget.
[15:56:57] it's why I'm so slow
[15:56:57] ayy, dschoon, the workflow succeeded this time
[15:56:59] lol
[15:57:17] ottomata grats
[15:57:24] hooray for compatible software
[15:58:10] yeah, i think it's a good idea to keep last sprint's cards, but I don't see them on the WIP - Features tab.
[15:58:16] dschoon, ottomata, do you see them?
[16:09:25] --> office
[16:46:21] ottomata: are the nginx loss stats garbage, or are we legitimately getting loss from there? I'm assuming we never fixed the problem we had, and they're still garbage
[16:46:26] i'm pretty pleased the singleton trick worked for dclass
[16:46:34] what's this?
[16:46:38] i don't know, hm
[16:46:43] i assumed they were garbage
[16:46:46] oh, nginx?
[16:46:51] they're garbage, last i heard
[16:47:04] yeah but has anyone actually checked?
[16:47:05] if this is the multithreaded seq numbers in nginx
[16:47:09] yes it is
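Background for the exchange above: udp2log loss monitoring infers drops from gaps in the per-host sequence counters stamped on incoming log lines. A simplified sketch of that inference (field handling and names are illustrative), with a note on why nginx's multiple worker processes, each keeping its own counter, turn the estimate into garbage:

```java
import java.util.HashMap;
import java.util.Map;

public class SeqLoss {
    private final Map<String, Long> lastSeq = new HashMap<>();
    private long received = 0, lost = 0;

    // host plus the sequence number parsed from one of its log lines
    public void observe(String host, long seq) {
        received++;
        Long prev = lastSeq.put(host, seq);
        if (prev != null && seq > prev + 1) {
            // With a single counter per host, this gap really is loss.
            // With N nginx workers interleaving independent counters on
            // the same hostname, gaps appear constantly and the estimate
            // is meaningless -- the "multithreaded seq numbers" problem.
            lost += seq - prev - 1;
        }
    }

    public double lossPercent() {
        long expected = received + lost;
        return expected == 0 ? 0.0 : 100.0 * lost / expected;
    }
}
```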
[16:47:25] what changed in October on locke, though?
[16:47:26] no, but i recall a conversation where tim or mark mentioned we never got around to fixing it
[16:47:44] drdee, can you comment on faidon's update to the RT ticket?
[16:47:46] there is an untested patch by faidon
[16:47:50] http://rt.wikimedia.org/Ticket/Display.html?id=859
[16:47:56] yes i will
[16:48:05] danke
[16:48:06] I'm not sure I buy the theory that the nginx logging is what is causing the loss stats now
[16:48:31] we would have seen constant loss for the past two years, and that's not what we're seeing
[16:48:33] robla: look at erikz's charts that he sent, the packet loss is really coming from the ssl* boxes
[16:49:02] ottomata: did you update /user/oozie/libs?
[16:49:13] also look at the attached csv file
[16:49:17] drdee: have you looked directly at the loss logs on locke itself?
[16:49:20] er, /user/oozie/share/lib
[16:51:33] I'm pretty sure that Ben's monitoring script already filters out the ssl stuff
[16:57:14] we need to confirm this first then :)
[16:58:50] dschoon: you read the source code recently, do you remember anything that was related to handling ssl traffic?
[16:59:00] sec
[16:59:49] kraigparkinson: so the tiny little refresh icon on the tab has a meaning outside of "refresh the browser"
[17:00:38] yes.
[17:00:43] to milimetric, heh
[17:00:47] stack based :(
[17:00:55] drdee: i do not, but i will look again
[17:01:02] i don't know the context of this discussion though
[17:01:06] so let's talk about it in scrum?
[17:03:53] for ssl filtering, I'm not referring to the udp2log C source code. I'm referring to the monitoring scripts that I believe Leslie and Ben wrote.
[17:04:46] it's a weird little python thing that calculates the packet loss by parsing the packet-loss.log file
[17:12:43] dschoon, oozie, yes
[17:15:13] robla, in the PacketLossLogtailer, I don't see exceptions for ssl servers, only for excluding percentloss averages greater than 98%, and if the error margin is > 20
[17:15:20] i have examined packet-loss.log on locke
[17:15:26] ssls look like this:
[17:15:26] [2013-03-21T17:10:01] ssl1 lost: (70.94230 +/- 3.47306)%
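The monitor being described is the Python PacketLossLogtailer; rendered in Java to match the other sketches here, the parse of a packet-loss.log line like the one just quoted, plus the two exclusions described at 17:15:13 (drop readings above 98% loss or with an error margin over 20), comes out roughly as follows. The regex is fitted to the sample line above, not copied from the real script:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PacketLossLine {
    // Matches e.g. "[2013-03-21T17:10:01] ssl1 lost: (70.94230 +/- 3.47306)%"
    private static final Pattern LINE = Pattern.compile(
        "\\[\\S+\\]\\s+(\\S+)\\s+lost:\\s+\\(([\\d.]+)\\s+\\+/-\\s+([\\d.]+)\\)%");

    public static void main(String[] args) {
        String line = "[2013-03-21T17:10:01] ssl1 lost: (70.94230 +/- 3.47306)%";
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return;
        String host = m.group(1);
        double percentLoss = Double.parseDouble(m.group(2));
        double errorMargin = Double.parseDouble(m.group(3));
        // The two exclusions described in the channel. Note that neither
        // one filters out ssl* hosts by name, which is the point being
        // made above: the sample ssl1 reading passes both checks.
        if (percentLoss > 98.0 || errorMargin > 20.0) return;
        System.out.printf("%s %.5f%%%n", host, percentLoss);
    }
}
```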
[17:23:43] does anyone know where the source for our udp squid logging patch lives?
[17:31:03] come back!
[17:31:04] ottomata: !
[17:31:06] oh
[17:31:38] for the record, squid patches are here: https://gerrit.wikimedia.org/r/gitweb?p=operations/debs/squid.git;a=tree;f=debian/patches;h=b9b7ea7bda652bffa17b83f10f348ecba12718a8;hb=HEAD
[17:37:03] re: the Mingle view "Features with Questions" are there still open Qs? or can we remove the tags? https://mingle.corp.wikimedia.org/projects/analytics/cards?favorite_id=747&view=Features+with+Questions
[17:42:38] sorry!
[17:46:57] ottomata, dschoon: can you both have a look at https://mingle.corp.wikimedia.org/projects/analytics/cards/155 and update if you see anything missing?
[17:47:15] ottomata, can we still set up the test env. in light of mark's comments?
[17:48:05] drdee, yes, but we may have to switch to consuming the multicast stream instead of a direct unicast one
[17:48:09] shouldn't make a difference for testing purposes
[17:48:23] ok
[17:48:36] i will add that as a user acceptance criterion, okay?
[17:49:15] what?
[17:49:24] consume from multicast?
[17:49:25] maybe
[17:49:27] asher told me not to
[17:49:32] now mark and asher need to agree
[17:49:49] no, setting up the test env. to determine load profiles for the udp2log machines
[17:53:01] oh sigh, ok
[18:01:30] ottomata, dschoon: Spare capacity of udp2log machine metric: https://mingle.corp.wikimedia.org/projects/analytics/cards/425
[18:01:46] dscoon is in a meeting
[18:01:50] dschoon
[18:12:05] brb
[18:33:20] awesome
[18:33:22] ty drdee
[18:33:33] will review all this stuff in a few -- starved
[18:33:35] need foods. brb
[18:47:36] kk
[18:56:39] hey dschoon, when you have a sec, could we look at the whole jar hell ssh hell thing?
[18:56:49] yes.
[18:57:00] maybe 15 or so?
[18:58:14] ok
[18:58:23] i'll ping you at 15:15
[19:01:20] kraigparkinson: https://plus.google.com/hangouts/_/890cfc26fefa88c6ab97f75eda8c19cc5cdb4fc6
[19:01:46] a few graphs of note:
[19:01:47] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=Packet+Loss+90th&vl=%25&x=&n=&hreg%5B%5D=(locke%7Cemery%7Coxygen)&mreg%5B%5D=packet_loss_90th&gtype=line&glegend=show&aggregate=1
[19:02:24] DarTar: https://plus.google.com/hangouts/_/890cfc26fefa88c6ab97f75eda8c19cc5cdb4fc6
[19:02:30] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=Packet+Loss+Average&vl=%25&x=&n=&hreg%5B%5D=(locke%7Cemery%7Coxygen)&mreg%5B%5D=packet_loss_average&gtype=line&glegend=show&aggregate=1
[19:02:35] i'll add those to the ticket
[19:02:37] ottomata: ^^
[19:03:03] those are graph groups for oxygen|locke|emery for packet loss avg and 90th
[19:03:56] NICE!
[19:03:59] this graph in particular bears out what ezachte was saying: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=Packet+Loss+90th&vl=%25&x=&n=&hreg[]=%28locke%7Cemery%7Coxygen%29&mreg[]=packet_loss_90th&gtype=line&glegend=show&aggregate=1
[19:04:08] you see a huge increase after oct
[19:04:26] we should try to correlate that with a deployment
[19:04:34] yeah
[19:04:35] 90th totally
[19:04:42] that's the ssls i'm pretty sure
[19:04:43] check this
[19:04:45] https://gist.github.com/ottomata/5215342
[19:04:47] i think before that, tho, we need to precisely know what that number means
[19:04:49] hangout?
[19:04:52] sure
[19:04:54] i know what it means
[19:04:57] yeah
[19:05:01] that script is in puppet
[19:05:03] drdee, dartar, rfaulkner: https://plus.google.com/hangouts/_/890cfc26fefa88c6ab97f75eda8c19cc5cdb4fc6
[19:05:06] ahh
[19:05:07] nice
[19:05:09] aggregates
[19:05:13] in scrum
[19:05:29] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc
[19:05:30] k
[19:05:31] 1sec
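On "aggregates": the two ganglia metrics linked above, packet_loss_average and packet_loss_90th, summarize the per-host readings. A minimal sketch of how such aggregates behave, using a nearest-rank percentile (the puppetized script's exact method may differ) -- note how a slice of high-loss hosts, like the ssl* boxes, yanks the 90th percentile up while most hosts sit near zero:

```java
import java.util.Arrays;

public class LossAggregates {
    public static void main(String[] args) {
        // Per-host loss readings; the two large values play the ssl* boxes.
        double[] loss = {0.3, 0.5, 70.9, 0.4, 0.2, 68.2, 0.4, 0.5, 0.3, 0.4};

        double avg = Arrays.stream(loss).average().orElse(0.0);

        double[] sorted = loss.clone();
        Arrays.sort(sorted);
        // Nearest-rank 90th percentile: with 2 of 10 hosts reporting huge
        // loss, this lands on one of them (68.2 here), so the 90th metric
        // jumps even though the typical host is fine.
        int idx = (int) Math.ceil(0.9 * sorted.length) - 1;
        double p90 = sorted[idx];

        System.out.printf("packet_loss_average=%.2f packet_loss_90th=%.2f%n", avg, p90);
    }
}
```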
[20:28:16] dschoon, any time for me?
[20:28:23] soon -- sending email about packetloss and then it's youuu
[20:28:27] are you blocked atm?
[20:28:27] k
[20:28:37] a little yea, ran out of side projects
[20:28:44] and emails
[20:29:02] I do have a bunch of articles about twitter you guys sent months ago :)
[20:29:18] okay.
[20:29:21] <10m
[20:29:23] hopefully :)
[20:50:19] dschoon, how much of the data that we just talked about are you writing up?
[20:50:35] i'm adding links to the graphs
[20:50:38] i'm writing an email that summarizes the data from my findings and ezachte's corroboration
[20:50:40] almost done
[20:50:42] k cool
[20:50:43] k
[20:50:51] gimme 5 and you'll see
[20:52:49] k
[20:53:05] dschoon, drdee, should we move this discussion over to ops or analytics list?
[20:53:09] there's no reason to have it private, right?
[20:53:16] good point.
[20:53:27] i'll make mine an email to both
[20:53:31] k danke
[20:53:34] (as well as cc the old recipients)
[20:53:37] cool
[20:59:15] so where does one go to find information about the mediawiki database table schema?
[20:59:18] answer: http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png
[21:06:41] dschoon I am itching to press send!
[21:06:41] hehe
[21:06:42] :)
[21:20:37] sent
[21:20:44] ottomata, check your email :)
[21:21:52] kraigparkinson: I'm going to update the card with that email.
[21:23:28] dschoon, amen
[21:26:44] kraigparkinson: ottomata and i also noted some potential followups to investigate/improve monitoring around this
[21:26:58] what kind of card(s) should i make for those?
[21:27:31] cool, me sent too!
[21:28:04] bleh
[21:28:11] did my message go through?
[21:28:13] yes
[21:28:20] it said some crap about moderator approval
[21:28:25] speaking of -- who is the moderator?
[21:28:26] drdee: ?
[21:28:27] oh but you cced me
[21:28:29] yeah i got that too
[21:28:34] i dunno
[21:29:48] kraigparkinson: could you look into that?
[21:29:58] seems like something you should also have power over
[21:30:22] tho, really, it might be prudent to make all of us list moderators, just to avoid any particular person having to be around
[21:34:28] gusssyysysysy, i'm outtty
[21:34:31] lata
[21:43:10] dschoon, just catching up
[21:44:11] dschoon, re cards: go ahead and use your best judgment re: feature vs improvement idea vs problem
[21:44:31] I have a process for checking those before they get put into any backlog, so we'll realign if needed.
[21:44:32] k
[21:44:42] re: moderator approval, wha?
[21:44:49] just wanted to check, as there are now many options
[21:44:52] the analytics list is moderated?
[21:44:56] you have to be a member
[21:45:07] and it holds things on a bunch of criteria
[21:45:15] (like many recipients, which is what got my message)
[21:45:44] extraño
[21:46:16] ok, I'll look into who's on it and what the general policy is for who's a moderator.
[21:46:29] do we know who it is at the moment?
[21:47:30] is this a mild annoyance or a major pain?
[21:47:41] (salutes) Major Pain!
[21:58:06] dschoon ^^
[21:58:25] mild
[21:58:31] I think it's drdee atm?
[21:58:35] No idea how to find out.
[21:58:43] just curious what you knew
[21:58:44] thanks
[22:02:01] dschoon: question about zero files
[22:02:34] dschoon: for some reason, there aren't any requests coming from malaysia
[22:03:10] all the digi malaysia requests are being tagged as US
[22:03:25] huh!
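Country tagging of the kind at issue here normally comes from a GeoIP lookup on the request IP. A hedged sketch, assuming the legacy MaxMind GeoIP Java API -- whether the zero filters use exactly this path isn't established in the channel, and the comment flags one possible, unconfirmed explanation for Digi Malaysia traffic surfacing as US:

```java
import com.maxmind.geoip.Country;
import com.maxmind.geoip.LookupService;

public class CountryLookup {
    public static void main(String[] args) throws Exception {
        LookupService geo = new LookupService("/usr/share/GeoIP/GeoIP.dat",
                                              LookupService.GEOIP_MEMORY_CACHE);
        // Hypothesis, not confirmed in the log: if carrier traffic reaches
        // us through a proxy or gateway, the connecting IP can geocode to
        // the proxy's country (e.g. US) rather than the subscriber's, which
        // would make a whole carrier's requests vanish from its country.
        Country c = geo.getCountry(args[0]);
        System.out.println(c.getCode() + " " + c.getName());
    }
}
```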
[22:03:35] okay, milimetric is next on my list
[22:03:51] can you make a note for me to look into that tomorrow?
[22:03:51] hey dschoon
[22:03:53] hihi
[22:03:56] sorry about the wait
[22:04:06] no i was gonna say I can't work for the next hour or so
[22:04:07] yup
[22:04:09] I gotta cook dinner
[22:04:11] lol
[22:04:12] okay.
[22:04:16] yeah, it's been a long time :)
[22:04:28] kraigparkinson: i created a few cards around improvements in monitoring packetloss
[22:04:34] also, drdee ^^
[22:04:38] questions welcome
[22:04:39] i'm looking into that "what skin are active editors using" question
[22:04:49] okay.
[22:04:50] but we have to set aside some time tomorrow to talk, do you have that?
[22:04:59] i don't think that's really important...
[22:05:00] or tonight after 7:30 EST
[22:05:09] well, I ran out of stuff to do
[22:05:11] that's 4:30 PST, right?
[22:05:13] yes
[22:05:13] yes.
[22:05:21] then asap
[22:05:23] i'll still be here, then
[22:05:41] milimetric, you ran out? :( Let me look into my bag for something...
[22:06:06] otto isn't around, but i think locke is about to alert again.
[22:06:07] So #92 is blocked?
[22:06:11] well, I've been waiting to get some help on the dev environment, kraigparkinson
[22:06:19] just fyi as we might be getting some email soon :P
[22:06:25] so I tried to kill time in the meantime
[22:06:31] ah who do you need help from?
[22:06:36] kraigparkinson: otto and i
[22:06:39] er, or
[22:06:42] thanks dschoon
[22:06:58] we were caught up in the packetloss stuff (i'm done with it now.)
[22:07:37] milimetric, what's the problem you're experiencing? I want to add a card to track it.
[22:08:57] I'm not able to run pig scripts without breaking the rules as set forth by ops
[22:09:00] kraigparkinson: ^^
[22:09:07] i don't think that's true :)
[22:09:12] i'm free to help whenever
[22:09:43] kraigparkinson: i'll be sure to create a card after we talk if it warrants it
[22:09:52] i suspect it's just a missing line in his .ssh/config
[22:10:08] gracias :)
[22:21:22] [travis-ci] master/0992163 (#106 by dsc): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5700715
[22:24:50] i should fix that.
[22:30:51] i'd love it if it was a missing line in my ssh config, but so far 3 people have looked at it and i've spent a good deal of time on my own trying to figure out what's going on
[22:50:16] gross
[22:50:19] you back, milimetric?
[22:50:56] i'm around but still cooking / eating
[22:51:02] k
[22:51:02] 16:30 PST
[22:51:06] word :)
[22:51:27] drdee: besides https://mingle.corp.wikimedia.org/projects/analytics/cards/337 are there other cards not assigned to me that i need to know about?
[23:30:31] dschoon: back
[23:30:34] woo
[23:30:40] gimme one sec