[00:22:44] milimetric: you around for a quick q?
[00:31:25] ottomata are you around as well?
[00:49:39] New review: Ori.livneh; "I didn't attach a score to my previous review, but doing so now in case my comment slipped notice." [analytics/log2udp2] (master) C: -1; - https://gerrit.wikimedia.org/r/58449
[00:54:21] New patchset: Erosen; "adds json endpoints /cohorts/list/ and /jobs/list/ and corresponding fixtures/tests. adds default json serializability to all sqlalchemy mapped classes using the `cls` argument of the declarative_base() factory method." [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70359
[00:54:21] New patchset: Erosen; "merges job controller conflict" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70360
[13:40:21] moooorning
[15:11:36] Change merged: Erosen; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70359
[15:12:21] Change merged: Erosen; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70360
[15:29:06] New patchset: Erosen; "updates database design doc to include cohort_user table, which essentially captures permissions" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70430
[15:29:31] Change merged: Erosen; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70430
[15:38:09] !log, stopping udp2log-webrequest kraken producer instances on an03, an04, and an05, now using only an06 to produce the webrequest-wikipedia-mobile stream instead of sharding the stream with awk across 4 hosts
[15:41:10] 4tr3ll&
[16:08:27] tnegrin: are you giving us all your passwords!?
[16:08:44] yes -- radical transparency
[16:08:52] my client has some modality issues
[16:08:58] but yeah that was stupid
[16:09:00] awesome, i like it
[16:09:07] run wild my friend
[16:09:33] hey, q for you: got any rebalancing tips? it takes FOEVAAAHHH
[16:09:45] i could up the bandwidth or something, i'm just running with no args right now
[16:09:51] maybe the running jobs are interfering?
[16:09:55] welcome to big data
[16:10:21] my team tended to do blocks of 5 or so at a time
[16:10:28] but it always took a long time
[17:21:43] New patchset: Milimetric; "working job creation prototype" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70444
[17:22:17] Change merged: Milimetric; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70444
[18:11:45] erosen: i pm'ed you a hangout link
[18:32:49] erosen, drdee:
[18:32:49] http://stats.wikimedia.org/kraken-public/webrequest/mobile/zero/
[18:33:31] ottomata: that has data from june 1st - june 23rd?
[18:33:46] that is what i'm grabbing
[18:33:49] yes
[18:33:54] carrier is - 24th
[18:33:57] the 24th country job died
[18:34:01] drdee is looking into that?
[18:34:16] yes
[18:35:10] ottomata: drdee: http://gp.wmflabs.org/dashboards/orange-kenya
[18:35:15] not entirely encouraging
[18:36:01] haha
[18:36:10] thurs 13 it just drops?
[18:38:26] erosen: can you check with yurik if they have made changes to orange kenya cidr ranges?
[18:38:39] sure
[18:38:58] and can you make an additional couple of dashboards so we can see whether this is a carrier specific issue or a general trend
[18:39:04] yeah they are all created
[18:39:09] I'm formatting an e-mail with all the urls now
[18:39:25] k
[18:43:33] drdee: ottomata: just sent e-mail with urls
[18:44:13] cool
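(A side note on Erosen's 70359 patchset from earlier in the log: the "default json serializability to all sqlalchemy mapped classes" it describes uses the `cls` argument of `declarative_base()`. A minimal sketch of that pattern; the class and field names here are illustrative, not the actual wikimetrics code:)

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import class_mapper

    class JsonableBase(object):
        """Mixin handed to declarative_base(cls=...) so every mapped
        class gets a default JSON-friendly representation."""

        def to_dict(self):
            # Serialize mapped columns only; relationships are skipped.
            return {c.key: getattr(self, c.key)
                    for c in class_mapper(self.__class__).columns}

    Base = declarative_base(cls=JsonableBase)

    class Cohort(Base):
        # Hypothetical model; the real wikimetrics Cohort has more fields.
        __tablename__ = 'cohort'
        id = Column(Integer, primary_key=True)
        name = Column(String(50))

(A /cohorts/list/ style endpoint can then return json.dumps([c.to_dict() for c in cohorts]) without per-model serialization code.)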
[18:44:19] hey erosen, ok, I'm ready to talk anytime
[18:44:21] you free?
[18:44:44] milimetric: yeah-ish
[18:44:49] just finished up some zero stuff
[18:44:59] ok, let me know, i'll read up on WTForms
[18:45:10] ottomata: drdee: other graphs confirm that there seems to be an issue with all of the dashboards
[18:45:21] so I don't think it is a CIDR range issue
[18:45:24] milimetric: want to hangout?
[18:45:53] mmmmmm can we check with W0 team? we did not change anything on our side
[18:46:01] regarding X-CS parsing
[18:46:30] drdee: sure
[18:48:07] hmmMm
[18:48:17] yeah it is strange that in the middle of this dataset it would all change
[18:48:23] if it was all low it would make more sense
[18:48:34] that we did something wrong
[18:52:04] ok, i need to brainbounce this deduplicate thing too
[18:52:09] who's hanging out?
[19:12:26] erosen, you around to brain bounce about deduplication?
[19:12:37] yeah, sort of
[19:12:42] in a hangout with dan, but can take a break
[19:13:34] wheres your hangout i'll just come bug you for a sec
[19:13:42] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc
[19:14:15] can you invite me
[19:14:32] come on iiiin
[19:28:29] milimetric: want to hangout again?
[19:28:43] yes, i think i got good progress
[19:30:00] New patchset: JGonera; "Add unique uploaders per day graph" [analytics/limn-mobile-data] (master) - https://gerrit.wikimedia.org/r/70466
[19:30:12] YuviPanda, ^
[19:31:04] jgonera: did you mean to commit the changes to the datasources json files' urls?
[19:31:16] YuviPanda, does it matter?
[19:31:20] oh wait
[19:31:26] there's one thing wrong there
[19:31:31] local urls instead of remote
[19:31:36] right
[19:31:38] yeah
[19:31:39] thanks
[19:31:48] we need to do something to make local testing easier...
[19:31:59] yup, unsure what exactly.
[19:32:02] right now I just have two bash scripts which replace them using sed...
[19:32:04] plus I still don't have limn setup locally :(
[19:32:16] you still have problems setting it up?
[19:32:35] jgonera: I had last time, and I didn't need to after that...
[19:32:48] jgonera: hmm, also now that I am using vagrant, maybe I should setup a puppet role for it
[19:33:00] New patchset: JGonera; "Add unique uploaders per day graph" [analytics/limn-mobile-data] (master) - https://gerrit.wikimedia.org/r/70466
[19:33:03] jgonera: will try later, though. Doing other stuff now.
[19:33:13] sure
[19:33:18] pushed a new patchset
[19:33:21] jgonera: yeah, looking
[19:33:41] jgonera: looks good to me, I can do a C+2
[19:33:44] do verify and push :)
[19:33:47] ok
[19:34:40] New review: Yuvipanda; "lgtm, but someone else needs to V+2" [analytics/limn-mobile-data] (master) C: 2; - https://gerrit.wikimedia.org/r/70466
[19:34:45] jgonera: ^
[19:35:00] you can't merge?
[19:35:36] jgonera: I can, but I can't really test it.
[19:35:52] jgonera: if you've tested it and it looks good to you, V+2 and self merge?
[19:35:56] ok, I tested it, merged ;)
[19:36:07] jgonera: :)
[19:36:08] ty
[19:37:29] New review: coren; "Noted. Patch incoming." [analytics/log2udp2] (master) - https://gerrit.wikimedia.org/r/58449
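(Re milimetric's plan above to read up on WTForms for the wikimetrics forms: a minimal illustrative WTForms form. The field names are assumptions for the sake of example, not the actual "bytes added" form that lands later in the log:)

    from wtforms import Form, IntegerField, StringField, validators

    class BytesAddedForm(Form):
        # Hypothetical fields; the real wikimetrics form may differ.
        start_date = StringField('Start date', [validators.DataRequired()])
        end_date = StringField('End date', [validators.DataRequired()])
        namespace = IntegerField('Namespace', default=0)

    # Typical request handling: bind the POST data, validate, read form.data.
    # form = BytesAddedForm(request.form)
    # if form.validate():
    #     run_metric(**form.data)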
[19:38:22] HUH!
[19:38:24] erosen
[19:38:37] i just compared that data for those days in the webrequest-mobile data
[19:38:46] (geocoded and anonymized)
[19:39:01] -- 2013-06-08 165,364,765
[19:39:02] -- 2013-06-15 164,807,428
[19:40:07] hmmm
[19:41:05] New patchset: Milimetric; "starting wtforms implementation" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70467
[19:41:16] Change merged: Milimetric; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70467
[19:42:07] Hmmm even more weird:
[19:42:08] http://ganglia.wikimedia.org/latest/index.php?r=month&hreg[]=analytics100%5B3456%5D.eqiad.wmnet&mreg[]=kafka_producer_KafkaProducerStats-.%2B.ProduceRequestsPerSecond&gtype=stack&title=kafka_producer_KafkaProducerStats-.%2B.ProduceRequestsPerSecond&aggregate=1&dg=1&tab=m
[19:43:00] i think something is up with the produce requests jmx stat reporting
[19:43:20] this morning?
[19:43:29] it looks like it's been hinky for a couple of weeks
[19:43:30] it looks to me like if the producer dies, jmxtrans or ganglia or something just keeps reporting the previous value
[19:43:32] yeah
[19:43:44] but that hinkyness corresponds to data weirdness too
[19:43:44] and
[19:43:47] indeed
[19:43:55] YuviPanda, do you remember the fab command for deploying to our dashboard? I have to write it down... keep forgetting
[19:43:58] we (up until this morning) were sharding the stream across all 4 of these hosts
[19:44:08] it looks like in the data from these hosts
[19:44:12] we are missing about 1/2 of requests
[19:44:16] jgonera: bash history says 'fab mobile deploy.only_data'
[19:44:20] which would also corroborate my suspicion
[19:44:32] that the udp2log producers on 2 out of 4 machines weren't working
[19:44:39] jgonera: deploying to _dev is 'fab mobile_dev deploy.only_data'
[19:44:41] is this regarding wikipedia zero?
[19:44:46] yes
[19:44:46] that would make sense
[19:44:51] also
[19:45:10] there was also a code deployment of june 12th by the zero team that precipitates the drop
[19:45:18] we are not missing data in the webrequest-mobile stream (webrequest-mobile is from a single udp2log producer, geocoded and anonymized)
[19:45:44] something is def wrong with the monitoring though:
[19:45:45] btw milimetric, got this when trying to do the walkthrough version of fab: https://gist.github.com/jgonera/8d4858473ad272234d86
[19:45:45] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=analytics1004.eqiad.wmnet&c=Analytics+cluster+eqiad&m=kafka_producer_KafkaProducerStats-webrequest-wikipedia-mobile.ProduceRequestsPerSecond
[19:46:00] i shut down that udp2log kafka producer & jmxtrans instance this morning
[19:46:00] taking a look jgonera
[19:46:08] and ganglia still has a positive value for produce reqs / sec
[19:46:13] YuviPanda, got it names changed though: fab mobile_reportcard_dev deploy.only_data
[19:46:15] thanks
[19:46:17] that's odd as well
[19:46:25] what was the command you used jgonera?
[19:46:31] milimetric, fab
[19:46:44] New review: coren; "(1 comment)" [analytics/log2udp2] (master) - https://gerrit.wikimedia.org/r/58449
[19:46:55] well that would've certainly errored as fab expects a stage :)
[19:47:07] like "fab dev_reportcard deploy.only_data"
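(Re the 'fab <stage> deploy.only_data' invocations above: Fabric resolves each command-line argument to a task, so a "stage" is just a task that populates `env` before the deploy task runs. A sketch of that pattern; the hostnames and settings are assumptions, not the real limn-deploy fabfile:)

    from fabric.api import env, task

    @task
    def mobile():
        """Stage task: the production mobile reportcard."""
        env.hosts = ['mobile-reportcard.wmflabs.org']  # assumed host
        env.target_dir = '/var/lib/limn/mobile'        # assumed path

    @task
    def mobile_reportcard_dev():
        """Stage task: the dev instance, for testing before a real deploy."""
        env.hosts = ['mobile-reportcard-dev.wmflabs.org']
        env.target_dir = '/var/lib/limn/mobile'

(Running plain `fab` with no stage task then fails because the deploy tasks read `env` values that only a stage sets up, consistent with "fab expects a stage" below.)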
[19:47:08] tnegrin, drdee, want to hangout so I can elaborate more? (i imagine there are lots of moving pieces to keep track of for tnegrin)
[19:47:16] milimetric, I updated the girst
[19:47:16] y
[19:47:18] gist
[19:47:18] k
[19:47:22] it asked me for a stage before
[19:47:43] right, so you need to give it a stage and a command
[19:47:45] k tnegrin, drdee:
[19:47:45] https://plus.google.com/hangouts/_/446685b3720fffc8cfcd115b7a058f019f4cad05
[19:47:46] i'm here
[19:47:49] others are in batcave
[19:47:49] so which stage are you trying to deploy to
[19:48:22] Change merged: JGonera; [analytics/limn-mobile-data] (master) - https://gerrit.wikimedia.org/r/70466
[19:48:30] oh, you mean like it used to interactively ask you for a stage?
[19:48:35] yes
[19:48:41] oh! I had no idea
[19:48:57] yeah, i think david must've messed with that
[19:49:12] or i broke it accidentally but this is not a showstopper right?
[19:53:14] jgonera: ^
[20:02:55] New patchset: coren; "New implementation of log2udp" [analytics/log2udp2] (master) - https://gerrit.wikimedia.org/r/70531
[20:03:47] Change abandoned: coren; "... how did that happen?" [analytics/log2udp2] (master) - https://gerrit.wikimedia.org/r/70531
[20:06:17] milimetric, not a showstopper, I just thought I had this last resort option if I forget the right command ;)
[20:06:36] oh but it lists the stages still
[20:07:07] you're welcome to fix it btw, i'm just working on something else atm
[20:09:00] New patchset: coren; "New implementation of log2udp" [analytics/log2udp2] (master) - https://gerrit.wikimedia.org/r/58449
[20:15:52] milimetric, well, me too ;) I just have one more question: should I add datafiles (generated csv) to the repo when adding new graphs? I deployed to http://mobile-reportcard-dev.wmflabs.org/ but when I go to Uploads daily, the first graph is empty
[20:16:28] if the csvs are served remotely it shouldn't be necessary
[20:16:33] i'll take a look
[20:17:19] yeah jgonera, it's because this file's not found: http://stat1001.wikimedia.org/limn-public-data/mobile/datafiles/unique-uploaders-day.csv
[20:17:33] so it must have not been created or copied to stat1001 yet
[20:18:09] it'd be nice if that error was displayed on the page... but for now you can open up the JS console and see it logged there
[20:18:22] so should I put a file in the repo (which will not be up to date tomorrow onwards) or just wait for the generate script to run?
[20:18:42] the generate script should run every 30 minutes
[20:18:48] but yeah... that's kind of a pain
[20:19:14] I'd wait on it before pushing to the remote repository
[20:19:55] like, make sure it works on your local instance (wait until generate runs), then commit and push your changes to limn-mobile-data
[20:20:54] in general, this way of creating ad-hoc datasets for limn (or anything else) is kind of crappy
[20:20:59] we have to figure out a better way sometime
[20:23:44] well, it won't run before I commit and merge, so I assume you just meant I should wait with the deployment
[20:23:54] that's fine for now, that's why I deployed to _dev first
[20:31:01] drdee, do you think there is a better way to extract country from geocoded remote_addr than this?
[20:31:03] FLATTEN(EXTRACT(remote_addr, '(.+)\\|(\\w+)')) AS (ip_address:chararray, country:chararray),
[20:31:11] DEFINE EXTRACT org.apache.pig.builtin.REGEX_EXTRACT_ALL();
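(The same '(.+)\|(\w+)' pattern, cross-checked in Python for readers who want to see what it does to a geocoded remote_addr; the Pig EXTRACT above is what actually runs on the cluster:)

    import re

    # Same regex as the Pig snippet: everything before the pipe is the
    # (anonymized) IP, the trailing word is the country code.
    GEOCODED = re.compile(r'(.+)\|(\w+)')

    def split_remote_addr(remote_addr):
        """Return (ip_address, country) from a value like '10.0.0.1|GB'."""
        match = GEOCODED.match(remote_addr)
        if match is None:
            # e.g. '|GB', the malformed value in the stack trace further down
            return None, None
        return match.group(1), match.group(2)

    assert split_remote_addr('10.0.0.1|GB') == ('10.0.0.1', 'GB')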
[20:31:56] that's how i would do it; just test it using ILLUSTRATE in grunt
[20:32:03] yeah i'm printing out data
[20:32:12] actually, i see we have org.apache.pig.piggybank.evaluation.string.RegexExtract
[20:32:13] kool
[20:32:15] which looks a little simpler
[20:32:23] then I can specify which capture i want
[20:32:25] piggybank is great to use as well
[20:32:25] but ok coo
[20:35:50] New patchset: Milimetric; "working bytes added form" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70534
[20:36:43] Change merged: Milimetric; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70534
[20:49:52] drdee, does ZeroFilterFunc() check the remote_addr like PageViewFilterFunc() does?
[20:50:05] i noticed ZeroFilterFunc() isn't erroring out on remote_addr with country code in it
[20:50:09] but PageViewFilterFunc() does
[20:50:34] Could not parse [|GB]
[20:50:34] …
[20:50:34] at org.wikimedia.analytics.kraken.pageview.CidrFilter.ipAddressFallsInRange(CidrFilter.java:72)
[20:50:34] at org.wikimedia.analytics.kraken.pageview.Pageview.isPageview(Pageview.java:183)
[20:50:34] ...
[20:51:03] also, if remote_addrs are anonymized, then we won't be able to detect internal IPs here
[20:51:09] (which is I think what you are doing, right?)
[20:51:18] right, that's exactly what's happening
[20:51:25] PageViewFilterFunc filters out internal IP addresses
[20:51:35] right, but at this point in this dataset, those have been anonymized
[20:51:42] Zero does not do that because we check for X-CS
[20:51:56] aye ok, makes sense
[20:52:13] right you should keep the geoinfo in a separate variable
[20:52:23] and not run the geocoding UDF
[20:53:11] right, i'm not
[20:53:18] cool
[20:53:25] i can make this work, it works for carrier
[20:53:26] but.
[20:53:39] the filtering for internal IPs is going to through us off
[20:53:41] throw*
[20:53:52] why?
[20:53:54] because remote_addr is random
[20:53:56] when anonymized
[20:54:08] but the ranges are very small
[20:54:10] its just a random number converted to IPaddress format
[20:54:16] and so yes it will discard some requests
[20:54:20] it could convert a good IP into an internal
[20:54:21] but that should not be significant
[20:54:23] and an internal into a good IP
[20:54:43] on the aggregate, it should barely matter
[20:54:58] probably not, except that it's not totally random, it is hashed
[20:55:06] so the same ip will always be the same aggregate IP
[20:55:15] if there is an IP on either side that generates a large number of requests
[20:55:20] that IP count could be discounted
[20:55:21] right
[20:55:37] but ok, i can make this work as is
[20:55:53] if we see a big drop then we can investigate how prevalent this problem is
[21:02:56] drdee -- can you walk me through creating a mingle card?
[21:03:34] sure
[21:03:53] https://plus.google.com/hangouts/_/3453328f33e41aa822ca576cea11fdbfc12d6dcc
[21:17:52] Hey drdee
[21:17:56] is deducing log lines a dependency for the zero graphs?
[21:18:00] dedeuping*
[21:18:06] deduping
[21:18:07] I am in the above hangout
[21:18:35] tnegrin: yes it is a dependency for the zero graphs
[21:28:57] it is a dependency for the historical zero graphs
[21:31:40] drdee
[21:31:48] yooo
[21:31:51] why the extra hardcoded '-' field in the zero_country output?
[21:31:54] country_count = FOREACH (GROUP log_fields BY (date_bucket, language, project, site_version, country))
[21:31:54] GENERATE FLATTEN($0), '-', COUNT($1) AS num:int;
[21:32:18] ottomata: it was so that the country and carrier files have the same schema
[21:32:22] yup
[21:33:39] ah that would be the carrier
[21:33:40] ok cool
[21:36:21] does anyone know how I get a bugzilla account?
[21:38:42] tnegrin: you can sign up for one
[21:38:53] ok -- doing that now
[21:38:55] thx
[21:39:05] tnegrin: https://bugzilla.wikimedia.org/createaccount.cgi
[21:42:27] thanks - I'm in
[21:44:58] tnegrin: this is the one I created today
[21:44:58] https://bugzilla.wikimedia.org/show_bug.cgi?id=50195
[21:45:08] pretty bad bug report, but I know what to do :p
[21:45:16] thanks -- I already checked :)
[21:45:18] New patchset: Erosen; "adds working version of CohortUser model" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70549
[21:45:20] it looks fine
[21:45:56] Change merged: Erosen; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70549
[21:47:52] ottomata: it looks good to me too
[22:08:30] hokey dokey, drdee, tnegrin, erosen
[22:08:40] sup otto?
[22:08:43] jobs are a running with the pre-geocoded dataset
[22:08:47] that shouldn't be missing data
[22:08:50] /wmf/public/webrequest/mobile/zero
[22:08:55] only a couple days there right now
[22:08:57] cool!
[22:09:01] should catch up in a few hours
[22:09:03] awesome
[22:09:11] and also be synced to stat.w
[22:09:13] stats.w
[22:09:14] since we're all here
[22:09:28] amit expects to get data back to Feb 1 -- can we do this?
[22:09:38] HMmm
[22:09:38] yes
[22:09:45] with a few days missing in may
[22:09:47] but yes
[22:09:54] because of the dups?
[22:10:00] before the dups we missed a few days
[22:10:08] the dups were caused by fixing the missing data
[22:10:11] delivery issues?
[22:10:12] but ja
[22:10:13] yes
[22:10:20] network ACL + multicast issues
[22:10:24] ok
[22:10:28] but
[22:10:39] ok -- we can live with that
[22:10:42] we can run the backfill on the webrequest-wikipedia-mobile dataset
[22:10:47] using the old scripts
[22:10:57] and keep the new stuff running using the new scripts
[22:10:57] old scripts?
[22:11:02] today
[22:11:11] i modified zero_*.pig scripts
[22:11:21] are these scripts in version control btw?
[22:11:26] so that they can use the geocoded and anonymized dataset
[22:11:27] yes
[22:11:35] great
[22:11:41] https://github.com/wikimedia/kraken/tree/master/pig
[22:12:42] ok -- so the only outstanding code issue is the problem with the duplicate log lines in May, right?
[22:12:48] yes
[22:12:53] i'll get back into that tomorrow
[22:13:11] if I can figure that out and get that all deduplicated properly
[22:13:16] then we can just start the backfill and let it go
[22:13:23] or
[22:13:24] this wasn't communicated well to you but Amit expects the graphs tomorrow.
[22:13:27] great. I'll implement the 30-day normalization and get the dashboard deploy working smoothly so that I can use as much data as we have by tomorrow
[22:13:34] we could start the backfill now
[22:13:39] and just skip may
[22:13:41] for now
[22:13:49] how much of may is impacted?
[22:14:07] i think about 12 days total, 6ish days missing, 6ish days duplicated
[22:14:18] give or take a day
[22:14:33] do you have a more precise date range?
I'd like to let Amit know what to expect
[22:14:57] hm, yeshhhh
[22:17:59] ok, it looks like
[22:18:40] missing data for 05-18-05-21, duplicate data for 05-23- 05-29
[22:18:56] this is from me inspecting the kafka produce request / sec ganglia data
[22:19:02] there are more accurate ways
[22:19:06] to find out
[22:19:24] sorry
[22:19:46] missing: 05-18 — 05-22
[22:19:46] duplicate: 05-23 — 05-29
[22:20:08] those don't fall on exact day/hour boundaries, but those are the affected days
[22:20:50] the missing days are gone forever, but the dups will eventually get fixed
[22:20:53] right?
[22:21:11] yes
[22:21:35] tnegrin: we communicated with amit about the dataloss in may, i just forwarded you an email
[22:21:50] no problem -- what did you agree to do?
[22:22:18] I think we should go ahead and include the daily may data except for the days with dup problems
[22:22:41] erosen: what will that do to the monthly numbers for may?
[22:22:43] deduplicate the data
[22:22:57] * erosen reading backlog
[22:23:21] drdee: can we do this by tomorrow?
[22:23:22] it will lower the monthly count
[22:23:44] can you normalize to 30 days to cheet?
[22:23:48] cheat
[22:23:53] tnegrin: we could do some fancy normalization which checks where there is data available
[22:24:04] yeah, I actually used to have something like that in there
[22:24:21] ok, i gotta run soon, but before I go, i will start backfills for feb-april
[22:24:25] and deal with may tomorrow
[22:24:28] at least we'll have that bit done
[22:24:30] or replace 23-29 may with missing observations
[22:24:51] tnegrin: I would prefer not to agree to dealing with the missing data in an especially graceful way by tomorrow
[22:24:56] it is something we can do
[22:25:09] i think i agree
[22:25:13] but it won't break anything at the moment
[22:25:30] ok -- I'm down with that -- so Amit will get Feb - June data with some issues in May
[22:25:40] which we will fix at some point
[22:26:10] agree everyone?
[22:26:16] how about no may by tomorrow
[22:26:27] we'll get may asap
[22:26:40] ottomata: do you mean no data for may whatsoever?
[22:26:47] for tomorrow
[22:27:09] not even 5/1 - 5/17 which we have?
[22:27:25] yarrrrrrrr ok ok ok yeah ok,
[22:27:28] yeah ok
[22:27:30] hehe
[22:27:37] sounds good
[22:27:38] it's been nice to deal with month boundaries for simplicity…but yea
[22:27:39] ok
[22:27:39] ok
[22:27:41] thank you andrew :)
[22:27:45] how about may 16
[22:27:50] 1-16
[22:27:54] i want to keep away from the boundary for now
[22:28:03] til we fix and understand everything there
[22:28:03] sure
[22:28:05] makes sense
[22:28:06] that way we can be sure the days we're giving are complete
[22:28:13] seems reasonable
[22:28:27] I'll let Amit know
[22:28:32] thanks folks
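(Re the 30-day normalization discussed above: the simple version is just scaling a partially observed month onto a 30-day basis. A sketch of that idea; the real wikimetrics/dashboard logic may differ:)

    def normalize_to_30_days(total_count, days_with_data):
        """Project a count observed over days_with_data days onto 30 days."""
        if days_with_data == 0:
            return None  # no observations; leave the data point missing
        return total_count * 30.0 / days_with_data

    # Per the agreement above, May would use only the complete days 5/1-5/16:
    # may_estimate = normalize_to_30_days(count_for_may_1_to_16, 16)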
[22:29:46] drdee
[22:30:05] looking at the previous carrier script
[22:30:17] https://gist.github.com/ottomata/5863055
[22:30:21] oh wait
[22:30:33] there:
[22:30:33] https://gist.github.com/ottomata/5863055
[22:30:45] do you mean to STORE carrier_count
[22:30:46] or count
[22:30:47] and
[22:31:01] carrier_count isn't used
[22:31:08] one more question -- we are updating the zero data daily?
[22:31:19] yes
[22:31:21] well
[22:31:23] the dashboard is on a cron job for now
[22:31:26] the data yes, not sure about the dashboards
[22:31:50] cron is ok for now -- will it work?
[22:32:28] yeah it will work
[22:32:41] cool
[22:32:51] it's not managed in a particularly central way, but I have another dashboard on the same setup
[22:33:42] agree -- but cron is pretty reliable and understood
[22:33:53] we might want to have a common config or something later
[22:34:13] ya
[22:35:05] there have been various discussions of centralizing all the various dashboards, but so far it seems like everyone likes to manage them separately
[22:37:49] seems like making sure the data is there is more of an issue today...
[22:38:50] indeed
[22:45:46] ottomata: carrier_count
[22:46:46] ja?
[22:47:08] use that
[22:47:09] ok, so count_group is not used
[22:49:42] but drdee
[22:49:47] you don't use zero_fields then?
[22:50:01] where are those lines from? that does not match the pig script in github
[22:50:28] i have a new one in github, that's the orig
[22:51:00] https://github.com/wikimedia/kraken/blob/47cefc626da11d3b69523e9b6d8125349ea1817d/pig/zero_carrier.pig
[22:51:19] ah, i think you want count
[22:51:20] right?
[22:51:26] carrier_count isn't used in what we have been saving
[22:51:50] oh no
[22:51:52] you are storing count
[22:52:07] that makes sense
[22:52:07] ok ok
[22:52:08] sorry
[22:52:08] yeah
[22:52:12] ok
[22:58:35] ok, gonna run, started the backfills, we'll see if they work :/
[22:58:40] i'll check on them later
[23:21:58] New patchset: Erosen; "adds cohorts/detail/, cohort/list (with permissions enforced), and metrics/list endpoints" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70563
[23:22:22] Change merged: Erosen; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/70563
[23:22:28] milimetric: you around?
[23:22:49] yep
[23:22:59] what's up erosen
[23:23:18] just pushed the json endpoints which probably work
[23:23:25] awesome
[23:23:45] what model defines the relationship between User and Cohort
[23:23:52] CohortUser
[23:24:02] feel free to change the name
[23:24:14] I was just going for parallelism with CohortWikiUser
[23:26:31] no it's great, couldn't find it...
[23:26:38] is it not in the repo?
[23:26:42] oh, it's 'cause i was dumb
[23:26:44] it's there
[23:27:04] awesome, thanks man, I'll mess with this and fix anything that needs fixing. I'll write an email by the end of the day today with any notes for the demo
[23:27:19] awesome
[23:27:28] essentially you'll be hitting /demo/create/cohorts/erosen@wikimedia.org/true
[23:27:32] or something like that
[23:27:40] then just /jobs/create and you should be all set
[23:27:44] i would offer to fix anything leftover on the frontend, but I worry I might not be so useful
[23:27:51] oh no feel free
[23:28:05] i hope to leave it in a somewhat working state :)
[23:28:09] hehe
[23:28:17] as much as anyone can hope for, really
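(On the deduplication that blocks the historical zero graphs: the real fix presumably happens on the cluster, but the core idea is to keep each distinct log line exactly once. A minimal single-machine sketch:)

    import hashlib
    import sys

    def dedupe(lines):
        """Yield each distinct log line once, keyed by an MD5 digest.

        An in-memory set is fine for a sketch; a real pass over weeks of
        webrequest logs would need a distributed equivalent, e.g. a
        GROUP BY over whole lines in Pig.
        """
        seen = set()
        for line in lines:
            digest = hashlib.md5(line.encode('utf-8')).digest()
            if digest not in seen:
                seen.add(digest)
                yield line

    if __name__ == '__main__':
        for line in dedupe(sys.stdin):
            sys.stdout.write(line)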