[07:45:19] Analytics / Wikimetrics: Cannot edit or delete cohorts - https://bugzilla.wikimedia.org/67664 (Rahmanuddin Shaik) UNCO p:Unprio s:normal a:None Hi, I cannot edit a cohort, adding more usernames and removing irrelevant usernames (those found invalid, etc) is a good feature. Also, I cannot d...
[08:50:19] Analytics / General/Unknown: page view statistics for Wikinews seem to be wrong - https://bugzilla.wikimedia.org/67411#c9 (christian) NEW>RESO/FIX Ssl requests get fed into webstatscollector again.
[10:13:15] (CR) Nuria: [C: 2] "I have tested these changes on eswiki on dev populating about a year back of reports, things are working well and the throttle has worked " [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/142007 (https://bugzilla.wikimedia.org/66841) (owner: Milimetric)
[10:13:25] (Merged) jenkins-bot: Remove limit on recurrent, add throttling [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/142007 (https://bugzilla.wikimedia.org/66841) (owner: Milimetric)
[10:13:28] yay
[10:49:03] Analytics / Refinery: Story: Admin has versioned and sync'ed files in HDFS - https://bugzilla.wikimedia.org/67129 (christian) a:christian
[11:35:03] (PS1) QChris: Add repository description [analytics/refinery] - https://gerrit.wikimedia.org/r/144676
[11:35:05] (PS1) QChris: Add basic deployment script [analytics/refinery] - https://gerrit.wikimedia.org/r/144677 (https://bugzilla.wikimedia.org/67129)
[11:49:34] (PS1) Yurik: Initial checkin - some data sanitizing code [analytics/zero-sms] - https://gerrit.wikimedia.org/r/144682
[12:27:28] nuria: http://test-reportcard.wmflabs.org/graphs/newly_registered
[12:27:29] :(
[12:27:50] it happens when I run more than one recurrent report at a time
[12:28:05] I'm thinking one run doesn't get a chance to finish before the next one starts, so I'm testing that theory now
[12:33:48] oh oops, pasted wrong thing
[12:33:54] http://pastebin.com/0fXJyq5D
[12:51:20] (CR) QChris: Add basic deployment script (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/144677 (https://bugzilla.wikimedia.org/67129) (owner: QChris)
[12:55:46] ok, figured it out
[12:55:48] 2 problems
[12:56:15] 1. pickle bug / limitation that causes the recursion limit to get hit for long chains of subtasks (this will happen if we try to backfill too much)
[12:57:13] 2. overlapping runs cause index violations when they try to insert, harmless problem really
[12:57:39] Harr. Why cannot things be easy?
[12:59:33] :)
[12:59:55] problem 2. is really a non-issue because we're running these things at a 2 minute increment just for testing / to generate data faster
[13:00:06] in real life it'll run nightly and the good news is that it finishes really fast
[13:00:23] But pickle got in the way before. csalvia spent some time fixing issues around it :-(
[13:00:25] so we should never overlap. Then, even if we do, that index actually protects us from any bad consequence
[13:00:35] yeah, the pickle issue is a bitch
[13:00:42] there's a work-around: sys.setrecursionlimit(10000)
[13:00:45] but... yeah :D
[13:00:47] :-)
[13:00:56] feels *very* wrong
[13:01:05] I guess if the problem occurs again .... sys.setrecursionlimit(20000)
[13:01:17] haha
[13:01:24] we could switch away from pickle too
[13:01:31] might not happen with other serializers
[13:01:44] and I think meanwhile we fixed the serialization problem that was preventing that switch (or we should anyway)
[13:01:51] really really time for that tech-debt sprint
[13:02:08] Sure.
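The pickle problem above — long chains of subtasks blowing past Python's default recursion limit when a large backfill gets serialized — can be illustrated with a minimal sketch. This is not wikimetrics code: the Task class and chain depth are made up for illustration, and the sys.setrecursionlimit(10000) workaround is the one quoted in the channel.

# Hypothetical stand-in for a long chain of linked subtasks; each object
# references the next, so pickling recurses at least once per link.
import pickle
import sys

class Task(object):
    def __init__(self, name, child=None):
        self.name = name
        self.child = child

# Build a chain roughly as deep as a large backfill might produce.
chain = None
for i in range(1000):
    chain = Task('run-%d' % i, chain)

try:
    pickle.dumps(chain)
except RuntimeError as e:  # RecursionError on newer Pythons
    print('pickle failed: %s' % e)

# The workaround mentioned above: raise the limit and retry.
sys.setrecursionlimit(10000)
data = pickle.dumps(chain)
print('pickled %d bytes after raising the recursion limit' % len(data))

As the channel notes, raising the limit only postpones the problem; other serializers might not hit the same limitation, but that is not guaranteed.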
:-)
[13:02:59] heh, amusingly, I just ran 5 projects at the same time, backfilling for 2 years
[13:03:05] it works but in a funny way
[13:03:24] 1st run - creates pending reports for 1st project (doesn't finish yet)
[13:03:44] 2nd run - integrity error creating pending reports for 1st project, goes onto second project (doesn't finish)
[13:03:57] 1st run - gets to 2nd project, but 2nd run beat it there, so it goes to 3rd project
[13:04:13] so they stagger... sort of, and then 3rd run starts and it's ALL confusing at this point
[13:04:26] but basically, it all works out as long as recursion limit is high
[13:04:37] Hahaha :-D
[13:06:38] resilience through ignorance!
[13:21:25] well, massive explosions and everything, wikimetrics just ran 2770 reports in about 20 minutes
[13:23:29] \(^_^)/
[13:47:55] milimetric, problem 2) we already knew about, we talked about it the other day
[13:49:11] milimetric: part of the problem that was preventing the pickle switch is fixed, yes
[13:49:39] it is a matter of not serializing the whole world to the queue
[15:51:47] kevinator: can you share the slide deck?
[15:51:55] I would like to add a couple more slides
[15:52:13] https://docs.google.com/a/wikimedia.org/presentation/d/1Y2uI_oOhXGpcn8y-EHBAAqzxS2Lp5OFEXIkIkd6J6A0/edit#slide=id.ge048dd32_142
[15:52:39] I created a copy with the same permissions. I guess it didn’t notify you
[15:59:03] nuria: did you see the link I posted?
[15:59:19] in a meeting now, will look
[16:07:27] whoops, I forgot to sign on
[16:07:32] whoops, I forgot to sign on
[16:07:34] kevinator: I'm around now if you pinged me
[16:12:04] I posted the link above… can you have a look at it?
[16:13:31] I wonder if I need to record somewhere how many points were closed this sprint (45). If a story gets closed later does Scrumbugz know not to update the points remaining in an old sprint?
[16:14:12] milimetric: i was talking to you
[16:14:23] right
[16:14:28] I wasn't logged on so can you re-post?
[16:14:59] I'm not 100% sure but I think Scrumbugz updates old sprints. However, if we don't close stories we should move them to the next sprint
[16:15:39] https://docs.google.com/presentation/d/1Y2uI_oOhXGpcn8y-EHBAAqzxS2Lp5OFEXIkIkd6J6A0/edit?usp=sharing
[16:16:22] IIRC we are only going to showcase schema counts in Graphite and Newly Registered User
[16:26:27] kevinator: the 34 points / developer / sprint is without interruptions
[16:26:43] so that's kind of the "ideal"
[16:27:22] but we're still working on calibrating, so not sure if it's worth it to mention points. Just basically: "we've adjusted the points so that we have a broader range, comparing to previous sprints isn't too relevant"
[16:27:46] and yes, demo-wise, just those two
[16:28:03] if you clean up slide 5 (stray bullets), looks good
[16:28:23] a LOT easier since Scrumbugz, eh?!
[16:30:50] yes… a lot easier
[16:31:00] I removed the reference to 34 points
[16:31:22] The stray bullets were to write the names of presenters.
[16:31:33] is nuria presenting schema counts in graphite?
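The staggered, overlapping runs described above (around 13:03) stay harmless because of the unique index on the pending-report rows: a second run's duplicate insert fails with an integrity error and nothing gets double-counted. A rough sketch of that pattern follows, using a hypothetical SQLAlchemy model rather than the real wikimetrics schema.

# Hypothetical model (not the wikimetrics schema): a unique constraint on
# (recurrent_parent_id, report_day) makes creating pending reports idempotent,
# so an overlapping run that loses the race just skips that report.
import datetime
from sqlalchemy import Column, Date, Integer, UniqueConstraint, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class PendingReport(Base):
    __tablename__ = 'pending_report'
    id = Column(Integer, primary_key=True)
    recurrent_parent_id = Column(Integer, nullable=False)
    report_day = Column(Date, nullable=False)
    __table_args__ = (
        UniqueConstraint('recurrent_parent_id', 'report_day', name='uix_parent_day'),
    )

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def create_pending(parent_id, day):
    """Insert one pending report; ignore duplicates from an overlapping run."""
    session = Session()
    try:
        session.add(PendingReport(recurrent_parent_id=parent_id, report_day=day))
        session.commit()
        return True
    except IntegrityError:
        session.rollback()  # another run already created this report
        return False
    finally:
        session.close()

print(create_pending(1, datetime.date(2014, 7, 8)))  # True: first run creates it
print(create_pending(1, datetime.date(2014, 7, 8)))  # False: overlapping run skips it

In effect the database index acts as the lock, which is why the overlapping runs above merely stagger instead of corrupting data.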
[16:32:29] kevinator: done with meeting, let me include a couple slides
[16:32:53] thanks
[16:43:39] kevinator: can you give edit permissions to all users
[16:43:45] please
[16:44:46] done
[16:47:44] kevinator: i have added two slides; since what we have to show regarding counts is very little, you can share the deck while i talk briefly
[16:48:00] ok
[16:48:20] Kevinator: just run through the whole presentation fast on your laptop now so all slides and images are cached
[16:48:48] done… i got them
[16:49:58] I’m on my way to conference room now
[17:22:04] hi, I have made this wikipedia ranking https://tools.wmflabs.org/ptwikis/Wikirank , but I want to make it more precise by counting characters instead of bytes, is there any way to count characters in articles? a bytes per character coefficient for each language perhaps?
[17:25:43] danilo_, without extracting the dumps or poking the API constantly?
[17:25:53] hrm. Probably not :/.
[17:29:15] ok :/ thanks
[19:23:23] qchris / nuria: I'm going to try adding this to our /static/public directory configuration in Apache (in staging):
[19:23:24] Header set Access-Control-Allow-Origin "*"
[19:23:43] let me know if you think that's a crappy idea, I'm thinking it's fine
[19:40:07] milimetric: I lack experience with this header. But reading about it does not make my alarm bells go off.
[19:40:19] yeah
[19:40:24] Are we doing it to solve an issue people are having, or just to be proactive?
[19:40:36] agreed. now the problem is how to set it and make it pass through the weird proxy thing we have set up
[19:40:49] :) thanks qchris!
[19:56:36] milimetric
[19:56:40] that is the CORS header
[19:56:53] so anyone can request those files in x-domain
[19:57:03] yeah, i know
[19:57:22] but it is broken through our proxy :(
[19:57:38] are you adding it to flask?
[19:57:50] I would add it to apache for that dir only, right?
[19:58:33] yeah, i'm trying to add that line to apache's config for that dir
[19:58:44] but apache serves through nginx
[19:58:56] and seems to not pass that along
[19:59:13] mm... maybe headers are whitelisted on the nginx side
[19:59:36] BTW, we should also add cache headers while we are at it.
[20:01:28] yes, indeed
[20:03:22] milimetric: in meeting, will check back in a bit
[20:03:27] no prob nuria, I know
[20:03:41] I'm just wondering out loud, I'll go bother labs people
[21:44:05] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694 (christian) NEW p:Unprio s:normal a:None It seems oxygen and analytics are having packetloss issues. Currently >10%. From http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-...
[21:44:39] milimetric: You said you wanted to grab production issues :-) ... ^
[21:45:05] yep, ok
[21:45:54] Thanks :-)
[21:47:18] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694 (Dan Andreescu) a:Dan Andreescu
[21:48:02] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694 (Dan Andreescu)
[21:56:28] thanks guys -- let me know if you need anything from me
[21:57:33] Analytics / Refinery: Story: Admin has duplicate monitoring in Icinga - https://bugzilla.wikimedia.org/67128 (Dan Andreescu) a:Dan Andreescu>christian
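On the 17:22 question about ranking by characters instead of bytes: stored article sizes are UTF-8 byte counts, so languages written in multi-byte scripts look "bigger" per character. A small Python sketch of estimating a bytes-per-character coefficient per language; the sample strings are toy data, and real samples would have to come from dumps or the API, which as noted above is the hard part.

# -*- coding: utf-8 -*-
# Estimate an average UTF-8 bytes-per-character coefficient from sample texts;
# byte-based size totals could then be divided by this coefficient.

def bytes_per_character(sample_texts):
    total_bytes = sum(len(t.encode('utf-8')) for t in sample_texts)
    total_chars = sum(len(t) for t in sample_texts)
    return float(total_bytes) / total_chars if total_chars else 1.0

# Toy samples; a real estimate would use actual article text for each wiki.
english_samples = [u'The quick brown fox jumps over the lazy dog.']
russian_samples = [u'Быстрая коричневая лиса прыгает через ленивую собаку.']

print(bytes_per_character(english_samples))  # ~1.0 for ASCII-only text
print(bytes_per_character(russian_samples))  # ~1.9, since Cyrillic takes 2 bytes per character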
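The Apache change discussed at 19:23 would, spelled out, look something like the excerpt below. The directory path and cache lifetime are placeholders rather than the actual staging values; it assumes mod_headers is enabled and also adds the cache headers suggested at 19:59. As noted in the channel, the nginx proxy in front still has to pass these headers through for them to reach the browser.

<Directory /srv/wikimetrics/static/public>
    # CORS: let any origin fetch the public report files cross-domain
    Header set Access-Control-Allow-Origin "*"
    # Cache headers, per the 19:59 suggestion; the max-age is a placeholder
    Header set Cache-Control "public, max-age=300"
</Directory>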
[22:03:33] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694#c1 (Dan Andreescu) Will write here anything I find. So it looks like packets into Analytics1003 increased by an unusual margin starting around 20:30 and going through 22:00 at the time o...
[22:07:38] milimetric: check out this kafka graph
[22:07:38] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&tab=v&vn=kafka&hide-hf=false
[22:07:50] looks like 1022 (upload server) is spiking?
[22:08:31] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694#c2 (Toby Negrin) Kafka graphs show a spike from 1022: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&tab=v&vn=kafka&hide-hf=false
[22:09:32] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694#c3 (Dan Andreescu) according to puppet, analytics1003 is set up with role::analytics::kafkatee::webrequest::mobile, but I'm not too familiar with the infrastructure. My theory so far is...
[22:09:50] yeah, tnegrin, I was just commenting the same thing :)
[22:09:57] you think it's world cup semis too right?
[22:10:10] huh. Okay, I hadn't thought of that ;p.
[22:10:11] well -- yeah -- it is from europe
[22:10:25] my bad -- VA
[22:10:39] milimetric, so these are requests for images, nooot uploads?
[22:10:51] anyone know how to dump the kafka stream?
[22:11:11] yeah, upload serves images that have been uploaded, I think
[22:11:22] and also takes upload requests of course
[22:11:30] hmn.
[22:11:52] checking kafka graph closely now
[22:13:08] I'm going to try and grab chad and see if there's any data we can get from the spike period itself
[22:13:12] (if we act fast)
[22:13:43] yeah, the spike is clear here: http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics1003.eqiad.wmnet&mreg[]=kafka.rdkafka.topics..%2B%5C.next_offset.per_second&z=large&gtype=stack&title=kafka.rdkafka.topics..%2B%5C.next_offset.per_second&aggregate=1&r=4hr
[22:14:09] it's just too coincidental that it started at the same exact time as the match
[22:15:25] yup
[22:15:57] totally
[22:17:01] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694#c4 (Dan Andreescu) problems seem fairly contained to esams, for example: http://ganglia.wikimedia.org/latest/?hreg[]=%28amssq%7Ccp%29.%2B&mreg[]=kafka.varnishkafka%5C.kafka_drerr.per_sec...
[22:17:10] http://ganglia.wikimedia.org/latest/?hreg[]=%28amssq%7Ccp%29.%2B&mreg[]=kafka.varnishkafka%5C.kafka_drerr.per_second&z=large&gtype=line&title=kafka.varnishkafka%5C.kafka_drerr.per_second&aggregate=1&r=4hr&dg=1&tab=v
[22:17:13] just to check, is this data being thrown on the floor or going somewhere? Only this seems like it'd make a fascinating writeup
[22:17:32] thrown on the floor from udp2log, but hopefully not from kafka, but maybe
[22:17:50] unfortunate timing on a team decision we *just* made to have me handle these kinds of problems
[22:17:58] obviously christian would have had an actual answer :)
[22:18:08] yeah, but is kafka storing it anywhere? Like: is hive/hadoop down, or just inaccessible?
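On the 22:10 question about dumping the kafka stream: assuming the brokers run a stock Kafka 0.8 install, the bundled console consumer can tail a topic from the command line. The zookeeper address and topic name below are placeholders, not the actual production values.

# tail a topic from a Kafka 0.8 cluster (placeholder host and topic)
kafka-console-consumer.sh --zookeeper zookeeper.example.org:2181 --topic webrequest_upload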
[22:18:18] non-urgent question, I just think it could make an interesting writeup
[22:18:24] yeah, if it's not dropping it, it should be all written to hdfs
[22:18:34] we can ask otto tomorrow where the kafka data is
[22:18:38] "7, 1, 50k: what the world cup means for data"
[22:18:39] I assume it's on the brokers
[22:18:42] *nods*
[22:18:42] unless the cluster is still down, which I don't remember what andrew's last status on that was
[22:19:30] so, it's not coinciding with when the game started, but with when drdee said "holy shit"
[22:19:37] followed shortly by tnegrin saying "holy shit"
[22:19:44] so I'm assuming something crazy happened
[22:19:48] DON'T TELL ME
[22:19:54] milimetric: udp2log drops. kafka should have them stored.
[22:19:59] hehe
[22:20:01] qchris, yay!
[22:20:02] :)
[22:20:03] right?
[22:20:11] #guardianangel
[22:20:12] go analytics engineering :D
[22:20:16] you know what this means, right?
[22:20:25] Our analytics infrastructure is better at handling the world cup than Brazil
[22:20:28] lol
[22:20:41] :-P
[22:20:42] Ironholds for director
[22:20:43] DON'T TELL HIM
[22:20:46] sorry!
[22:20:50] lol
[22:20:51] milimetric, you should go watch the game.
[22:20:54] HE'LL BE CROSS
[22:20:55] seriously
[22:21:02] For best effect, play it double-speed with the Benny Hill theme tune in the background.
[22:21:04] ok guys, i'm out, doesn't look like anything urgent
[22:21:08] you should watch it milimetric
[22:21:09] lol, got it
[22:21:09] but that's my advice for all sporting events, so..
[22:21:14] agreed
[22:21:27] :-)
[22:21:28] you can send your live commentary here. haha
[22:22:43] kafka staying up, though? That's awesome. Talk about road-testing stuff!
[22:22:58] * Ironholds is tempted to send Otto an email with a big thumbs up
[22:23:01] let's see how it handles popes
[22:23:05] then I'll be impressed
[22:23:30] whether it'll work, or pope the hell out, you mean?
[22:27:08] (look, I make /myself/ laugh ;p)
[22:29:22] Just gotta give kafka enough pope to hang itself with
[22:33:49] ok, signing off for good now
[22:38:55] (PS1) QChris: Adapt default auxpath for cluster setup [analytics/refinery] - https://gerrit.wikimedia.org/r/144842
[23:09:57] Ironholds: How have you not groaned at me
[23:23:45] marktraceur, for what?
[23:23:56] 2014-07-08 - 15:29:22 Just gotta give kafka enough pope to hang itself with
[23:28:19] uuugh
[23:37:16] Analytics / General/Unknown: Packetloss issues on oxygen (and analytics1003) - https://bugzilla.wikimedia.org/67694#c5 (Bawolff (Brian Wolff)) We've had some limited reports that some deleted files are being purged, which could be caused by packetloss of the htcp packets. I wonder if these issues are r...
[23:56:03] Analytics / Refinery: Story: Admin has duplicate monitoring in Icinga - https://bugzilla.wikimedia.org/67128#c6 (Kevin Leduc) Moving story to next sprint since it has not been completed this Sprint.
[23:56:16] Analytics / Refinery: Story: Admin has versioned and sync'ed files in HDFS - https://bugzilla.wikimedia.org/67129#c5 (Kevin Leduc) Moving story to next sprint since it has not been completed this Sprint.