[02:24:02] (PS1) Rfaulk: mv. aggregator method to relevant module - decouple from metrics. [analytics/user-metrics] - https://gerrit.wikimedia.org/r/79169
[02:24:03] (PS1) Rfaulk: fix. refs to aggregator method. [analytics/user-metrics] - https://gerrit.wikimedia.org/r/79170
[02:24:23] (CR) Rfaulk: [C: 2 V: 2] mv. aggregator method to relevant module - decouple from metrics. [analytics/user-metrics] - https://gerrit.wikimedia.org/r/79169 (owner: Rfaulk)
[02:24:35] (CR) Rfaulk: [C: 2 V: 2] fix. refs to aggregator method. [analytics/user-metrics] - https://gerrit.wikimedia.org/r/79170 (owner: Rfaulk)
[02:35:13] (PS1) Rfaulk: fix. handling of query string in url first filtering. [analytics/user-metrics] - https://gerrit.wikimedia.org/r/79171
[02:35:33] (CR) Rfaulk: [C: 2 V: 2] fix. handling of query string in url first filtering. [analytics/user-metrics] - https://gerrit.wikimedia.org/r/79171 (owner: Rfaulk)
[13:28:38] qchris: hey
[13:28:45] hi ottomata
[13:28:56] Hi average!
[13:29:14] hihii
[13:32:44] qchris: the spacing is a problem for me, I'm basically using eclipse when editing the code
[13:32:55] but it somehow doesn't want to space properly
[13:33:01] I don't know what to do about it atm
[13:34:10] If you want to, I can fix the remaining things and push the patch set
[13:34:28] but eclipse allows you to adjust tabs/spacing behaviour
[13:34:39] Let me look up the options ...
[13:37:02] Select the project "kraken-dclass". Then in the menu "Project" select "Properties".
[13:37:16] Underneath "Java Code Style" there is "Formatter"
[13:37:29] Check "Enable project specific settings"
[13:37:54] Click the "New" button underneath "Active profile"
[13:38:35] Then you can select Tab handling in "Indentation" / "General settings" / "Tab policy"
[13:39:16] I use "Spaces only"
[13:39:20] Indentation size: 4
[13:39:24] Tab size: 8
[13:39:36] Those settings match what drdee used for the other parts of the code.
[13:39:45] Then hit "OK"
[13:40:16] Select the generated profile in the "Active profile" combobox
[13:40:25] And click "OK" again.
[13:40:46] Then your tab/space-life is hopefully easier :-)
[13:41:05] done
[13:41:07] thanks
[14:03:31] hiya, Snaps, you there?
[14:10:05] hi milimetric
[14:10:13] yesterday fell asleep
[14:10:25] was exhausted
[14:10:36] uhm, we can have a look at wikimetrics today if you have time
[14:14:57] ottomata: I tried to verify some of the mobile log's numbers using ganglia, but I fail. Would you have some time today to help me with it?
[14:21:56] qchris, yeah let's check it out
[14:22:09] ottomata: Thanks.
[14:22:24] so hm, yeah those numbers look like the eqiad mobile hosts just dropped in counts, right?
[14:22:29] ottomata: So does the "Total requests" in ganglia mean total requests served by the node?
[14:22:46] Yes.
[14:23:30] When looking at
[14:23:37] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Mobile%20caches%20eqiad&h=cp1046.eqiad.wmnet&r=hour&z=small&jr=&js=&st=1376574428&v=869015108&m=varnish.s_req&vl=N%2Fs&ti=Total%20Requests&z=large
[14:24:02] It looks like we're averaging around 400 requests/second
[14:24:43] So we'd expect some 34560 requests from the webserver per day in the sampled-1000 stream
[14:25:08] However, we're seeing around 53000
[14:25:17] Does that look sane?
[14:27:51] yeah that sounds right
[14:28:20] Oh ... mhmm. How come we're seeing more hits than ganglia?
[14:30:12] ees good question
[14:30:33] except for today, right?!
[14:30:33] maybe the universe just fixed it for us!
[14:30:34] problem solved!?
:) :p
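The 34560 figure above is just the ganglia rate scaled down by the 1:1000 sampling factor. A small Python sketch of that arithmetic, using only the numbers quoted in the conversation, also shows roughly what request rate the observed ~53000 daily lines would imply:

    # Expected lines per day in the 1:1000 sampled stream for a host serving ~400 req/s,
    # and the request rate implied by the ~53000 lines actually observed.
    REQ_PER_SEC = 400           # rate read off the ganglia graph for cp1046
    SAMPLING = 1000             # sampled-1000 keeps 1 request in 1000
    SECONDS_PER_DAY = 86400

    expected_lines = REQ_PER_SEC * SECONDS_PER_DAY / SAMPLING
    print(expected_lines)       # 34560.0

    observed_lines = 53000
    implied_rate = observed_lines * SAMPLING / SECONDS_PER_DAY
    print(round(implied_rate))  # ~613 req/s, well above the ganglia figure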
[14:30:34] also this page
[14:30:34] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=erbium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1376576876&v=-0.0801058333333&m=packet_loss_average_eqiad_mobile_cp&vl=%25&z=large
[14:30:36] not really helpful though I think
[14:31:21] those are all very small percentages, so it looks like we can't blame udp2log
[14:31:52] Yes.
[14:31:54] hmm, wait weird
[14:32:02] actually sampled comes from emery
[14:32:03] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&r=hour&z=default&jr=&js=&st=1376577088&v=-0.01285125&m=packet_loss_average_eqiad_mobile_cp_1046-1060&vl=%25&z=large
[14:32:46] I like negative packet loss :-)
[14:33:04] But anyways it's really, really small as well.
[14:33:20] actually, this one
[14:33:52] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&r=hour&z=default&jr=&js=&st=1376577088&v=0.5047&m=packet_loss_90th_eqiad_mobile_cp&vl=%25&z=large
[14:33:52] that's the proper one
[14:33:52] oh, no that's 90th
[14:33:52] this one
[14:33:53] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&r=hour&z=default&jr=&js=&st=1376577088&v=0.129658333333&m=packet_loss_average_eqiad_mobile_cp&vl=%25&z=large
[14:33:53] but yeah
[14:33:54] same
[14:33:54] ok
[14:34:33] i'd love to see ezachte's chart on this
[14:34:46] it would show if the sequence numbers for those hosts are too far apart
[14:34:46] :-)
[14:34:58] That would be good to see.
[14:35:18] I checked the puppet repo, but it did not seem to contain anything related.
[14:35:32] Two machines were recommissioned
[14:35:40] But they should not affect us.
[14:36:19] recommissioned?
[14:36:22] or de?
[14:36:30] recommissioned.
[14:36:51] "Recommission cp104[34]"
[14:36:59] Commit: a7a7c56df6f6699ed73da7ebaa14944cb9341166
[14:37:15] I also read it twice as I thought they'd mean decommission.
[14:37:53] Namewise they look sufficiently close to the mobile hosts, but
[14:38:10] unless this somehow caused the load balancer to send them traffic
[14:38:13] I checked against the sampled-1000 stream, and it did not contain lines from them.
[14:38:14] and we aren't collecting that traffic
[14:39:13] So I do not think that this is the case (assuming sampled-1000 is correct)
[14:39:49] well, i just manually confirmed, I see about 400 reqs/sec for cp1046
[14:40:09] and that would match up with the most recent counts in the sampled files
[14:40:13] Mhmm that matches what we see in our logs.
[14:40:15] Yes.
[14:40:33] so, the question now is, why were there more 2 days ago?
[14:41:17] The numbers in ganglia did not change in the last few days, and they always reported around the 400 requests/second
[14:42:17] At least the sampled-1000 and mobile-100 both saw the same drop.
[14:42:29] So they agreed before and after the drop.
[14:42:55] (when filtering to mobile caches and making up for the different sampling rate)
[14:43:34] How could we find out whether the old or the new data is good?
[14:43:41] right
[14:43:49] well, checking distances between sequence numbers would be good
[14:44:07] Ok. Let's wait for Eric then.
[14:44:12] we could check the mobile-100 for both times, and see what distances are
[14:44:22] if they are greater than ~100, then we know we've lost data
[14:44:29] or, maybe we have duplicates in the old data?
[14:44:34] we can check that
[14:45:02] I checked when I started working on the logs, and we did not have duplicates back then.
[14:45:08] I'll recheck.
[14:45:09] hm ok
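A minimal sketch of what such a duplicate check on one of the sampled files could look like; this is not the check qchris actually ran, and the assumption that the hostname and the sequence number are the first two tab-separated columns is purely illustrative:

    # Count (host, sequence number) pairs that occur more than once in a sampled log;
    # any such pairs would indicate duplicate lines.
    import gzip
    from collections import Counter

    seen = Counter()
    with gzip.open("mobile-sampled-100.tsv.log-20130813.gz", "rt") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            seen[(fields[0], fields[1])] += 1    # assumed: host, then sequence number

    dupes = sum(1 for n in seen.values() if n > 1)
    print(dupes, "duplicated (host, seq) pairs")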
[14:52:51] ottomata: I rechecked for the last three days, and they did not contain duplicates
[14:53:48] k cool, i'm seeing if I can get some average distances
[14:57:17] yoooo guys
[14:57:18] back in EST
[14:57:34] Hi drdee welcome to the wrong timezone :-)
[15:00:10] ty!
[15:00:15] about dclass
[15:00:17] that sucks
[15:00:22] and sounds like a new card :)
[15:00:23] Yes, it does :-(
[15:00:29] New card? :-D
[15:00:32] yup
[15:00:34] Sure hehe.
[15:01:06] But what does it mean for the kraken dependencies?
[15:01:11] Do we settle with the
[15:01:17] new dclass but old data for now?
[15:02:20] (Old data is no longer in the wikimedia apt repo)
[15:03:22] qchris: old data can be found here (the old dtree files) http://garage-coding.com/releases/libdclass-dev/ . Forgot which version of the package we were using before though
[15:03:33] hi drdee
[15:03:53] average_: Ok.
[15:04:02] let's stick with the old data for now
[15:15:57] qchris, everything in both 08-13 and 08-15 is looking pretty normal so far
[15:16:00] check it
[15:16:37] ottomata: Mhmm ... even the drop from 60K/day to 36K/day.
[15:16:42] ?
[15:17:41] zcat /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130813.gz | grep 'cp1059.eqiad.wmnet' | /home/otto/mobile/seq_dist.sh
[15:17:49] and
[15:17:54] zcat /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20130815.gz | grep 'cp1059.eqiad.wmnet' | /home/otto/mobile/seq_dist.sh
[15:18:09] Cool.
[15:19:15] if both of these files have an average distance of 100
[15:19:52] if the average seq distance for a host is 100
[15:20:02] then that means that sampling and packet loss was fine
[15:20:11] and that the host itself generated fewer requests
[15:20:22] That's good.
[15:20:44] But could it be that a load balancer uses a new host that we do not yet get input from?
[15:21:03] Because we drop more than 10% in a single day.
[15:21:39] It's good to see that we do not suffer packet loss. But the total looks suspicious.
[15:22:15] yeah totally
[15:22:20] i think we have to ask ops now
[15:22:26] Wait ...
[15:22:36] Your script only checks for positive sequence numbers ...
[15:22:44] Didn't we also have negative ones?
[15:23:09] Sorry.
[15:23:11] My bad.
[15:23:41] Let's ask ops.
[15:33:33] no, no negative seq nums
[15:34:29] ottomata: :-) yes, stupid me. I got confused.
[15:34:48] yeah, the negative packet loss comes from statistical error stuff I think
[15:35:13] ottomata: Btw I was just about to ping the RT op about our problem...
[15:35:22] ottomata: then I realized that would be you anyways.
[15:35:28] haha, yup
[15:35:39] ottomata: So as no one responded in #ops ... whom could we ask?
[15:35:39] yeah, mark isn't online
[15:35:41] he's the one I'd ask
[15:35:47] did you get on ops@ list yet?
[15:35:50] Yes.
[15:35:52] yay!
[15:35:53] ask there
[15:36:09] Ok.
[15:39:22] ottomata: yep
[15:40:18] yo Snaps, s'ok, i sent an email instead :)
[15:40:44] ottomata: got it!
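The seq_dist.sh script used in the pipelines above is not shown in this log; the following Python sketch only illustrates the idea discussed here: for one cache host, the mean gap between consecutive sequence numbers should stay close to the sampling factor (~100 for mobile-sampled-100) if nothing was lost. The column positions and the behaviour of the real script are assumptions:

    # Per-host sequence-distance check: sort the sequence numbers seen for one host
    # and report the mean gap between consecutive ones. A mean near the sampling
    # factor suggests no loss; much larger gaps suggest dropped requests.
    import gzip
    import sys

    HOST = "cp1059.eqiad.wmnet"
    SEQ_FIELD = 1               # assumed: sequence number is the second TSV column

    seqs = []
    with gzip.open(sys.argv[1], "rt") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == HOST:               # assumed: hostname is the first column
                seqs.append(int(fields[SEQ_FIELD]))

    seqs.sort()
    gaps = [b - a for a, b in zip(seqs, seqs[1:])]
    print("mean seq distance:", sum(gaps) / len(gaps) if gaps else "n/a")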
[16:24:42] drdee, no standup today, right?
[16:25:05] no, we have standup today
[16:49:22] hey qchris
[16:49:31] Hi drdee
[16:49:55] shall we just wrap up https://gerrit.wikimedia.org/r/#/c/75349 today?
[16:50:38] I am firefighting around Wikipedia Zero
[16:51:04] Once we figured out what's happening there,
[16:51:19] I'll fix the 75349. Is that ok?
[16:56:29] cool
[16:58:07] drdee, calendar says grooming today
[16:58:09] but no standup
[16:59:07] oink
[16:59:11] no it's thursday
[16:59:14] we should have standup
[16:59:24] i think we have both meetings
[16:59:28] let's talk about it during scrum
[17:01:48] ottomata: scrum
[17:02:22] SO MANY MEETINGS!!! :)
[17:24:59] on my way....
[20:35:47] oh milimetric, i found a small oauth issue: https://mingle.corp.wikimedia.org/projects/analytics/cards/1069
[20:36:23] oh right
[20:36:37] I did test that but a couple small things changed since my test
[20:36:45] good catch, we'll fix it with low priority?
[20:40:31] sure
[22:07:18] milimetric, hangout?
[23:12:51] (PS10) QChris: Updating Kraken to cope with libdclass 2.2.2 [analytics/kraken] - https://gerrit.wikimedia.org/r/75349 (owner: Stefan.petrea)
[23:16:50] (PS11) QChris: Updating Kraken to cope with libdclass 2.2.2 [analytics/kraken] - https://gerrit.wikimedia.org/r/75349 (owner: Stefan.petrea)
[23:18:57] (CR) QChris: "(10 comments)" [analytics/kraken] - https://gerrit.wikimedia.org/r/75349 (owner: Stefan.petrea)