[00:32:29] madhuvishy: ok, after looking into this for 3 hours i know where 25% of our nocookie traffic comes from
[00:32:41] madhuvishy: that leaves... ahem.. 75%
[00:33:06] madhuvishy: i think 4% are -for sure- users and 20% are -for sure- bots
[00:33:15] madhuvishy: will review my numbers
[00:36:43] madhuvishy: so to our last access numbers we need to add at least 1%
[00:51:21] is there any issue with me temporarily using ~120G on the analytics cluster? ganglia makes it look like it's pretty minimal compared to available space, but thought i should ask
[00:52:30] basically creating a week's fill of a reduced pageview_hourly table (only project, page_title, view_count and page_id fields) so i have something with page_id values to work with for testing generation of cirrus scoring information from it
[00:52:35] (in /user/ebernhardson)
[04:47:46] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[04:51:38] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[09:52:22] (PS1) Addshore: Ignore values with _s in them [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255512
[09:52:38] (CR) Addshore: [C: 2 V: 2] Ignore values with _s in them [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255512 (owner: Addshore)
[10:02:44] (CR) Joal: [C: 2 V: 2] "Merging :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/236224 (owner: Joal)
[10:28:44] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: More solid Eventlogging alarms for raw/validated {oryx} [8 pts] - https://phabricator.wikimedia.org/T116035#1833425 (JAllemandou) a:JAllemandou>Ottomata
[10:29:40] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: More solid Eventlogging alarms for raw/validated {oryx} [8 pts] - https://phabricator.wikimedia.org/T116035#1738860 (JAllemandou) Code looks ok, has been merged, then a ton of icinga check errors were raised. Andrew take...
[11:31:31] (Draft4) Addshore: Add rc script [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/255518
[11:46:20] hi joal?
[12:08:41] Hey mforns :)
[12:09:19] will need to go and give lunch to Lino soon
[12:09:35] I have reviewed / re-organised the etherpad a bit
[12:10:06] I think now we should look for the link between distinct IPs per city and per UAs
[12:10:11] mforns: --^
[12:12:45] joal, hey
[12:12:47] :]
[12:12:56] I saw you organized it
[12:13:31] I was reading a bit on k-anonymity and l-diversity and t-closeness
[12:13:50] k
[12:13:55] and thinking of how we were going to approach the sanitizing
[12:14:55] I have some ideas if you want to discuss in the batcave, but first go have lunch, I'll try to get some relation between #ips and #pageviews
[12:15:05] great mforns
[12:15:17] I'll ping you when the little one goes back to bed :)
[12:15:23] cool joal
[13:49:44] hey mforns
[13:49:49] I'm back :)
[13:49:50] hey joal :]
[13:49:52] cool
[13:49:59] batcave?
[13:50:05] sure, omw
[13:56:39] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[13:58:46] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
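A minimal sketch of the reduced pageview_hourly table ebernhardson describes at 00:52, assuming HiveQL on the analytics cluster; the database/table names, the partition filter, and the availability of page_id on the source table are assumptions for illustration, not the actual job:

```python
# Sketch only: build a one-week, reduced copy of pageview_hourly with just the
# fields mentioned above (project, page_title, view_count, page_id), so that
# page_id-keyed scoring data can be joined against it.  Database/table names,
# the partition filter, and the presence of page_id on the source table are
# hypothetical.
import subprocess

HQL = """
CREATE TABLE ebernhardson.pageview_hourly_reduced
STORED AS PARQUET AS
SELECT
  project,
  page_title,
  page_id,
  SUM(view_count) AS view_count
FROM wmf.pageview_hourly
WHERE year = 2015 AND month = 11 AND day BETWEEN 19 AND 25  -- one week
GROUP BY project, page_title, page_id
"""

# Run through the hive CLI; beeline or a HiveServer2 client would work as well.
subprocess.check_call(["hive", "-e", HQL])
```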
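And a rough illustration of the relation joal and mforns discuss around 12:10-12:15: how many distinct IPs fall into each (city, user agent) group, those group sizes being the "k" that k-anonymity is concerned with. The input file and field names below are hypothetical:

```python
# Sketch only: how identifying is (city, user_agent)?  Count distinct IPs and
# pageviews per group; groups with few distinct IPs (small k) are the
# re-identification risk.  Field names and the input extract are hypothetical.
import csv
from collections import defaultdict

ips_per_group = defaultdict(set)
pageviews_per_group = defaultdict(int)

with open("sampled_requests.tsv") as f:  # hypothetical extract
    for row in csv.DictReader(f, delimiter="\t"):
        key = (row["city"], row["user_agent"])
        ips_per_group[key].add(row["ip"])
        pageviews_per_group[key] += 1

small_groups = [k for k, ips in ips_per_group.items() if len(ips) < 5]
print("%d (city, UA) groups have fewer than 5 distinct IPs" % len(small_groups))
for key in small_groups[:10]:
    print(key, len(ips_per_group[key]), "IPs,", pageviews_per_group[key], "pageviews")
```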
[16:19:45] ottomata !!!! Helloooooo :)
[16:19:57] Sorry to jump on you on arrival :)
[16:20:55] hue is dead I think ottomata :(
[16:23:34] Hmm ok...
[16:23:46] ottomata: sorry :(
[16:25:01] !log restarting hue on analytics1027
[16:25:15] ottomata: Do I have the right to do that ?
[16:26:39] hm, don't think so
[16:26:42] but, i don't think it helped
[16:26:47] qrfn
[16:26:51] you can look at logs in /var/log/hue there though
[16:26:56] [26/Nov/2015 16:25:20 ] supervisor ERROR Process /usr/lib/hue/build/env/bin/hue runcpserver exited abnormally. Restarting it.
[16:27:02] not sure why yet though..
[16:27:02] Oh ok
[16:28:11] hm ok it's ok now
[16:28:19] it just didn't restart properly because one of the processes didn't shut down
[16:28:24] so there was a port conflict
[16:28:30] not sure why it broke though
[16:28:38] ok, so you actually killed then started instead of restarted
[16:29:43] yeah
[16:29:57] Ok, thanks a lot ottomata :)
[16:30:01] joal: btw, afaik your check graphite stuff is totally fine
[16:30:19] it's out on all kafka_drerr checks for varnishkafka now
[16:30:24] yeah, I mean, I expected so, reviewed multiple times, tested etc
[16:30:24] i'm going to just enable it for the eventlogging one
[16:30:26] But deploy ?
[16:30:36] and then we can make it happen for all on monday
[16:30:43] ok ottomata
[16:30:44] just want to keep it easy to revert until it is not a holiday :)
[16:30:52] Supposedly no deploy in the next 7 weeks ?
[16:31:03] ottomata: --^ will be ok ?
[16:31:04] no, not true, that is for mediawiki
[16:31:09] ah ok :)
[16:31:11] and, it is not all next 7 weeks
[16:31:17] just these two
[16:31:17] right :)
[16:31:22] and then the few around xmas and the all hands
[16:31:47] k makes sense
[16:32:39] ottomata: if you apply the change to EL, would be awesome if you could change the metric and thresholds as well
[16:34:00] ja am doing that bit, just taking exactly what you had for it and using that
[16:34:05] this right?
[16:34:05] https://gerrit.wikimedia.org/r/#/c/254846/3/modules/eventlogging/manifests/monitoring/graphite.pp
[16:35:05] i also have a vested interest in this thing working, btw
[16:35:10] am tired of getting texted by it :p
[16:35:10] :)
[16:35:19] joal https://gerrit.wikimedia.org/r/#/c/255550/
[16:35:57] awesome! yeah, same for me reviewing emails :)
[17:01:09] ottomata: quick standup?
[17:01:27] Saying hello ottomata ;)
[17:02:05] ok!
[17:11:19] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[17:13:19] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 25.00% above the threshold [20.0]
[17:25:32] ok joal, mforns. eventlogging_difference_raw_validated now using --until 5min
[17:25:43] great ottomata : Thanks !
[17:33:17] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [30.0]
[17:35:17] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0]
[17:53:17] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [30.0]
[18:03:37] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0]
[18:09:46] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[18:09:55] ottomata: you here?
[18:11:29] ja
[18:11:32] dawww
[18:11:35] yup
[18:12:48] why the heck does it even think that
[18:13:37] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0]
[18:14:10] it thinks that 3 minutes in the 15 minute window had over 30 not validated events?
[18:14:14] is that right?
[18:14:47] i didn't get texted for these though...?
[18:14:57] ohh
[18:14:58] no, it thinks that 3 points in the last 15 minutes (with 5 minutes lag) are not equal
[18:15:01] but that's because i have no signal at home
[18:15:01] haha
[18:15:12] not equal?
[18:15:23] no, the threshold for alerting critical is 30
[18:15:24] right?
[18:17:10] k btw, if i remove the absolute from the graph, it dips below 0 at times
[18:17:17] i assume just because of the same reason this alert is funky
[18:17:30] yeah, it's not absolute that's the issue, it's the lag that should be bigger
[18:17:45] because the numbers are reported at slightly different times, so the difference is weird
[18:17:48] yeah, maybe larger lag should help
[18:17:51] or maybe we should just ditch this alert
[18:17:53] altogether
[18:18:03] and just rely on EventError or something...
[18:18:05] not as good, but
[18:18:06] hm
[18:18:06] absolute is good --> since we diff values that can erratically be bigger or smaller, we want the absolute
[18:18:29] But, looking with marcel, the lag needed for the diff is bigger than 5 minutes: actually more like 8
[18:18:36] i mean i guess so, in that case the movingAverage is better, because the absolute-valued negatives are really strange to see
[18:18:49] ok i guess so, submit that thar patch and I will merge it :)
[18:18:51] make it 10 :)
[18:18:58] Awesome
[18:19:00] Thanks
[18:19:03] worth a try
[18:19:13] * joal goes to puppet
[18:22:31] ottomata: https://gerrit.wikimedia.org/r/255556
[18:23:00] joal # At least 3 of the (20 - 5) = 15 readings
[18:23:05] is that now 25 - 10
[18:23:05] ?
[18:23:15] It is ottomata :)
[18:23:54] * joal amends !
[18:24:16] ottomata: done
[18:30:52] thanks for merging ottomata
[18:31:05] I'm off for now, but will look at the chan occasionally
[18:31:13] Have a good end of day a-team :)
[18:31:23] bye joal :]
[18:34:15] laters!
[20:47:22] a-team see you tomorrow!
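For reference, the check that the conversation above converges on: compare the raw and validated EventLogging rates from graphite over a 15-minute window ending 10 minutes in the past (the larger lag, since the two series are reported at slightly different times), take the absolute difference, and alert when at least 3 of the (25 - 10) = 15 points exceed the threshold (30 for critical, 20 for warning). A minimal sketch of that logic in Python, not the actual check_graphite/puppet code:

```python
# Sketch only: the alerting logic discussed above, written out in plain Python.
# The window is the (25 - 10) = 15 per-minute readings ending 10 minutes in the
# past; the check goes critical when at least 3 of them differ by more than 30.

def check_difference(raw, validated, warn=20, crit=30, min_bad_points=3):
    """raw / validated: per-minute message rates, aligned and already
    restricted to the window [now - 25min, now - 10min]."""
    diffs = [abs(r - v) for r, v in zip(raw, validated)]  # absolute difference
    if sum(d > crit for d in diffs) >= min_bad_points:
        return "CRITICAL"
    if sum(d > warn for d in diffs) >= min_bad_points:
        return "WARNING"
    return "OK"

# One noisy point out of 15 is tolerated; three or more trip the alert.
raw_rates = [500] * 15
validated_rates = [500] * 12 + [460, 455, 450]
print(check_difference(raw_rates, validated_rates))  # -> CRITICAL
```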
[22:04:04] Analytics-Backlog, Wikimedia-Developer-Summit-2016: Developer summit session: Pageview API overview - https://phabricator.wikimedia.org/T112956#1834472 (Symac) I don't know where is the best place to report this but when I had an average time of 200ms for each request last week, I am mainly today at 1s,...
[22:33:40] Analytics-Backlog, Wikimedia-Developer-Summit-2016: Developer summit session: Pageview API overview - https://phabricator.wikimedia.org/T112956#1834516 (JAllemandou) >>! In T112956#1834472, @Symac wrote: > I don't know where is the best place to report this but when I had an average time of 200ms for ea...
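As a footnote to that thread, one simple way to reproduce the latency measurement Symac describes, assuming the public per-article Pageview API endpoint; the project, article and date range below are arbitrary examples:

```python
# Sketch only: time a handful of Pageview API requests to compare against the
# ~200ms vs ~1s figures reported on T112956.  The URL follows the public
# per-article endpoint pattern; project, article and dates are arbitrary.
import time
import requests

URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/Barack_Obama/daily/20151101/20151107")

timings = []
for _ in range(5):
    start = time.time()
    resp = requests.get(URL, headers={"User-Agent": "pageview-latency-sketch"})
    resp.raise_for_status()
    timings.append(time.time() - start)

print("mean response time: %.0f ms" % (1000 * sum(timings) / len(timings)))
```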