[00:01:01] Analytics-Kanban: Restart Pentaho - https://phabricator.wikimedia.org/T105107#1436463 (kevinator) NEW
[00:12:20] :)
[01:34:48] Analytics-Kanban, Research-and-Data: Validate Uniques using Last Access cookie {bear} - https://phabricator.wikimedia.org/T101465#1436543 (leila) @madhuvishy, Thanks for this. How hard is it in terms of engineering resources and otherwise to count uniques for a short period of time, say 24 hours, and comp...
[06:07:33] Analytics-Backlog, Wikimania-Hackathon-2015: Dockerize Hadoop Cluster, Druid, and Samza + Load Test - https://phabricator.wikimedia.org/T102980#1436726 (Qgil) a:Milimetric OK, thank you. We are just aiming to have all the confirmed sessions assigned to someone, in order to make it easier for anybody to...
[07:28:52] Thanks jgage for the kafka cluster repair :)
[10:46:33] Analytics, Engineering-Community, Research-and-Data, ECT-July-2015: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1437191 (Qgil) @anomie @bd808 @tgr, is it ok to add this task to the #Reading-infrastructure-team backlog? Not only because this small and b...
[11:14:46] Analytics, Analytics-Backlog, Performance-Team, Patch-For-Review: Collect HTTP statistics about load.php requests - https://phabricator.wikimedia.org/T104277#1437265 (Krinkle) Open→Resolved >>! In T104277#1432643, @ori wrote: > What issues? Mainly two things observed: * A (high) presence of H...
[12:06:50] (CR) Joal: [C: 2 V: 2] "Looks good to me :)" [analytics/aggregator] - https://gerrit.wikimedia.org/r/223031 (https://phabricator.wikimedia.org/T95339) (owner: Mforns)
[12:41:09] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[12:43:10] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[14:21:03] hey milimetric
[14:21:30] hey joal
[14:21:39] About puppet on wikimetrics, any news ?
[14:21:58] joal: no, haven't touched it, I just made a task
[14:22:10] k cool
[14:22:11] I was going to ping marcel to see if he can take a look
[14:22:22] it's in progress now so we can't forget about it anymore
[14:22:28] but I've been fighting this insane date logic
[14:22:41] date logic ?
[14:22:45] I'm almost done this time, with the 3rd round of unexpected complexity
[14:22:56] the wikimetrics local / zoned / global / local default dance
[14:23:21] one of the more complicated pieces of code I ever wrote, with almost 0 motivation to write it because it's such low value
[14:23:28] like, it saves 10 people 2 clicks here and there...
[14:23:37] :(
[14:23:44] yeah... not a good use of time...
[14:23:50] but we promised we'd do it and we're a team of our word
[14:23:53] But, when it's done ... :)
[14:25:25] Thanks for the info
[14:54:54] Analytics, Engineering-Community, Research-and-Data, ECT-July-2015: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1437769 (bd808) >>! In T102079#1437191, @Qgil wrote: > @anomie @bd808 @tgr, is it ok to add this task to the #Reading-infrastructure-team ba...
[15:02:06] joal, hi! you around? I can't find anywhere how to deploy the aggregator, or if it is already deployed?
[15:09:39] Analytics-Kanban: Troubleshoot EventLogging validation alerts - https://phabricator.wikimedia.org/T105167#1437818 (mforns) NEW a:mforns
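An aside on the PROBLEM/RECOVERY lines above: the check compares the raw and validated EventLogging message rates in graphite and fires when too many recent datapoints diverge. A minimal sketch of that comparison, assuming icinga-style "% of data above threshold" semantics — the raw.rate metric name appears later in this log, but the validated metric name and the check internals are guesses, not the production check:

```python
# Sketch of the alert logic behind the PROBLEM/RECOVERY lines above,
# assuming the check works like icinga's "percent of datapoints above
# threshold" style: alert when more than N% of recent datapoints show too
# large a raw-vs-validated gap. eventlogging.overall.raw.rate appears in
# this log; the validated metric name and the math are assumptions.

def percent_over_threshold(raw, validated, threshold):
    """Return the percentage of datapoints whose relative difference
    between raw and validated rates exceeds `threshold` (in percent)."""
    points = list(zip(raw, validated))
    over = sum(1 for r, v in points if r > 0 and (r - v) / r * 100 > threshold)
    return 100.0 * over / max(len(points), 1)

# Example mirroring the 12:41 alert: critical threshold 30, and the check
# fired because 20% of the datapoints were above it.
raw_rate = [100, 100, 100, 100, 100]    # eventlogging.overall.raw.rate
valid_rate = [95, 96, 60, 94, 97]       # validated rate (name assumed)
print(percent_over_threshold(raw_rate, valid_rate, threshold=30))  # 20.0
```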
[15:16:03] hey mforns :)
[15:16:18] joal, hey!
[15:16:28] Normally aggregator gets deployed automagically by puppet
[15:16:32] joal, ok
[15:16:34] cool
[15:16:51] joal, where does it live?
[15:16:54] I wanted to double check that now, and maybe have a manual run to generate the all.csv dataset for historical data
[15:17:10] We can do it together if you want ?
[15:17:16] joal, we still need to add the --all-projects flag
[15:18:16] joal, can it be after standup? I'd like to do something before standup
[15:18:29] mforns_brb: sure :)
[15:18:32] ok
[15:22:33] (PS11) Milimetric: Add global default report fields [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/217857 (https://phabricator.wikimedia.org/T74117)
[15:43:51] joal, are you done for today?
[15:44:07] nope :)
[15:44:28] let's double check deploy if you want
[15:44:57] joal, sure! but only if you have time
[15:45:05] Definitely ok :)
[15:45:08] batcave ?
[15:45:10] batcave?
[15:45:12] :]
[15:45:14] :)
[15:52:11] jgage: Hi !
[15:56:12] goin to lunch yall, bbl
[15:56:42] enjoy milimetric
[15:57:45] oh madhuvishy do you wanna go to scrum of scrums today?
[15:57:55] or anyone else for that matter
[15:58:07] it's in 1.5 hours and I think I'm going to miss it
[15:59:42] just in case, if someone goes, I put "nothing to report" in http://etherpad.wikimedia.org/p/Scrum-of-Scrums because I couldn't think of anything to report that's relevant for other teams right now
[16:00:26] joal: hi :)
[16:00:42] not sure if we have a meeting today or if we're skipping, i'm fine with either
[16:00:57] andrew being off, do we cancel ops-analytics meeting ?
[16:01:18] jgage: --^
[16:01:20] i don't really have anything to discuss. kafka 18 & 21 were out yesterday, i triggered a reelection and that was that
[16:02:10] yeah, thx for that
[16:03:04] sure, let's cancel it. i'm not sure if we'll have one next week either, i think otto will be at wikimania
[16:05:01] yup, he will, and so will I
[16:05:05] will you jgage ?
[16:05:42] unfortunately no
[16:05:49] i want to go to the one in italy next year :D
[16:05:57] I understand that :)
[16:06:40] ok, i will see you in our meeting on july 22 then, unless something explodes
[16:06:42] jgage: we are gonna make a change in puppet, do you mind reviewing ?
[16:06:49] sure
[16:06:52] jgage: right
[16:07:01] I'll let you know when ready
[16:07:11] k, just add me as reviewer
[16:07:14] :)
[16:07:17] awesome, thx
[16:08:29] joal: vital signs is updated again
[16:08:40] milimetric: you ROCK :)
[16:08:42] puppet should run fine but I'll leave the bug in "code review" just so we can check on it again tomorrow
[16:08:48] I did nothing, just the basics here: https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster
[16:09:56] Analytics-Backlog, Wikimania-Hackathon-2015: Dockerize Hadoop Cluster, Druid, and Samza + Load Test - https://phabricator.wikimedia.org/T102980#1438045 (Milimetric) Oh I'm happy to be the point of contact, but if anyone is reading this, grab anyone on the list above if you can't find me.
[16:10:45] Analytics-Kanban: Bug: puppet not running on wikimetrics1 instance, Vital Signs stale {musk} [5 pts] - https://phabricator.wikimedia.org/T105047#1438048 (Milimetric) followed directions here [1] and puppet ran cleanly again. I will leave this in code review so we can monitor and make sure it runs periodicall...
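On joal's idea above of a manual run to generate the all.csv dataset: conceptually, an --all-projects aggregation is a per-date sum across every project's counts. A rough sketch under that assumption — the file layout and column order are hypothetical, and the real analytics/aggregator code surely differs:

```python
import csv
from collections import defaultdict

# Hypothetical sketch of an all-projects aggregation: sum each day's
# counts across every per-project CSV into a single all.csv. File layout
# and column order are assumptions, not the real aggregator's behavior.

def aggregate_all(project_csvs, out_path="all.csv"):
    totals = defaultdict(int)  # date -> summed count across projects
    for path in project_csvs:
        with open(path) as f:
            for date, count in csv.reader(f):
                totals[date] += int(count)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for date in sorted(totals):
            writer.writerow([date, totals[date]])

# e.g. aggregate_all(["enwiki.csv", "dewiki.csv", "frwiki.csv"])
```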
[16:11:20] milimetric: you had to re-setup everything ?
[16:12:05] joal: nono, just the update part. I basically git pull --rebase in both the /var/log/git/operations/puppet repo and the private one
[16:12:14] and then re-ran puppet to make sure it's all good
[16:12:23] mforns: before you review the "test data for pageview API" task, I am gonna provide a small analysis on how much k=100 impacts
[16:12:23] now it looks enabled so it should run every 30 min. but you never know :)
[16:12:51] milimetric: that's great :)
[16:12:58] joal, ok, let me know when I can start :]
[16:13:09] I'll double check tomorrow
[16:13:34] * joal thanks milimetric for the beautiful updated chart :)
[16:14:05] joal: do you happen to know why kafka and jmxtrans are listening on a few random high ports?
[16:14:08] gage@analytics1012:~$ sudo lsof -i -n -P | grep LISTEN | grep java | awk '{print $9, $3}' | sort -n -k 1.3
[16:14:11] *:2101 jmxtrans
[16:14:14] *:9092 kafka
[16:14:16] *:9999 kafka
[16:14:19] *:47400 kafka
[16:14:21] *:51771 jmxtrans
[16:14:24] *:55286 jmxtrans
[16:14:26] *:60924 kafka
[16:15:25] my question is related to this changeset to create firewall rules for those hosts: https://gerrit.wikimedia.org/r/#/c/223534/
[16:23:00] ok, really going for lunch now. I'll be gone for a while 'cause i have an errand to run
[16:48:39] (PS1) Mforns: Add clarifying comment on --all-projects behavior [analytics/aggregator] - https://gerrit.wikimedia.org/r/223577 (https://phabricator.wikimedia.org/T95339)
[16:48:46] (CR) jenkins-bot: [V: -1] Add clarifying comment on --all-projects behavior [analytics/aggregator] - https://gerrit.wikimedia.org/r/223577 (https://phabricator.wikimedia.org/T95339) (owner: Mforns)
[16:53:09] (PS2) Mforns: Add clarifying comment on --all-projects behavior [analytics/aggregator] - https://gerrit.wikimedia.org/r/223577 (https://phabricator.wikimedia.org/T95339)
[16:53:25] jgage: sorry for missing the pings
[16:53:37] jgage: I am not good at jmxtrans conf :(
[16:53:41] can't help really
[17:18:13] Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation, ContentTranslation-Release6: find how much are the articles that are created using ContentTranslation read - https://phabricator.wikimedia.org/T105194#1438288 (Amire80) NEW
[17:26:13] Analytics, Engineering-Community, Research-and-Data, ECT-July-2015: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1438360 (Tgr) IMO we should not reinvent any wheels - API requests are normal web requests, and there is already extensive machinery in plac...
[17:32:19] (CR) Joal: Add clarifying comment on --all-projects behavior (1 comment) [analytics/aggregator] - https://gerrit.wikimedia.org/r/223577 (https://phabricator.wikimedia.org/T95339) (owner: Mforns)
[17:33:23] milimetric: when you're back we can do the code review for the wikimetrics task. let me know :)
[17:34:27] (PS3) Mforns: Add clarifying comment on --all-projects behavior [analytics/aggregator] - https://gerrit.wikimedia.org/r/223577 (https://phabricator.wikimedia.org/T95339)
[17:34:42] (CR) Mforns: Add clarifying comment on --all-projects behavior (1 comment) [analytics/aggregator] - https://gerrit.wikimedia.org/r/223577 (https://phabricator.wikimedia.org/T95339) (owner: Mforns)
[17:35:10] mforns: I have data for analysis of the K=100 thing, but didn't write the comment in phabricator yet
[17:35:18] joal, ok
[17:35:19] mforns: I'll do that tomorrow morning :)
[17:35:27] joal, cool!
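The "K=100 thing" referenced above is presumably a minimum-count threshold on the pageview API test data, and joal's promised analysis is about how much such a cutoff discards. A sketch of that measurement, assuming k means "drop rows with fewer than k views" — the semantics are a guess, not what joal actually computed:

```python
# Hypothetical sketch of measuring how much a k=100 cutoff impacts a
# dataset of (page, view_count) rows: what fraction of rows and of total
# views fall below the threshold and would be dropped. The meaning of
# k=100 here is an assumption.

def k_threshold_impact(rows, k=100):
    dropped = [(p, c) for p, c in rows if c < k]
    total_views = sum(c for _, c in rows) or 1
    return {
        "rows_dropped_pct": 100.0 * len(dropped) / max(len(rows), 1),
        "views_dropped_pct": 100.0 * sum(c for _, c in dropped) / total_views,
    }

sample = [("Main_Page", 500000), ("Obscure_Page", 12), ("Mid_Page", 340)]
print(k_threshold_impact(sample))
# rows_dropped_pct ~33.3, views_dropped_pct ~0.0024: many rows, few views
```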
[17:35:48] joal, I still have plenty of things to do :]
[17:35:52] Thanks for the patches, and sorry for forgetting to tell you about the comment
[17:36:02] mforns: I have no doubt ;)
[17:36:07] Analytics, Engineering-Community, Research-and-Data, ECT-July-2015: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1438403 (Halfak) @tgr if you can find a way to extract an OAuth ConsumerId from our "extensive machinery" around web requests, I'd love to h...
[17:36:26] joal, np
[17:40:10] (CR) Joal: [C: 2 V: 2] "Thanks Marcel :)" [analytics/aggregator] - https://gerrit.wikimedia.org/r/223577 (https://phabricator.wikimedia.org/T95339) (owner: Mforns)
[17:40:41] joal, thank *you
[17:40:52] No mforns, Thank you !
[17:40:59] :)
[17:44:53] Guys, unless any of you need help from me, I'll leave :)
[17:45:17] see you tomorrow!
[18:09:28] Analytics, Engineering-Community, Research-and-Data, ECT-July-2015: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1438547 (Tgr) >>! In T102079#1438403, @Halfak wrote: > @tgr if you can find a way to extract an OAuth ConsumerId from our "extensive machine...
[18:15:14] analytics1021: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 3.8467799701e-12
[18:15:23] since 8h 13m
[18:15:48] is the bot outputting it here ?
[18:15:51] it should
[18:18:18] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 26.67% of data above the critical threshold [30.0]
[18:19:24] Analytics-Kanban, Research-and-Data: Validate Uniques using Last Access cookie {bear} - https://phabricator.wikimedia.org/T101465#1438589 (ggellerman) @leila Thanks for exploring this possibility
[18:21:19] well, i see the bot works, so i guess i'm duplicating things
[18:21:42] assuming people read the bot messages
[18:22:00] because it happens like every day
[18:22:31] jgage is fixing, thanks!
[18:22:46] i've never gotten an answer about those eventlogging alerts, and they go to my phone. i vote for turning them off.
[18:23:13] oh you pasted the analytics1021 alert, that's different and actually valuable
[18:24:27] we've just received new hadoop nodes; once those are added to the cluster we'll convert one of the oldest-gen workers into a kafka broker and ditch problematic analytics1021
[18:37:47] Analytics, Engineering-Community, Research-and-Data, ECT-July-2015: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1438689 (Halfak) > Parse the Authorization header Solid idea. I wonder if we can also get the user_id/global_id using this type of strateg...
[18:42:48] Analytics, Editing-Department: Investigate drop-off in global edit save rate starting 27 June 2015 - https://phabricator.wikimedia.org/T105215#1438716 (Neil_P._Quinn_WMF)
[18:48:20] ACKNOWLEDGEMENT - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 100.00% of data above the critical threshold [30.0] daniel_zahn https://phabricator.wikimedia.org/T105216
[19:30:05] madhuvishy: I'm free anytime you're free
[19:39:28] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[19:50:59] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 33.33% of data above the critical threshold [30.0]
[20:09:03] milimetric: just got back from lunch. 2 minutes
[20:15:20] Analytics, Engineering-Community, Research-and-Data, ECT-July-2015: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1439000 (Tgr) >>! In T102079#1438689, @Halfak wrote: > Solid idea. I wonder if we can also get the user_id/global_id using this type of str...
[20:16:38] milimetric: batcave?
[20:16:52] omw
[20:21:39] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 26.67% of data above the critical threshold [30.0]
[20:49:42] jgage: were you and mutante talking about EL above? I couldn't decipher the opsy talk :)
[20:50:03] I'm confused about this alert and the fact that dzahn claimed it and put it under some weird Cassandra umbrella
[21:04:41] milimetric, are you looking at EL too?
[21:05:13] mforns: I just looked at the graphs and realized it's on the varnish side most likely
[21:05:23] nothing looks wrong with any of the EL workers
[21:05:25] aha
[21:05:33] so I figure just not enough events are getting through
[21:05:50] but that seems like an ops thing, and I don't really know who to talk to without otto around
[21:06:45] mmm
[21:13:29] Analytics-Kanban: Restart Pentaho - https://phabricator.wikimedia.org/T105107#1439195 (Milimetric) pentaho is inaccessible via ssh due to stale puppet probably. I didn't want to waste Yuvi's time with it so I didn't ask. I rebooted the instance but that didn't solve anything, the service didn't come up. We...
[21:16:52] leila: I added to sheets to the spreadsheet on mobile apps uniques. One for Android, and one for iOS
[21:16:58] its pretty interesting
[21:17:08] two sheets*
[21:17:19] thanks, madhuvishy. I'll look into it this afternoon. (in a meeting now)
[21:17:28] leila: sure :)
[21:21:49] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0]
[21:24:02] Analytics-EventLogging, Analytics-Kanban: Can Search up sampling to 5%? {oryx} - https://phabricator.wikimedia.org/T103186#1439244 (Milimetric) FYI this is still blocked on @Ironholds responding to my comment above (https://phabricator.wikimedia.org/T103186#1412355). Apologies for not pinging him correct...
[21:25:35] milimetric: In this task - https://phabricator.wikimedia.org/T101465#1436543.. Do you have some idea on leila's last comment? (It's about running a short duration unique token based test for validation)
[21:27:09] madhuvishy: I think that's a question for brandon black. From our end it's easy to write some varnish code, but it's up to him if he'd be ok with deploying it for a single day and then removing it
[21:27:39] milimetric: alright, I will ask him then
[21:28:13] so for us it would just be - set a one time cookie with a 1 day expiration date and a unique id. hm... well, that might be tricky in varnish too i'm not sure
[21:28:21] milimetric: I don't think we need to do global uniques - cos last access doesn't give us global counts - so may be even doing one domain would do
[21:28:25] but the code complexity isn't high
[21:28:41] ok, cool, you should mention that 'cause it'll keep the exposure low
[21:30:10] Analytics-Kanban, Analytics-Visualization: Update Vital Signs UX for aggregations {musk} [13 pts] - https://phabricator.wikimedia.org/T95340#1439266 (Milimetric) a:Milimetric
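The short-duration uniques test madhuvishy and milimetric discuss above boils down to: varnish sets a cookie with a unique id and a one-day expiry, and uniques for that day are the count of distinct ids in the request logs. A toy version of the counting half, with hypothetical field names (the varnish/VCL half is the part that would need brandon black's sign-off and isn't sketched here):

```python
from uuid import uuid4

# Toy sketch of the counting half of the 24-hour unique-token idea:
# varnish would set a cookie like uniq=<random id> with a 1-day expiry,
# and uniques for the day are the number of distinct ids seen in the
# request logs. The 'uniq_cookie' field name is hypothetical.

def count_daily_uniques(requests):
    """requests: iterable of dicts with a 'uniq_cookie' key (or None)."""
    return len({r["uniq_cookie"] for r in requests if r["uniq_cookie"]})

# Simulate 3 clients making several requests each within one day.
clients = [str(uuid4()) for _ in range(3)]
logs = [{"uniq_cookie": c} for c in clients for _ in range(5)]
print(count_daily_uniques(logs))  # 3
```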
[21:41:23] Analytics-EventLogging, Analytics-Kanban: Can Search up sampling to 5%? {oryx} - https://phabricator.wikimedia.org/T103186#1439313 (Ironholds) It was probably more to do with the fact that, as my OOO email made clear, I've been afk for a week and a half. I'll try to address this tomorrow (just clearing my...
[21:43:16] EventLogging graph is going back to normal...
[21:48:34] milimetric: thanks, sent email
[21:51:05] mforns: yeah, i just saw that, did you do anything?
[21:51:10] milimetric, no...
[21:51:26] :) great
[21:51:33] self-healing
[21:51:44] I was just checking the db, the inserts were normal during all this time, it seems like a graphite problem to me
[21:51:50] hehe, yea
[21:51:58] maybe the reporter then...
[21:52:43] milimetric, I checked the graphite-consumer in hafnium and it was fine
[21:54:06] did you check the eventlogging logs there?
[21:54:09] i'll take a look
[21:56:33] milimetric, yes, the graphite consumer log just contains the init log
[21:57:14] the /var/log/upstart/eventlogging_* logs haven't been updated in forever
[21:57:20] but it seems strange to me that in eventlog1001, the reporter logs stop like 2 days ago
[21:57:37] I can see the wrapped up logs, but not the current log
[21:57:52] yeah, but the zipped logs are from weeks ago
[21:58:03] you mean in hafnium?
[21:58:06] yea
[21:58:09] aha
[21:58:29] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0]
[21:58:29] just weird - i think we have no visibility in this. Obviously something bad happened in opsy world
[21:58:32] and in eventlog1001 the same happens with the reporter logs
[21:58:36] yea
[21:58:55] I asked on the ticket that this got added to, we'll see if they say anything
[21:59:30] milimetric, I assumed that in hafnium the graphite consumer log does not log anything but the initialization
[21:59:54] oh, that would make sense
[22:00:02] i figured it'd restart more often but maybe not
[22:00:05] so if the process does not restart for a long time, no logs are written, and the only thing that remains is old gzipped logs
[22:00:11] this would make sense to me
[22:00:30] but the reporter logs do print stuff other than the init
[22:16:20] Analytics-Backlog, Compact-Personal-Bar-(Beta): Delete all data from EventLogging:PersonalBar schema - https://phabricator.wikimedia.org/T105065#1439406 (gpaumier)
[22:17:41] milimetric, somehow I can not match any of the alerts with any anomaly in graphite graphs...
[22:32:11] mforns: they seemed to happen at the same time as the big drop in the overall.raw.rate
[22:32:14] did you look at that one?
[22:32:47] milimetric, yes, I could not find any matching drop
[22:33:07] lemme look at the alerts again
[22:33:09] there are lots of drops, but could not match them with the alert timestamp
[22:34:01] milimetric, if you want to look at Jul 5th 20:54:49 - 20:58:30 UTC
[22:34:14] I found some missing events in the db for this period
[22:34:38] the graphite charts are intact however...
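A hole like the Jul 5th 20:54-20:58 one mforns found is easiest to surface with a per-minute count over the window; since GROUP BY returns no row for an empty minute, the gaps are the *missing* keys. A sketch, where the table name and the 14-digit timestamp format are assumptions about the EventLogging database, not verified schema:

```python
from datetime import datetime, timedelta

# Sketch of finding holes like the Jul 5th 20:54-20:58 one: count events
# per minute in a window, then report minutes that returned no rows at
# all. Table name and the 14-digit timestamp format are assumptions.

GAP_QUERY = """
SELECT LEFT(timestamp, 12) AS minute, COUNT(*) AS events
FROM log.SomeSchema_12345678
WHERE timestamp BETWEEN %s AND %s
GROUP BY minute ORDER BY minute
"""

def missing_minutes(rows, start, end):
    """rows: (minute, count) pairs from GAP_QUERY; start/end: datetimes."""
    seen = {m for m, _ in rows}
    gaps, t = [], start
    while t < end:
        key = t.strftime("%Y%m%d%H%M")  # matches LEFT(timestamp, 12)
        if key not in seen:
            gaps.append(key)
        t += timedelta(minutes=1)
    return gaps

# e.g. missing_minutes(cursor.fetchall(),
#                      datetime(2015, 7, 5, 20, 54),
#                      datetime(2015, 7, 5, 20, 59))
```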
[22:37:32] mforns: http://graphite.wikimedia.org/render?from=16%3A00_20150708&until=22%3A00_20150708&width=400&height=250&target=eventlogging.overall.raw.rate&_uniq=0.9958936583716422&title=eventlogging.overall.raw.rate
[22:37:49] that drop corresponds with the icinga emails
[22:38:23] a little bigger: http://graphite.wikimedia.org/render?from=16%3A00_20150708&until=22%3A00_20150708&width=1200&height=650&target=eventlogging.overall.raw.rate&_uniq=0.9958936583716422&title=eventlogging.overall.raw.rate
[22:38:51] milimetric, wait but this drop is the one that just happened a while ago
[22:39:15] a while ago? It's today
[22:39:15] I think this is of a different nature than the others, that lasted for 2 minutes
[22:39:19] oh yea
[22:39:34] you're looking for the 2 minute ones?
[22:39:35] yes, I meant just now
[22:39:38] yes
[22:39:45] oh there's a new 2 minute one?
[22:40:09] I'm not seeing one, yea
[22:40:22] xD, no no. With "a while ago" I meant today like 40 mins ago
[22:40:35] I'm looking at Jul 5th 20:54:49 - 20:58:30 UTC
[22:40:47] this one is the biggest one, 4 minutes
[22:41:12] and I found a hole in the database (missing events) in those minutes
[22:41:19] yeah, the alarm seems wrong, this part looks ok to me: http://graphite.wikimedia.org/render?from=22%3A00_20150708&until=23%3A00_20150708&width=1200&height=650&target=eventlogging.overall.*.rate&_uniq=0.9958936583716422&title=eventlogging.overall.raw.rate
[22:41:23] but graphite chart is intact
[22:42:20] mforns: wait i'm not seeing a problem alert 40 minutes ago
[22:42:25] just the recovery message from icinga
[22:42:46] the last problem alert I have is 21:21
[22:42:59] sorry - july 5th
[22:43:43] oh i get the confusion, i thought you said you meant "just now" for the 2 minute alerts
[22:43:58] oh, no no
[22:44:00] bah, heh :) looking back in time now
[22:45:16] mforns: this is the one I see close to that time: http://graphite.wikimedia.org/render?from=19%3A00_20150705&until=20%3A00_20150705&width=1200&height=650&target=eventlogging.overall.*.rate&_uniq=0.9958936583716422&title=eventlogging.overall.raw.rate
[22:45:38] looks like 19:31 to 19:33
[22:46:01] looks like some sort of hiccup
[22:46:03] milimetric, ok, but that's like 1.5 hours before the alert
[22:46:09] yea
[22:46:16] so you're right, no match
[22:46:22] icinga was confused somehow
[22:46:49] milimetric, do you know from where exactly icinga gets the metrics?
[22:46:57] graphite
[22:47:00] mmm
[22:47:28] so theoretically if you plug in the numbers you get from that graph into whatever math icinga does, it should say yes throw an alert
[22:47:35] but the data might have changed
[22:47:41] remember how graphite sometimes updates late and weird
[22:47:45] aha
[22:47:48] this makes sense
[22:47:51] that could've happened here plugging up the hole
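The render URLs milimetric pastes above can also be fetched as JSON, which makes it possible to replay the alert window and recompute the check after the fact. If graphite backfilled late datapoints since icinga sampled it, the replay won't reproduce the alert — the "plugged up hole" theory. A sketch; the render parameters mirror the pasted URLs, while the threshold arithmetic is an assumption about the check, not its actual implementation:

```python
import requests

# Refetch an alert window from graphite as JSON and recompute how many
# datapoints sit above a threshold. If graphite backfilled late data
# since the alert fired, the recomputed figure won't match what icinga
# saw at the time. The threshold logic here is assumed, not the real
# check's implementation.

def pct_above(target, frm, until, threshold,
              base="http://graphite.wikimedia.org/render"):
    resp = requests.get(base, params={
        "target": target, "from": frm, "until": until, "format": "json",
    })
    points = [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]
    return 100.0 * sum(v > threshold for v in points) / max(len(points), 1)

# e.g. pct_above("eventlogging.overall.raw.rate",
#                "19:00_20150705", "20:00_20150705", threshold=30)
```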
[23:02:19] milimetric: this balanced consumer stuff is working great!
[23:02:27] no load tests yet but i love it
[23:03:11] that's cool. Yeah, I think adding something to Event Logging should feel like hitting the perfect sweet spot in tennis
[23:03:13] milimetric: http://i.imgur.com/o7MFD4e.png
[23:03:13] simple and powerful
[23:03:26] :) pretty - i love kafka, so smart
[23:03:43] https://www.irccloud.com/pastebin/FRiq2DOE/
[23:03:50] I pushed 200 events
[23:03:59] through the forwarder
[23:04:07] milimetric: :) yeah
[23:04:58] that's surprisingly good balancing for such a low number of events
[23:05:20] i'd expect it to be good on average as the number of events gets higher, but this is great
[23:10:21] milimetric: yup :)
[23:10:50] Analytics, Analytics-Backlog, Performance-Team, Patch-For-Review: Collect HTTP statistics about load.php requests - https://phabricator.wikimedia.org/T104277#1439579 (Catrope) As requested by @ori and @Krinkle, here's my wishlist of statistics to collect: 1. Collect # of requests/responses for each...
[23:23:38] Analytics, Analytics-Backlog, Performance-Team, Patch-For-Review: Collect HTTP statistics about load.php requests - https://phabricator.wikimedia.org/T104277#1439658 (Krinkle) For @Catrope's #1 wishlist item I'd recommend we fragment the existing metrics by an additional layer. I don't think we need...
[23:53:41] Analytics, Analytics-Backlog, Performance-Team, Patch-For-Review: Collect HTTP statistics about load.php requests - https://phabricator.wikimedia.org/T104277#1439760 (ori) Ok. Can you formulate a plan for how you'd act on this data? i.e., if the data showed X, we'd do Y; if it instead showed P, we'd...
[23:55:08] Analytics, Analytics-Backlog, Performance-Team, Patch-For-Review: Collect HTTP statistics about load.php requests - https://phabricator.wikimedia.org/T104277#1439765 (ori) (And also: can you say what the goals are, in terms of cache hit rates, or some other metric? Identify what ought to be possible...
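On the balanced-consumer exchange above: the near-even spread of jgage's 200 events is what Kafka consumer groups give for free, since a topic's partitions are divided among the group's members. A minimal sketch with kafka-python — the topic, group, and broker names are placeholders, and this is not the actual EventLogging consumer code:

```python
from kafka import KafkaConsumer

# Minimal consumer-group sketch: run this script twice with the same
# group_id and Kafka splits the topic's partitions between the two
# processes, which is the balancing effect jgage observed with his 200
# events. Topic, group, and broker names are placeholders.

consumer = KafkaConsumer(
    "eventlogging-valid-mixed",          # placeholder topic name
    group_id="eventlogging-demo",        # same group => balanced partitions
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.partition, message.offset, message.value[:80])
```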