[11:22:02] (CR) QChris: Adding coding guidelines to README.md file (5 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 (owner: Nuria)
[16:04:52] hey qchris_away, real quick, just confirming, kafkatee files look ok now, ja?
[16:24:09] ottomata: text is flowing into kafka?
[16:29:14] yup! :)
[16:29:19] and into hdfs + hive
[16:36:59] all hail the ottomata!
[16:42:30] woohoo
[16:43:13] wow -- ottomata: you just built a log ingestion system for the 5th largest web site on the planet
[16:43:22] that's pretty cool
[16:44:22] I would have expected the rps to be higher -- it's peaking at 70K/sec?
[16:44:50] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=cp.%2B&mreg[]=kafka.rdkafka.topics..%2B%5C.txmsgs.per_second&z=large&gtype=stack&title=kafka.rdkafka.topics..%2B%5C.txmsgs.per_second&aggregate=1&r=week
[16:46:45] hm interesting
[16:46:47] this should be equivalent
[16:46:47] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics102[12].*&mreg[]=kafka.server.BrokerTopicMetrics.%2B-MessagesInPerSec.OneMinuteRate&z=large&gtype=stack&title=kafka.server.BrokerTopicMetrics.%2B-MessagesInPerSec.OneMinuteRate&aggregate=1&r=hour
[16:47:21] I prefer your color scheme -- my link is an old bookmark -- should I replace it?
[16:47:51] both are good, just from different sides
[16:47:55] this is best
[16:48:02] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=kafka&hide-hf=false
[16:48:04] bookmark that
[16:48:07] easiest to look at
[16:48:13] that is data from kafka jmx
[16:49:44] oh, tnegrin, i think there are hosts missing from the varnishkafka view
[16:49:59] np
[16:50:00] yes of course
[16:50:02] thanks, can fix
[16:50:48] I'm fascinated -- why is there such a bump in bytes during the last few hours?
[16:54:00] this is better, http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=&vl=&x=&n=&hreg[]=%28amssq%7Ccp%29.%2B&mreg[]=kafka.rdkafka.topics.webrequest.%2B.txmsgs.per_second&gtype=stack&glegend=show&aggregate=1&embed=1&_=1398271937576
[16:54:04] for varnishkafka side
[16:55:20] wow -- I think the acid just kicked in ;)
[16:55:27] haha
[16:55:40] that's the manager view then?
[16:55:54] naw, stick with that kafka view link I gave you
[16:55:59] that is the easiest to look at
[16:56:10] i'm fixing the varnishkafka side so that it has all the hosts in it
[16:56:30] ok -- good deal
[16:56:47] let the analysis begin!
[16:57:00] but, whatcha mean bump? high load time is normal bump time, ja?
[16:57:18] that's actually one reason that the varnishkafka view is kinda cool
[16:57:25] you can see the lines from the different timezones fluctuating
[16:57:39] since each metric is from a different host, some in sf, some in va, some in ams
[16:58:24] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&hreg[]=analytics102%5B12%5D.%2A&mreg[]=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&gtype=stack&title=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&aggregate=1
[16:58:37] yes -- that is cool
[16:59:11] in bytes or messages out, you can see camus running hourly
[16:59:21] that's out, not in
[16:59:21] got it
[16:59:45] very interesting -- so it takes camus about 30 mins to run?
[17:00:09] yup
[17:00:27] so 30 mins to copy 1 hour's worth of data
[17:01:05] so if we fall behind, we catch up at 2x speed -- scary
[17:01:08] I love big data
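
The catch-up arithmetic above is worth spelling out: if Camus copies an hour of data in roughly 30 minutes, it drains data at about twice the rate it arrives, so a backlog shrinks by about one hour of data per wall-clock hour. A minimal sketch of that model (the 2x figure comes from the conversation, not from measurement):

```python
# Back-of-the-envelope recovery model for an hourly batch copier like Camus.
# Assumption from the chat: 1 hour of data is copied in ~30 minutes,
# i.e. the copier runs at ~2x the rate data arrives.

def catch_up_hours(backlog_hours, copy_speedup=2.0):
    """Wall-clock hours needed to drain a backlog.

    While catching up, new data keeps arriving at 1x, so the backlog
    shrinks at (copy_speedup - 1) hours of data per wall-clock hour.
    """
    if copy_speedup <= 1.0:
        raise ValueError("copier no faster than arrival rate; backlog never drains")
    return backlog_hours / (copy_speedup - 1.0)

# At 2x, a 6-hour outage takes about 6 wall-clock hours to recover from.
print(catch_up_hours(6))  # -> 6.0
```
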
[17:03:37] ottomata: Yes, the kafkatee files look fine now (one hiccup yesterday around 18:00 UTC).
[17:03:54] But as I said on the bug ... how to guard against this in future?
[17:04:02] Might that just happen again to us?
[17:04:09] canary?
[17:04:17] tnegrin: :-D
[17:04:34] seriously -- I think that's the only way to know for sure
[17:04:52] I think the canary system will be great.
[17:05:12] However, I am not sure ... will we have it done in the next half year?
[17:05:21] (Not meant to be sarcastic)
[17:05:32] qchris, I just got this working: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=kafkatee&hide-hf=false
[17:05:45] not sure if it will help or not, but if it happens again maybe we can see what the problem is there
[17:05:48] uh, talk to the product owner
[17:05:53] (dunno what that big bump is there...)
[17:06:05] tnegrin: :-P
[17:06:19] qchris: I think we can actually
[17:06:35] tnegrin: Awesome!
[17:06:41] uh oh
[17:06:44] Then yes. Let's do the canary thing.
[17:06:45] did I just make a commitment
[17:06:59] :-D
[17:07:05] we need to be honest about the operational status of this system
[17:07:11] people want data and we can give it to them
[17:07:33] but there is a ways to go before we can commit to uptimes/consistency and so forth
[17:07:43] we will be clear about our capabilities
[17:08:04] I don't want to back into high SLAs without the necessary infra
[17:08:04] Yes. It will take some time to build.
[17:08:12] :-)
[17:09:25] ottomata: Those graphs look great! I still lack understanding of them. ... but yes, they will help.
[17:30:07] ottomata: why is next_offset.per_second so low? averaging around 5-6kmsgs/s
[17:30:31] should be 70kmsgs/s, yes no yes?
[17:30:42] (e.g., symmetric to the producer side of things)
[17:31:00] it's just mobile
[17:31:08] kafkatee is just consuming mobile right now
[17:39:46] ah, okay
[17:40:37] ottomata: so if kt provided its own stats counting the number of processed messages, we could rule out any drops between broker and kt by comparing to next_offset.
[17:42:33] (CR) Milimetric: [C: 2] Update the ssh script [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/127841 (owner: Jdlrobson)
[17:43:14] ja that would be cool, Snaps
[17:44:46] (CR) Milimetric: [C: 2] "I didn't test the graph and datasource but I promise to fix if my merge breaks things :)" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/127842 (owner: Jdlrobson)
[21:12:27] (PS5) Milimetric: Fix user name display in CSV files [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129025
[21:12:35] ew ^
[21:13:05] had to write commented-out tests because puppet would need to be updated first before I can properly protect against the bug we saw (from tests)
[22:59:31] (PS1) JGonera: Update MobileWebUploads schema tables [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/129319
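
Snaps' suggestion at 17:40 amounts to a simple end-to-end accounting check: if kafkatee exported its own count of processed messages, it could be compared against the broker-side offset movement (next_offset) over the same window. A hypothetical sketch of that comparison; the names below are illustrative, not kafkatee's actual statistics format:

```python
# Hypothetical broker-vs-consumer accounting check, per partition.
# start_offset/next_offset would come from rdkafka's reported statistics;
# processed_msgs would be a counter kafkatee maintains itself.

def check_partition(start_offset, next_offset, processed_msgs, tolerance=0):
    """Flag a gap between offset movement on the broker and messages seen locally."""
    expected = next_offset - start_offset
    missing = expected - processed_msgs
    if missing > tolerance:
        print(f"possible drop: broker advanced {expected} msgs, "
              f"consumer processed {processed_msgs} ({missing} unaccounted for)")
    return missing

# e.g. offsets advanced by 1,000,000 but only 999,500 messages were processed:
check_partition(41_000_000, 42_000_000, 999_500)
```

Matching numbers rule out loss between the broker and kafkatee; a mismatch localizes the problem to that hop.
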
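The canary system is only named in the conversation, not designed; one common shape for such a check (an editor's sketch, not the team's actual plan) is to inject requests carrying a known marker at a known rate at the front of the pipeline and count them at the far end. The endpoint and tagging scheme below are invented for illustration:

```python
# Hypothetical canary injector: fire tagged requests through the pipeline
# under test; a downstream count over the same window should match exactly.
import time
import urllib.request

CANARY_URL = "https://example.invalid/canary"  # hypothetical endpoint

def send_canaries(n, interval_s=1.0):
    """Send n marker requests, one every interval_s seconds."""
    for i in range(n):
        url = f"{CANARY_URL}?seq={i}&ts={int(time.time())}"
        try:
            urllib.request.urlopen(url, timeout=5)
        except OSError:
            pass  # a failure here is itself worth logging/alerting on
        time.sleep(interval_s)
```

A downstream query counting requests for that URL and time window should return exactly n; any shortfall both detects loss and quantifies how much was dropped between varnish and HDFS.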