[05:45:21] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526649 (awight) > That probably made sense in 2006, when the article that SO post is based on was... [05:56:47] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526651 (BBlack) Even if browsers allow >2K URLs, they seem like a poor idea in general. Even a 1... [06:44:47] Analytics-Wikistats: Adding Odia (Oriya) Wikisource to Stats Wiki - https://phabricator.wikimedia.org/T108012#1526669 (psubhashish1) [[ https://phabricator.wikimedia.org/p/Aklapper/ | @Aklapper ]]: Sorry for that. I was not aware as I was reporting a bug of this kind for the first time. Thanks for helping ou... [07:15:34] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526691 (Tgr) So why don't we just use POST? `sendBeacon` actually does that, we just abuse it cur... [07:31:52] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526718 (BBlack) Probably because beacon is used with the analytics pipeline rather than the appse... [09:16:10] Analytics-Tech-community-metrics, Engineering-Community, ECT-August-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1526787 (Aklapper) p:Normal>High [09:16:16] Analytics-Tech-community-metrics, Engineering-Community, ECT-August-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1526789 (Aklapper) p:High>Normal [09:21:46] Analytics, MediaWiki-API, Reading-Infrastructure-Team: Load API request count and latency data from Hadoop to a dashboard (limn?) - https://phabricator.wikimedia.org/T108414#1526797 (ArielGlenn) [09:25:12] Analytics-Tech-community-metrics, Engineering-Community, ECT-August-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429335 (Aklapper) Discussed in our meeting: * Bitergia provides the automated process to gather data from Git/Gerrit and have a list upda... [09:36:54] Analytics-Tech-community-metrics, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1526816 (Qgil) Resolved>Open Sorry, I have to reopen. In July, FSF has -2 and Deutsche Telekom has -1. This means that the... 
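On the 1014-byte truncation discussed in T91347 above: EventLogging sends each event as URL-encoded JSON in the query string of the beacon request, so the limit that matters is the encoded length, not the raw JSON size. A standalone sketch of that check (the capsule fields and exact serialization below are assumptions for illustration, not the actual EventLogging client code):

import json
from urllib.parse import quote

TRUNCATION_LIMIT = 1014  # bytes of query string that survive, per T91347

def encoded_size(capsule):
    """URL-encoded size of an event capsule as it would appear in the beacon query string."""
    return len(quote(json.dumps(capsule, separators=(',', ':')), safe=''))

# Hypothetical capsule, for illustration only.
capsule = {
    'schema': 'MobileWebUIClickTracking',
    'revision': 12345678,
    'wiki': 'enwiki',
    'event': {'name': 'hamburger-menu', 'mobileMode': 'stable'},
}
if encoded_size(capsule) > TRUNCATION_LIMIT:
    print('this event would arrive truncated')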
[09:37:09] Analytics-Tech-community-metrics, ECT-August-2015, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1526819 (Qgil) [09:46:19] Analytics-Tech-community-metrics, ECT-August-2015: Jenkins-mwext-sync appears in "Who contributes code" - https://phabricator.wikimedia.org/T105983#1526830 (Qgil) a:Dicortazar [09:47:57] Analytics-Tech-community-metrics, ECT-August-2015: Tech metrics should talk about "Affiliation" or "Organization" instead of companies - https://phabricator.wikimedia.org/T62091#1526841 (Aklapper) [09:48:13] Analytics-Tech-community-metrics, ECT-August-2015: Tech metrics should talk about "Affiliation" or "Organization" instead of companies - https://phabricator.wikimedia.org/T62091#659085 (Aklapper) (updated summary as per last comments) [09:51:26] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1526845 (Aklapper) [12:45:58] Anyone know who the phabricator admins are? [12:46:02] greg-g, you look plausible [13:44:26] morning ottomata [13:44:31] I've been looking at the loss [13:44:50] it looks like we have duplicates on a few hosts, and loss on cp1008 [13:45:07] cp1008.wikimedia.org, that is [13:45:40] morning [13:45:47] cp1008.w sounds like a weird host [13:45:54] i am looking at lots of stuff too [13:45:59] let's sync up in a bit [13:46:02] with joal too [13:46:08] Heya [13:46:15] give me 5 minutes please :) [13:47:26] k [13:48:52] batcave ? [13:49:26] ja i need more than 5 [13:55:41] Analytics: Transform to XML-->JSON in sorted file format - https://phabricator.wikimedia.org/T108684#1527274 (Halfak) NEW [13:56:52] FYI joal, https://phabricator.wikimedia.org/T108684 [13:57:10] ottomata: cp1008 gets a tiny number of requests, the diff is like 4 events, 6 events 16 events, etc. [13:58:02] halfak: Thanks ! [13:58:20] Will move that in our boards :) [13:58:23] halfak: --^ [13:58:31] :D [13:58:46] Analytics, Analytics-Backlog: Transform to XML-->JSON in sorted file format - https://phabricator.wikimedia.org/T108684#1527285 (JAllemandou) a:JAllemandou [13:59:20] halfak: I move the old one to done [13:59:24] With a comment [13:59:38] ok ready for batcave! [14:03:04] Analytics-Cluster, Analytics-Kanban: Read wiki dumps in Spark {hawk} - https://phabricator.wikimedia.org/T92385#1527305 (Halfak) I just talked to @JAllemandou and it looks like he's gone as far as he can with spark. We're able to read and extract JSON from XML dumps at high speed, but we're not able to u... [14:17:09] joal: http://grafana.wikimedia.org/#/dashboard/db/kafkatest [15:01:51] Ironholds: releng, generally. Mukunda (twentyafterfour ) and andre specifically [15:02:07] (andre isn't part of releng, but whatevs) [15:02:11] greg-g, thanks! It got resolved :) [15:31:42] halfak: no update on my side, except if you need me, I'll skip the altiscale meeting [15:32:00] joal, could you respond to Soam [15:32:05] 's email re. use of spark? [15:32:10] I will ! [15:32:17] Otherwise, I'll kill the meeting. [15:49:51] (CR) Milimetric: "I think the idea with Sunday was that data would be ready to look at on Monday when people got back to work. 
That seems like a good enoug" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/230649 (https://phabricator.wikimedia.org/T108593) (owner: Mforns) [16:00:23] (PS2) Mforns: Add support for weekly frequency and granularity [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/230649 (https://phabricator.wikimedia.org/T108593) [16:13:45] halfak: https://twitter.com/CompSciFact/status/631129335471190016 [16:13:55] halfak: makes me think of our work :) [16:14:14] :D [16:14:16] +1 [16:18:11] joal: still going to lunch soon, but you are totally right about kafka log file sizes [16:18:20] 2015-08-10T08 53151892928 [16:18:20] 2015-08-10T09 55836248198 [16:18:20] 2015-08-10T10 57984054557 [16:18:20] 2015-08-10T11 63353197416 [16:18:20] 2015-08-10T12 68184938548 [16:18:21] 2015-08-10T13 69259218741 [16:18:21] 2015-08-10T14 79567698089 [16:18:22] 2015-08-10T15 133643184876 [16:18:22] 2015-08-10T16 168515916825 [16:18:23] 2015-08-10T17 181394338213 [16:18:31] thats' summed logs sizes grouped by hour [16:21:09] ottomata: let me know when you're back from lunch [16:25:40] Analytics-EventLogging, Analytics-Kanban: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1527778 (mforns) a:mforns [16:44:14] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [16:46:23] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [17:01:26] joal: see the message about snappy? Do you know what version we're using? [17:04:07] aha! https://github.com/wikimedia/operations-debs-kafka/blob/a63cb8805fbdc80a053344d599e302e17eefa9e0/debian/source/include-binaries#L24 [17:04:16] so that's the problem then, maybe!! :) [17:06:00] we need to update to 1.1.1.7 and magic should happen and we should all be happy and drink ginger beer [17:06:05] https://issues.apache.org/jira/browse/KAFKA-2189 [17:19:27] milimetric: That looks to the point indeed ! [17:20:28] We need ottomata to actually havethe magic happen, but it seems nailed down :) [17:33:58] joal: meeting? [17:34:07] oops, yeah arriving [17:38:13] ottomata: see the snappy upgrade issue? [17:38:16] https://issues.apache.org/jira/browse/KAFKA-2189 [17:38:20] we need 1.1.1.7 [17:38:21] no tyet [17:38:22] not yet [17:38:26] just signing on [17:38:28] it looks very promising [17:38:32] someone replied to your message [17:39:18] We found that this is caused by messages being seemingly recompressed individually whoa [17:40:33] yeah :) [17:41:17] intersting, ok. the compression is done on the producers though, hm. [17:41:23] not with snappy java [17:41:31] we did upgrade snappy versions on the brokers [17:41:39] but only to 1.1.1.6 right? [17:41:42] yes [17:41:58] the one guy mentioned they saw the improvement from 1.1.1.6 yo 1.1.1.7 [17:41:59] i don't have a full understanding of how all that works, so its worth a try on one kafka broker to see if log sizes change [17:42:01] right [17:42:03] but [17:42:07] he's using the java clients [17:42:11] to compress the messages [17:42:23] the snappy compression is done by librdkafka via varnishkafka [17:42:28] before the messages get to the brokers [17:42:32] i *think* [17:42:33] pretty srue. [17:42:46] hm... and they don't decompress them all the way to hdfs? 
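The per-hour totals ottomata pastes above (2015-08-10T08 53151892928, and so on) are Kafka log segment sizes summed by hour. A hypothetical reconstruction of that check, assuming segments are grouped by their modification hour and that LOG_DIR points at the broker's log.dirs:

import os
from collections import defaultdict
from datetime import datetime

LOG_DIR = '/var/spool/kafka'  # assumption; use the broker's actual log.dirs

sizes = defaultdict(int)
for root, _, files in os.walk(LOG_DIR):
    for name in files:
        if not name.endswith('.log'):  # only segment data files, skip indexes
            continue
        st = os.stat(os.path.join(root, name))
        hour = datetime.utcfromtimestamp(st.st_mtime).strftime('%Y-%m-%dT%H')
        sizes[hour] += st.st_size

for hour in sorted(sizes):
    print(hour, sizes[hour])

If the snappy recompression bug is the culprit, the per-hour totals should jump around the time of the broker upgrade and shrink again once a fixed snappy-java is in place.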
[17:42:53] consumers decompress [17:42:59] it should be stored as compressed, i think [17:43:01] but i'm not 100% [17:43:14] and its pretty easy to try on a singel broker and see [17:43:20] can drop in the jar and restart [17:43:25] cool, yeah [17:43:58] but maybe we need to upgrade the snappy varnishkafka uses too? [17:46:36] Analytics-EventLogging, Analytics-Kanban: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1528116 (mforns) **Current hypothesis:** The metric's poor sync creates the false alerts. //The raw vs validate... [17:47:55] hm, unlikely milimetric, libsnappy hasn't changed there [17:52:57] ok, i'm going to turn camus off again to let teh ISRS be good before I try to restart a broker [18:06:08] ottomata: cave ? [18:08:40] joal: sure my internet is kinda slow at this cafe [18:08:47] ottomata: np, later [18:08:53] I'd like to follow on kafka [18:09:02] need to take of the baby though for a moment [18:09:06] will be back soon [18:10:04] ok [18:12:06] Analytics-EventLogging, Analytics-Kanban: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1528316 (mforns) I think the simplest option is to **modify the percentage of data points needed to trigger the... [18:12:48] oof, milimetric, joal, replcas are not catching up, even with camus off now. [18:12:52] i think traffic is too high for them. [18:13:19] i can't turn a broker off now beacuse each partition in text/upload only has one leader [18:13:32] wanted to restart one with new snappy lib [18:13:34] hm. [18:13:51] could possibly bring new brokers into the mix, and move partitions over? [18:13:55] with the new snappy version. [18:14:04] maybe just add one, and try to move a single partition? [18:16:09] haha, i want to do this. someone come tell me it isn't insane! [18:17:34] ottomata: um... why not just turn camus back on? [18:17:44] i just turned it off again [18:17:45] you mean? [18:17:46] i did. [18:18:02] turning on camus makes things lag more [18:18:03] i'm confused, cave? [18:18:06] ja [18:18:30] brt [18:19:30] dan, am going to switch to phone internet, cafe internet too slow [18:20:17] k [18:20:43] :) hi joal, we're about to hang out [18:20:47] :) [18:21:41] yargh phone is really slow too! [18:21:47] 4G! where's my LTE?! [18:21:47] gahhh [18:23:07] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1528355 (mforns) If we merge that, we should test that the alert is still working by for ex... [18:31:56] so if the replicas were less busy serving camus [18:32:02] then they'd have more time to catch up [18:32:05] and become back in sync [18:32:45] so possible idea: turn camus off for like 10 minutes, see if the replicas are doing better, and then restart a broker with the new snappy [18:36:55] ottomata2, if we wanted maps data in HDFS (it's streaming through kafka but apparently not in Hadoop?) who would we tag on that ticket? [18:37:10] like, is this for the maps engineers, analytics engineers, ops generally, analytics ops specifically, maps ops specifically..? 
[18:37:36] Ironholds: me, or joseph, but i wouldn't count on it at the moment, we are havin gserious kafka issues after this upgrade, trying to fix things [18:37:40] i have camus turned off at the moment, trying to figure this out [18:38:00] i made a patch to do it, but fixin gthis is #1 priority [18:38:18] ottomata2, eep. Makes sense! [18:38:40] good luck! [18:38:48] Ironholds: you can tag Analytics-Backlog and it'll always come to our attention, we groom weekly [18:39:12] yurik: ^^^ [18:39:20] milimetric, great :) [18:40:01] thanks! [18:41:27] Analytics-Backlog: Stream maps cluster requests into HDFS - https://phabricator.wikimedia.org/T108717#1528428 (Ironholds) NEW [18:41:39] Analytics-General-or-Unknown: Statistics for Wikidata API usage - https://phabricator.wikimedia.org/T64873#1528435 (Addshore) a:Addshore>None [18:42:41] (PS2) Yurik: Make camus import webrequest_maps from new maps varnish cluster [analytics/refinery] - https://gerrit.wikimedia.org/r/230535 (https://phabricator.wikimedia.org/T105076) (owner: Ottomata) [18:46:51] Analytics-Backlog, Discovery-Maps-Sprint, Patch-For-Review: Stream maps cluster requests into HDFS - https://phabricator.wikimedia.org/T108717#1528457 (Tfinc) [18:53:27] ottomata2: we can just talk in IRC, phone's a bit hard to hear anyway [18:53:52] yeah [18:53:53] k [18:54:05] so if you refresh http://grafana.wikimedia.org/#/dashboard/db/kafkatest [18:54:09] i added a text-7 log size graph on the lower right [18:54:21] hmm, is that mesages or bytes? [18:54:24] lemme see.. [18:55:44] hm not sure [18:56:06] if it was bytes, then if this bug is the culprit, we will see an22's size for this partition go down over time [18:56:13] maybe for all of them thouhg, since an22 is the leader [18:56:22] no idea how this snappy stuff would be relevant relaly though [18:56:24] i'm going to reply to that emamil [19:00:12] well, ottomata the messages in per second and bytes out per second dropped by a similar factor on an22 [19:00:20] so we can keep comparing that as it catches up [19:00:31] and what we're looking for is the network dropping and the messages rising back to normal, right? [19:01:56] yes, that and smaller log size [19:02:26] I am looking at disk read/write for 1022 [19:02:34] see if it changes [19:03:23] yep, we gotta give it some time for all those things, while it catches up [19:05:57] ha, actually, snappy support is built in with librdkafka, doesn't rely on external c lib [19:06:01] https://github.com/edenhill/librdkafka/blob/master/src/snappy.c [19:06:21] hum [19:06:47] i was double checking snappy stuff there, and libsnappy1 isn't installed, and iwas liek UhhHHh how did this ever work. but, thats how! :)_ [19:06:58] right [19:07:08] line 6 makes me fear :) [19:08:58] sorry, my read/write graphs aren't good in that dash joal [19:09:02] they are for only one broker [19:09:11] ottomata: I added my own :0 [19:09:14] will add other brokers, i think i can sum them [19:09:15] ok nice! [19:09:37] no drop in disk write [19:15:31] seems like a small improvement to me [19:15:46] because network out has gone down by about maybe 60% [19:16:00] while messages in is only down around 35% [19:16:30] Guys, need to get diner ! [19:16:48] Will be back after [19:17:09] bon apetit [19:17:21] ottomata: unless you think this change was harmful, which it doesn't seem to have been, I'm for doing it with the other brokers [19:17:36] are the partition leaders roughly balanced now? 
[19:18:04] no, but they are very very slightly better [19:18:10] but that could be jsut because we aren't runnign camus [19:18:19] network read is going to go down by a lot because we aren't running camus [19:18:36] i want to see a change in log files sizes [19:18:46] that's the only thin gthat will really convince me, because all the other stats ahve lots of stuff flowing through them [19:19:22] milimetric: i think the stas you are looking at are down because when we restarted an22, it lost leadership for lot sof partitions [19:20:03] you can see that manifested in the messages in per second at the top of the dash [19:20:05] i'm looking at the proportion drop in messages in compared to network out [19:20:11] an22 is handling a smaller proportion [19:20:15] ok [19:20:18] so what i'm saying is messages in went down by around 35% [19:20:25] but network out went down by a higher percentage [19:20:32] meaning maybe it's a good sign [19:22:08] hm ok [19:22:14] weird that the log size on 1018 went down when you bounced 1022... [19:22:18] welllllll but network out also includes replication from 2 other brokers [19:22:56] ah, cool, so then the small difference in relative drops is even better [19:22:58] milimetric: i think that's because of the truncation [19:23:03] 1018 was also following 1022 [19:23:03] oh ok [19:23:08] makes sense [19:23:13] looks like 12 became the leader [19:23:15] and it was behind [19:26:26] milimetric: i think that truncation was a lot of data [19:26:37] few hours worth for those partitions [19:26:41] oh, wow [19:27:06] but most of that was already ingested via camus right? [19:27:13] except the last 20 minutes when it was not running? [19:27:35] not sure, camus was lagging too because we had turned it off for a bit [19:28:28] before you restarted 1022, jo said camus had finished running [19:28:48] can it finish before ingesting everything? Does it have a max ingestion size or something? [19:29:53] Hey ottomata, milimetric! If u have a sec, got some more quick questions here... Did I understand correctly yesterday that it's possible to send data to Kafka directly from the browser? (Just looking in the logs of our chat, but I couldn't find it...) [19:29:54] Analytics, Reading-Web: make MobileWebUIClickTracking schema usable - https://phabricator.wikimedia.org/T108723#1528745 (Jdlrobson) [19:30:00] Analytics, Reading-Web: make MobileWebUIClickTracking schema usable - https://phabricator.wikimedia.org/T108723#1528729 (Jdlrobson) We already sample. We can sample more but this is going to be a lot of data regardless of what we do. The main issue you are having with sampling is purging is not happening... [19:30:33] milimetric: i think it has a max run time [19:30:41] and will quit and let a new instance start up [19:30:44] AndyRussG: yes, it's possible, but right now we're still messing with some prod issues. I'll side-chat you :) [19:31:03] # Max minutes for each mapper to pull messages (-1 means no limit) [19:31:03] # Let each mapper run for no more than 55 minutes. [19:31:03] # Camus creates hourly directories, and we don't want a single [19:31:03] # long running mapper keep other Camus jobs from being launched. [19:31:03] kafka.max.pull.minutes.per.task=55 [19:31:09] k, hm [19:31:21] AndyRussG: not from the browser, no, but maybe! [19:31:24] so we'll see that in the webrequest stats [19:31:56] you won't be able to 'send directly to kafka' from the browser, but the new event system will support something like this. probably in the context of eventlogging, but maybe not. 
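To put numbers on the comparison milimetric makes above: if an22's messages in dropped about 35% (mostly from losing partition leadership on restart) while its network bytes out dropped about 60%, then bytes out per message fell to roughly 0.62x of its previous value. A back-of-the-envelope check, illustrative only since network out also carries replication fetch traffic:

# Percentages quoted in the channel; the dashboards are the source of truth.
messages_in_drop = 0.35   # relative drop in an22 messages in per second
network_out_drop = 0.60   # relative drop in an22 network bytes out

bytes_per_message_ratio = (1 - network_out_drop) / (1 - messages_in_drop)
print(f'bytes out per message ~{bytes_per_message_ratio:.2f}x previous')  # ~0.62x

A ratio below 1 is consistent with the new snappy compressing better, but it is only a hint; the log-size comparison that follows is the more convincing measurement.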
[19:32:30] ottomata: what about posting something and having varnish-kafka send it to kafka? [19:32:59] AndyRussG: that is possible, and how eventlogging on kafka will work soon. [19:33:04] not POST though, since eventlogging is query data [19:33:22] haven't tried post with varnishkafka, dunno if/how that is available to the varnishlog api, but i would think it would be [19:34:04] milimetric: it is still yet hard to know for sure, but i thikn logs are smaller. [19:36:12] i think we'll have to wait until this hour is out for a good comparison. [19:36:15] Analytics-Kanban: Check and potentially timebox limn-flow-data reports {tick} [5 pts] - https://phabricator.wikimedia.org/T107502#1528764 (mforns) a:mforns [19:37:18] yeah, ottomata, the messages in per second recovered on 22 and the network out stayed low [19:37:43] hm also, milimetric, the partitions for which 22 is the leader have a full ISR now [19:37:46] that is a good sign [19:37:46] that makes me fairly optimistic about trying this with other brokers. Though what you said about the data loss is sad [19:37:53] cool [19:41:32] milimetric: i'm fairly certain this is working. [19:41:50] i'm comparing 2 upload partitions [19:42:00] one has 22 as an in sync replica, but leader is 18 [19:42:06] the other has 22 as the leader [19:42:40] the one where 22 is the leader is 1/4 the size of the one where 18 is hte leader [19:42:50] which, is the compression ratio i expect from snappy [19:43:09] snappy compresses our json about 25% of orignal size [19:43:16] makes sense to me [19:43:24] at least, it did when i checked that out years ago :) [19:43:32] that's the same thing I'd guess from this relative percent calculation i'm doing [19:43:37] so um, ok. the right thing to do is to actually build a new kafka package with this version [19:43:58] yes [19:44:02] i think i want to do that rather than manually apply this again [19:44:15] because broker restarts hurt righ tnow, and I'd rather the restart we do just have it bring in the right thing [19:44:19] without having to do a restart later... [19:44:20] hm [19:44:28] although, we are planning on upgrading these to jessie [19:44:32] one everything settles [19:46:26] so, i could just replace the jar now and restart brokers and keep fingers crossed. [19:46:29] naw, lemme rebuild... :) [19:47:07] yeah, more sanity is best here I think [19:47:29] we can retrace our steps easier this way too [19:48:54] key part from the email response ottomata: "load generator" [19:49:00] they have a load generator!! :) [19:49:10] lucky dogs [19:49:14] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [19:49:27] its not just load, milimetric. i think you are right that having an awesome staging env for kafka and hadoop will be very useful [19:49:42] i just think it will take a lot of work to build something that will be a useful comparison to prod [19:49:47] not somehting we can jjust spin up in labs in an afternoon righ tnow [19:49:51] yes [19:50:07] but worthwhile, this environment is complex enough that we'll have these issues again [19:51:00] Analytics, Reading-Web: make MobileWebUIClickTracking schema usable - https://phabricator.wikimedia.org/T108723#1528810 (bmansurov) I agree, we should aggregate the old data and then purge it. 
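A hypothetical version of the two-partition size comparison ottomata describes above, assuming partitions live at <log.dir>/<topic>-<partition>/ and that the two partitions compared carry similar traffic (the paths and partition numbers below are made up):

import os

def partition_size(path):
    """Total bytes of all files in one topic-partition directory."""
    return sum(os.path.getsize(os.path.join(path, f)) for f in os.listdir(path))

# One partition whose leader is the upgraded broker (an22), one whose leader
# still runs the old snappy-java; both paths hypothetical.
upgraded = partition_size('/var/spool/kafka/webrequest_upload-3')
old = partition_size('/var/spool/kafka/webrequest_upload-7')
print(f'upgraded-leader partition is {upgraded / old:.2f}x the old-leader one')
# Roughly 0.25x would match the expected snappy ratio on webrequest JSON.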
[19:51:34] yeah [19:53:24] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [20:13:23] milimetric: this morning, did you say, you wanted me to remove the [WIP] tag from the task before you can review it? [20:13:49] madhuvishy: yeah, I merged it though, you removed it, right? [20:14:22] milimetric: yes i removed it. but i thought you had comments, and i hadn't addressed Marcel's comments on the previous patchset either [20:14:25] man ok, internet too slow to do this, heading home, milimetric you around for a bit, right? might want sanity partner shortly [20:14:33] (Abandoned) Milimetric: Disable 19 queries from the scheduler [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/210364 (https://phabricator.wikimedia.org/T98979) (owner: Milimetric) [20:14:41] oh wait, phone looks stronger, gonna try that real quick [20:14:56] ottomata: yes, i'm here [20:15:09] madhuvishy: sorry! [20:15:18] I thought once you removed it you meant it was ready :) [20:15:32] mforns: you gotta -1 things if you want to block the merge [20:16:00] madhuvishy: feel free to submit another change [20:16:08] milimetric: no problem. ya that's where i got confused. I thought you were saying, you would review only if i removed WIP [20:16:30] ok, this is better! [20:16:34] it's true, but if someone else reviewed and found a problem, they need to -1, otherwise others reviewing don't know [20:16:43] ottomata1: good stuff [20:17:29] milimetric: ahh, that's why. that's okay. mforns suggested we name it queuedForInsertion rather than insertAttempted [20:17:48] ok, milimetric in order to do this i will have to restart brokers, and the same thing is going to happen [20:17:58] madhuvishy: i'm ok with either of those names [20:18:00] we will lose dat ain kafka, and this time, we aren't running camus, so its gonna disappera [20:18:13] ottomata1: can we run camus for a bit, let it catch up? [20:18:24] hm, i don't think so. [20:18:28] i mean. hm [20:18:30] i'm not sure [20:18:51] but i don't think so, as it just causes more load. and i don't know how long it was lagging either, hm, lemme look at something [20:19:03] right, so we'd lose a few hours [20:19:42] ja, last hour we have data for right now is 17 [20:19:48] we are into 20 now [20:19:57] milimetric: i'm going to do a manual camus run and see how it goes [20:19:57] milimetric: okay, me too, i will leave it as is then. [20:20:11] it'll likely run for the full 55 minutes it is allotted, we'll see how much it imports during that time [20:20:39] ok, ottomata1 then we can calculate how long we'd have to wait for it to catch up fully and weigh the data loss against the possible risk of running as is [20:21:40] interesting! [20:21:46] 15/08/11 20:21:27 ERROR kafka.CamusJob: The current offset was found to be more than the latest offset [20:21:46] 15/08/11 20:21:27 ERROR kafka.CamusJob: Moving to the earliest offset available [20:21:58] that means that camus did import some of the stuff we lost from kafka when we restarted an22 [20:22:34] ah, cool [20:22:40] ottomata: Fun :) [20:23:02] ottomata, milimetric : I have backlogged, but what is current status ? [20:23:25] we're seeing how long it takes camus to catch up [20:23:36] :) [20:23:41] to consider when / how we upgrade all the other kafka brokers with the new snappy [20:23:41] joal: current status, looks like the snappy update fixes the problem [20:23:45] After having restarted all the beasts ? 
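On milimetric's "calculate how long we'd have to wait for it to catch up" point above, the catch-up time is the backlog divided by how much faster camus can consume than varnishkafka produces. All rates below are made up for illustration; the real ones would come from the Kafka dashboards and past camus run logs:

# Illustrative numbers only; substitute measured rates.
produce_rate = 150_000   # msgs/sec into the webrequest topics (assumed)
consume_rate = 400_000   # msgs/sec camus mappers can pull when healthy (assumed)
backlog_hours = 3        # how long camus has been off

backlog_msgs = produce_rate * backlog_hours * 3600
catch_up_hours = backlog_msgs / (consume_rate - produce_rate) / 3600
print(f'~{catch_up_hours:.1f} hours of camus runs to drain the backlog')
# If consume_rate is not comfortably above produce_rate, the backlog never drains.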
[20:23:48] log sizes for leader partitions on an22 are much smaller [20:23:51] no [20:23:53] no other restarts [20:23:56] ottomata: That is ggod news :) [20:24:05] ok cool [20:24:08] we are seeing if we can avoid losing data by running camus before we do a restart [20:24:14] hopefully we can get the data into hdfs from the leaders [20:24:19] dunno though, cause that will cause more IO on kafka [20:24:25] hmm .. [20:24:29] we are trying a manual camus run to see how it does [20:24:39] ok [20:24:45] For how long as been off ? [20:24:54] a while, several hours now [20:24:59] 3h roughly ? [20:25:14] yes [20:25:43] 2h actually [20:26:41] naw, this is no good really [20:27:25] hm [20:27:26] not sure [20:27:28] so, when camus runs [20:27:32] I suggest we restart the brokers before trying to kafka [20:27:45] network bytes in drop[s [20:27:57] which i guess could just be the replicas not fetching [20:28:11] but, coudl also be producers failling [20:28:13] not sure though. [20:28:18] I think having less data (sizewise) to sync + to send to HDFS is better [20:28:24] Even with a little bit more lag [20:28:32] i'll be in a meeting for a bit, btw [20:28:42] k milimetric [20:29:16] joal: huh? [20:29:19] you mean trying to camus? [20:29:30] the problem is, we can't restart brokers without truncating data [20:29:41] since many partitions only have one ISR - the leader [20:29:55] i that leader is restarted, another broker is promoted to leader, and it is far behind [20:30:01] possibly hours at this point [20:30:03] I mean I'd rather restart the brokers, ensure kafka cluster get's back into (kindda) stability, then restart camus [20:30:24] ottomata: true [20:30:36] joal: me too, but that would mean likley dropping several hours of data [20:30:58] Now it's poker game: either we wait, and expect the traffic downtime to allow the cluster to catchup and let us restart without too much dataloss [20:31:18] Either we restart now, because we think that thing will never catch up [20:31:31] I make my poker face on that call :) [20:32:41] seems like a big drop in network bytes in 5/10 minutes agon [20:32:45] normal ottomata ? [20:33:21] joal: that is me starting camus [20:33:26] wow [20:33:30] i *think* that is from replicas lagging [20:33:30] ok ! [20:33:44] few, big drop ! [20:33:55] it could also be from produce reqs failing from vks [20:33:55] but [20:34:00] i'm looking at them now, and i don't think so [20:34:28] ok [20:34:31] so no data lost [20:34:32] +1 to restart brokers guys, seems like camus is not likely to get back to normal with the current problems [20:34:34] cool [20:34:56] lets let it finish this run, it will only run for another 30 mins [20:35:03] and then lets see how much it imported in hadoop [20:35:22] hm, if we keep camus stopped, and wait for downtime, is there any chance kafka would catch up in the actual statte [20:35:24] if we think we can get it to import most of the last hours over a few runs, then lets do it [20:35:25] ? [20:35:26] ottomata: --^ [20:35:32] joal: possible, yes. 
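The reason a restart truncates data, as ottomata explains above, is that many partitions are down to a single in-sync replica, the leader itself. A rough sketch for listing those partitions from the describe output; it assumes the usual "Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ..." line format, and the `kafka topic --describe` wrapper mentioned below should print the same lines as upstream kafka-topics.sh:

import re
import subprocess

# Assumption: run on a broker; the zookeeper address is a placeholder.
out = subprocess.run(
    ['kafka-topics.sh', '--describe', '--zookeeper', 'localhost:2181'],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    m = re.search(r'Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+).*Isr:\s*([\d,]*)', line)
    if not m:
        continue
    topic, partition, leader, isr = m.groups()
    in_sync = [r for r in isr.split(',') if r]
    if len(in_sync) <= 1:
        # Restarting this partition's leader promotes a far-behind follower,
        # truncating whatever it has not replicated yet.
        print(f'{topic}-{partition}: leader {leader}, ISR {in_sync}')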
[20:35:46] i tend to think it will, because it was fine most of last night, until high load time now [20:35:53] so, yeah that is an idea too, wait until morning [20:36:05] In that case, wait, then restart when back in stability, then camus (without any other prod job in the cluster [20:36:20] possible that in morning, all ISRS will be fine, and we can restart brokers without losing data, and then run camus without hurting things [20:36:24] it will be a lot more for camus to import [20:36:34] yup [20:36:47] So it will mean tight cluster job management [20:36:53] yeha [20:36:59] lots of backed up jobs [20:36:59] only concern: [20:37:26] If it fails (no ISRS back in track tomorrow), then we loose that much of data in addition to what has already been lost now [20:37:32] yes [20:37:36] true. [20:37:42] * joal puts the poker face [20:37:52] milimetric: thoughts? [20:38:14] well, i mean, we will at least wait for this current camus run to finish [20:38:18] right [20:38:22] maybe 25 more mins [20:38:22] let's wait for that run [20:38:40] ok, i'm going to run home then while this runs [20:38:40] ottomata: what defines a camus run length ? [20:38:52] we'll discuss that later :) [20:38:57] you can get home ;) [20:39:00] https://github.com/wikimedia/analytics-refinery/blob/master/camus/camus.webrequest.properties#L63 [20:39:13] joal: also fine if you want to sign off for the eve. its getting late there [20:39:22] hm, seems like you answered that question not so long ago ! [20:39:26] :) [20:39:57] is set that limit so that a single slow map task wouldn't keep other new camus jobs from being launched [20:40:02] like, say text takes a long time [20:40:04] but mobile doesn't [20:40:10] i still want mobile to be imported regularly [20:40:15] makes sense [20:40:17] don't want it to block on a long text job [20:40:24] so starting a new camus job starts new mappers for all partitions [20:40:49] haha, joal! i need to add this to deployment plan. change camus properties kafka.brokers! :) [20:41:07] WOoooW ! Good call Sir ~! [20:42:22] i mean, we would have nnoticed and it wouldn't ahve hurt [20:42:35] we woulda been like "wheres that data?? OH WHOOPSIE!" [20:42:39] and then fixed but ja [20:42:45] yup [20:43:03] ok, running home, back shortly [20:52:16] milimetric, madhuvishy, sorry I was in a meeting with Jon Katz [20:52:56] sorry for the confusion, milimetric I know I should -1 to block changes, but it was a superficial comment, not critial [20:53:08] I'm also ok with the initial name [20:53:22] :] [20:57:54] mforns: no problem :) thanks [20:58:22] o/ [21:05:41] weird, network bytes out spiked again for an22 [21:05:55] i'll be back in a bit, gotta go grab some stuff [21:13:34] ottomata is back ! [21:13:36] heyo [21:13:54] camus still going? [21:14:03] yup [21:14:12] should be done soon i guess [21:14:14] but load is less than [21:14:18] bafore [21:14:30] network bytes in getting on the uptrend [21:14:49] Also, seems we have past the traffix top point [21:14:57] Should only get down from now on [21:18:22] Hadoop jobs are not too bad: only upload is late [21:18:26] (two hours) [21:18:36] ah ja ok camus done [21:18:44] cool [21:19:24] hm, yeah but 19 and 20 are much smaller than they should be [21:19:25] joal [21:19:26] for tet [21:19:28] text [21:19:32] 30G ./17 [21:19:32] 23G ./18 [21:19:32] 14G ./19 [21:19:32] 948M ./20 [21:19:37] 18 too i think [21:19:57] I don't get it [21:20:02] hm, bits imported [21:20:18] haha, no data in bits though [21:20:24] true ! [21:20:43] ok welp [21:20:45] now what to do. 
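The "30G ./17 ... 948M ./20" listing above is the check that shows hours 19 and 20 came in short. A hypothetical equivalent against the raw webrequest data in HDFS; the base path and directory layout are assumptions, adjust to the real camus output location:

import subprocess

BASE = '/wmf/data/raw/webrequest/webrequest_text/hourly/2015/08/11'  # assumed layout

def hdfs_du(path):
    out = subprocess.run(['hdfs', 'dfs', '-du', '-s', path],
                         capture_output=True, text=True, check=True).stdout
    return int(out.split()[0])  # first column is the size in bytes

sizes = {hour: hdfs_du(f'{BASE}/{hour:02d}') for hour in range(17, 21)}
biggest = max(sizes.values())
for hour, size in sorted(sizes.items()):
    flag = '  <- suspiciously small' if size < 0.5 * biggest else ''
    print(f'{hour:02d}: {size / 2**30:.1f} GiB{flag}')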
[21:21:00] What's the status of ISRS ? [21:21:00] run camus again? [21:21:19] btw joal you can check that too! [21:21:21] log into a broker [21:21:25] then [21:21:27] kafka topic --describe [21:21:54] joal: only upload and text partitions for which 22 is the leader have full ISRs [21:22:01] Cool, I didn't think I had the right to log onto those machines :) [21:22:41] some mobile partitions even only have one leader [21:22:48] probably the camus run did not help with replication though [21:23:29] I think you are right :) [21:23:30] milimetric: joal, i think i want to just restart brokers and deal with lost data. [21:23:41] mobile, text and upload are behind for ISRs [21:24:08] hm ... [21:24:23] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [21:24:58] ottomata: Your call, I wonder if letting things stabilise (no camus) wouldn't let us restart in a better state [21:25:18] it's hard to tell [21:25:20] But I'll follow you if you prefer to go for a restart now :) [21:25:55] Is there a way to judge how far behind the replicas are ? [21:26:01] ottomata: --^ [21:26:25] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [21:26:38] yeah hm i think so.. [21:30:31] i can estimate worst case based on max lag on a replica [21:30:34] ... [21:31:27] joal: on analytics1012 [21:31:31] i think 2.5 hours [21:31:37] worst case partition replica loss [21:31:43] that is based on [21:31:52] wait no. [21:32:07] 1012 is the one having the biggest maxlag [21:32:19] no [21:32:21] correction [21:32:23] 12 minutes? [21:32:25] haha [21:32:26] ok [21:32:27] so [21:32:32] i'm taking max lag for an12 [21:32:45] ~90 M [21:32:49] yup [21:33:01] messages in per sec for webrequest text on an12 is about 10K [21:33:15] so per partition (since max lag is for a partition), that's 10000/12 [21:33:19] so [21:34:13] (90060226 msgs / (10000/12) msgs / sec ) = 750.501883333 secs [21:34:23] 12+mins? [21:34:56] does that sound right to you? [21:35:19] 10000 / 12 ~ 1000 [21:35:48] 90M / 1000 --> 90000 secs ? [21:35:54] wrong in my case [21:36:28] ah oops ja [21:36:29] hm [21:36:33] you are right, missed a paren in my calc thing [21:36:53] but it's wrong, that would mean 25h lag [21:36:55] yeah [21:36:59] haah [21:37:01] hm [21:37:14] wellll [21:37:26] Analytics, MediaWiki-extensions-ImageMetrics, Multimedia, Patch-For-Review: Measure how many users have CORS-hostile proxies - https://phabricator.wikimedia.org/T507#1529416 (Tgr) [21:37:28] that might be right, let's look at a different broker. i think something was weird with the lag all yesterday [21:37:34] not sure why, all the ISRs were up [21:38:01] Things are slowly catching up in ISRS [21:38:06] Better with text for instance [21:38:54] ja barely though, hm. [21:39:08] hey, if that other 18 broker leader gets a replica [21:39:12] i think we should restart 18 [21:39:19] oh and in mobile too [21:39:54] there are only 2 text/mobile partitions that have 18 as a leader without an in-sync replica [21:40:08] yup [21:40:20] Wait for a sync, or restart now ? [21:41:20] lets wait a bit i think? [21:41:24] good for me [21:41:35] Max lag chart is stabilised [21:41:59] Traffic is going slowly down --> Hopefully things will catch up and we won't lose more data ! [21:42:36] * joal switches from chart to ISRS and back to charts in circles [21:44:10] whoa max lag on 18 is going down!
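Spelling out the max-lag arithmetic above: the 750-second figure comes from dropping the parentheses (dividing by 10000 and then by 12), and joal's 25 h comes from rounding 10000/12 up to 1000; evaluated exactly, the replica on an12 is missing roughly 30 hours' worth of messages for that partition:

max_lag_msgs = 90_060_226          # worst-case replica max lag quoted for an12
per_partition_rate = 10_000 / 12   # ~833 msgs/sec for one webrequest_text partition

print(max_lag_msgs / per_partition_rate / 3600)   # ~30.0 hours behind
print(max_lag_msgs / 10_000 / 12)                 # ~750.5 s, the missing-paren version
print(max_lag_msgs / 1_000 / 3600)                # ~25 h, joal's 10000/12 ~ 1000 shortcut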
[21:44:14] obvisouly, the partitions that takes longer are theo nes with 12 and 21, which are more behing in max lag [21:44:22] yeah [21:44:24] HUURAYYY ! [21:44:51] Big downslope in messages per second in though [21:45:06] cool the mobile one is there, now just one 18 leader partition to get areplica [21:45:42] one for tect, one for uploads [21:46:04] uploads i'm less worried about [21:46:09] ok ottomata [21:47:12] hey guys, I just caught up [21:47:26] Hi milimetric :) [21:48:22] joal: where do you see the downturn in messages per second, the kafkatest dashboard seems ok [21:49:08] i'd expect a little downturn due to traffic slowing [21:49:17] Looking at kafkatest, 24h span - last hour is getting down [21:50:06] Just changed the dashboard (charts order) [21:50:09] you can reload [21:50:27] We loose more messages than varnishkafka drops it seems [21:51:01] Too difficult to say [21:51:12] maybe not [21:52:55] mmm, yeah, maybe a little sharper than vkafka [21:53:04] hard to say though, yea [21:54:05] joal: what's that little 10 minute dip in all messages in around 21:08? [21:54:12] (all brokers have it) [21:54:34] Don't know milimetric [21:54:52] did something happen 44 minutes ago? [21:55:30] hm, nothing more than camus I think [21:56:04] and camus had been running for a while at that point, no? [21:56:09] yup [21:56:38] camus finsihed about that long ago [21:57:00] the drop happens a while before I think [21:57:42] Maybe not actually [21:58:13] Weird ... Why camus stoppping would lead to less messages being received ? [21:58:40] yeah, weird [21:58:44] ok, so how's 18 looking now? [21:58:54] stil behind on text [21:59:06] 1 partition [21:59:11] when that's caught up you're gonna restart it? [21:59:21] That's hte plan [21:59:24] cool [22:00:07] And actually, 18 is not behind, 12 and 21 are, and we are waiting for them to replicate over 18 :) [22:00:32] so they can be proper leaders when 18 leaves, that's what I thought you were doing [22:00:45] Correct [22:06:03] unfortunetly we are waiting for 21 and 12 to replicate, and they have more partitions in general [22:07:43] ottomata: Upload is good, only text left for 18 [22:07:54] ottomata: some reading along the way : http://siliconangle.com/blog/2015/08/11/etsy-going-all-in-with-kafka-as-dataflow-pipeline-hpbigdata15/ [22:09:06] Also: http://mahout.apache.org/ [22:09:57] joal: we have mahout installed [22:10:02] ellery has used it [22:10:10] I have seen that [22:10:17] Now mahout runs on Spark :) [22:11:56] heh, people love their vertica at Etsy, interesting [22:13:42] milimetric: Having used it a bit, vertica is quite impressive [22:13:50] milimetric: But far too expensive [22:14:30] i mean more interesting from the analyst interface point of view. I was convinced but am becoming certain that SQL is never going to die [22:14:57] I can't give you wormg :) [22:14:58] Analytics-EventLogging, Need-volunteer: Add sanitized User-Agent to default fields logged by EventLogging - https://phabricator.wikimedia.org/T54295#1529669 (Tgr) This happened a while ago, except the "sanitized" part (I think). Can this task be closed, or should it be refocused on the sanitization part? [22:16:20] cmooooon last partition [22:16:26] dunnOoooo [22:16:34] :D [22:17:07] i want to say that i've been following this whole thing on irc and it's super interesting (although super stressful for yougaiz) [22:17:21] :) [22:17:34] ok well, max lag is decreasing on all brokers i think. so, maybe we should just leave it until morning? [22:17:36] without camus runnign? 
[22:17:37] :hug: madhuvishy :hug: [22:18:34] * madhuvishy sends hugs to everyone :) [22:19:06] :) [22:19:22] ottomata: I see nothing wrong with that except camus is not running [22:19:27] what would happen if we enabled that? [22:19:57] i think max lag would not go down :) [22:20:16] I think you are right ottomata [22:20:41] Is there a way to manually change the assigned partitions ? [22:21:01] I think we could relieve a bit 1012 by giving a few partitions to others [22:21:07] Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529694 (Tgr) NEW [22:21:24] or maybe just preventing it to replicate [22:21:32] so how much data could possibly back up before camus can no longer catch up if we keep it off for a day? [22:21:35] btw, renaming main dashboard to kafka [22:21:36] http://grafana.wikimedia.org/#/dashboard/db/kafka [22:21:38] i'll brb [22:21:49] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529703 (madhuvishy) [22:22:04] ottomata: This dashboard is really good :) [22:22:20] milimetric: i don't know, but i think if kafka is normal, then camus should be able to catch up [22:22:24] i think it can read faster than we produce [22:23:06] joal, i think not, because the only way to move the partitions is to have them replicate elsewhere. [22:23:08] ottomata: I think you are right, bit it will be messy day on the cluster :) [22:23:17] yeah, true. [22:23:27] ok understood [22:23:27] hadoop will be backed up with lots of jobs milimetric [22:23:33] and we'll have to babysit them [22:24:19] ottomata: i'm happy to help with that if you decide to go that route [22:24:49] thanks! [22:25:02] which is cooler?! light or dark theme? [22:25:11] i think dark might be hard to read... [22:25:17] but looks cooler for sure! [22:25:24] agreed for both ! [22:25:30] i like the light one [22:28:07] ok, well, hm. [22:28:12] so we wait until morning then, joal milimetric? [22:28:26] if so, i guess we should send an email to analytics list saying data will be late and lossy? :( [22:39:39] back [22:39:51] yes, wait for morning is ok if we're fairly confident about camus [22:39:59] i guess it made it overnight with this config and with a worse an22 [22:40:07] so it should be fine now too [22:40:31] Ok let's go for that [22:40:48] it's not a great choice - lose data now vs. lose possibly less possibly more data later [22:40:53] I think the odds are in our favor if we wait [22:41:02] I would have liked to be able to restart the brokers sooner, but waiting is good as well :) [22:41:33] so if the partitions all catch up, we lose less data if we wait [22:41:45] if they don't catch up, we lose more data because less gets consumed into camus [22:41:52] correct [22:41:53] by camus, sorry [22:42:07] and as far as we can tell right now, partitions are on their way to getting more replicas caught up [22:42:11] so ... we wait, makes sense [22:42:17] so i found the individual partition lag [22:42:21] sounds good [22:42:27] I am gonna get some sleep :) [22:42:44] it is kinda hard to say, but i think 18 will be ready for a restart in a couple of hours [22:42:59] joal: ok, thanks or your help, sleep well [22:43:00] ttyt [22:43:05] i will check on this tonight [22:43:06] BYe team ! [22:43:14] joal, good nite! 
[22:43:46] i'll be around ottomata if you wanna double check before you restart [22:44:13] milimetric: i have a question on the endpoints [22:44:32] https://github.com/milimetric/restbase/blob/test_projectview/mods/pageviews.js#L309 [22:44:51] is this the part that specifies what endpoints we expose? [22:45:19] oh, it just says what resource to respond with [22:45:41] the yaml file defines end points [22:46:17] hmmm, okay then, we don't want to expose the insert fake data endpoints, correct? [22:46:21] madhuvishy: that more defines what resources this module needs in order to work [22:46:32] it's declaring what tables restbase should create for it [22:46:43] milimetric: yup got it [22:46:50] the pageview.yaml defines the endpoints made available [22:47:06] and both of them together I think define the module (what's available and how to respond) [22:47:14] right [22:47:38] madhuvishy: I'm assuming you wanted to change those insert endpoints to insert fake data, right? [22:47:44] if we dont want to expose insert endpoints, can we directly call the insert methods? [22:47:50] we could expose them via pageviews.yaml [22:47:56] and just not expose them via analytics.yaml [22:48:01] but then expose them via test.yaml [22:48:07] milimetric: hmmm i am a little lost [22:48:16] sure, so there are levels of configuration [22:48:36] the top level is defining the modules that get loaded. Let's take test runs as an example [22:48:43] so that'll be ./config.test.yaml [22:48:49] right [22:48:55] that then uses ./specs/test.yaml [22:48:58] yup [22:49:11] and that points to ./mods/pageviews.yaml [22:49:28] so we can leave the insert endpoints down that config path [22:49:55] but not in the analytics one [22:50:01] now, we want to add ./specs/some-folder/analytics.yaml [22:50:13] (because it doesn't make sense in mediawiki i don't think) [22:50:21] right [22:50:22] but yea, in that one we wouldn't include the insert endpoints, right [22:50:33] i see there's an analytics.yaml file existing [22:50:43] right, we made it just for testing, you can ignore I think [22:50:48] okay makes sense. [22:50:58] our analytics.yaml would only configure the pageviews module [22:51:07] basically a copy of the section about pageviews from specs/test.yaml [22:51:16] plus all the other meta config stuff [22:51:18] right [22:51:30] less lost? [22:51:38] so what i've to do first is change the tests so they actually do an insert and check for the data being inserted [22:51:51] right [22:52:05] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529774 (Deskana) As a stakeholder of the EventLogging service provided by Analytics, I request that they decline this task. By definition any co... [22:52:09] and that i can do by just doing get requests to the test insertion endpoints [22:52:14] team, I'll too sign off. see you tomorrow! [22:52:18] madhuvishy: yea [22:52:20] nite mforns [22:52:23] good night mforns [22:52:27] good night! [22:52:33] okay now it's clearer [22:52:36] thanks milimetric [22:52:55] np, hope it works the way I understand things :) [22:53:07] lemme know if not [22:53:12] milimetric: ha ha okay :) [23:22:00] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529860 (Krenair) I think @tgr is talking about per-schema opt-out in the software, rather than a user choice. If I understood him correctly, I su...