[05:45:21] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526649 (awight) > That probably made sense in 2006, when the article that SO post is based on was... [05:56:47] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526651 (BBlack) Even if browsers allow >2K URLs, they seem like a poor idea in general. Even a 1... [06:44:47] Analytics-Wikistats: Adding Odia (Oriya) Wikisource to Stats Wiki - https://phabricator.wikimedia.org/T108012#1526669 (psubhashish1) [[ https://phabricator.wikimedia.org/p/Aklapper/ | @Aklapper ]]: Sorry for that. I was not aware as I was reporting a bug of this kind for the first time. Thanks for helping ou... [07:15:34] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526691 (Tgr) So why don't we just use POST? `sendBeacon` actually does that, we just abuse it cur... [07:31:52] Analytics-Backlog, Analytics-EventLogging, Traffic, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526718 (BBlack) Probably because beacon is used with the analytics pipeline rather than the appse... [09:16:10] Analytics-Tech-community-metrics, Engineering-Community, ECT-August-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1526787 (Aklapper) p:Normal>High [09:16:16] Analytics-Tech-community-metrics, Engineering-Community, ECT-August-2015: Check whether it is true that we have lost 40% of code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1526789 (Aklapper) p:High>Normal [09:21:46] Analytics, MediaWiki-API, Reading-Infrastructure-Team: Load API request count and latency data from Hadoop to a dashboard (limn?) - https://phabricator.wikimedia.org/T108414#1526797 (ArielGlenn) [09:25:12] Analytics-Tech-community-metrics, Engineering-Community, ECT-August-2015: Automated generation of repositories for Korma - https://phabricator.wikimedia.org/T104845#1429335 (Aklapper) Discussed in our meeting: * Bitergia provides the automated process to gather data from Git/Gerrit and have a list upda... [09:36:54] Analytics-Tech-community-metrics, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1526816 (Qgil) Resolved>Open Sorry, I have to reopen. In July, FSF has -2 and Deutsche Telekom has -1. This means that the... 
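On the 1014-byte truncation discussed in T91347 above: EventLogging sends each event as URL-encoded JSON in the query string of the beacon request, so the limit that matters is the encoded length, not the raw JSON size. A standalone sketch of that check (the capsule fields and exact serialization below are assumptions for illustration, not the actual EventLogging client code):

import json
from urllib.parse import quote

TRUNCATION_LIMIT = 1014  # bytes of query string that survive, per T91347

def encoded_size(capsule):
    """URL-encoded size of an event capsule as it would appear in the beacon query string."""
    return len(quote(json.dumps(capsule, separators=(',', ':')), safe=''))

# Hypothetical capsule, for illustration only.
capsule = {
    'schema': 'MobileWebUIClickTracking',
    'revision': 12345678,
    'wiki': 'enwiki',
    'event': {'name': 'hamburger-menu', 'mobileMode': 'stable'},
}
if encoded_size(capsule) > TRUNCATION_LIMIT:
    print('this event would arrive truncated')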
[09:37:09] Analytics-Tech-community-metrics, ECT-August-2015, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1526819 (Qgil) [09:46:19] Analytics-Tech-community-metrics, ECT-August-2015: Jenkins-mwext-sync appears in "Who contributes code" - https://phabricator.wikimedia.org/T105983#1526830 (Qgil) a:Dicortazar [09:47:57] Analytics-Tech-community-metrics, ECT-August-2015: Tech metrics should talk about "Affiliation" or "Organization" instead of companies - https://phabricator.wikimedia.org/T62091#1526841 (Aklapper) [09:48:13] Analytics-Tech-community-metrics, ECT-August-2015: Tech metrics should talk about "Affiliation" or "Organization" instead of companies - https://phabricator.wikimedia.org/T62091#659085 (Aklapper) (updated summary as per last comments) [09:51:26] Analytics-Tech-community-metrics: Closed tickets in Bugzilla migrated without closing event? - https://phabricator.wikimedia.org/T107254#1526845 (Aklapper) [12:45:58] Anyone know who the phabricator admins are? [12:46:02] greg-g, you look plausible [13:44:26] morning ottomata [13:44:31] I've been looking at the loss [13:44:50] it looks like we have duplicates on a few hosts, and loss on cp1008 [13:45:07] cp1008.wikimedia.org, that is [13:45:40] morning [13:45:47] cp1008.w sounds like a weird host [13:45:54] i am looking at lots of stuff too [13:45:59] let's sync up in a bit [13:46:02] with joal too [13:46:08] Heya [13:46:15] give me 5 minutes please :) [13:47:26] k [13:48:52] batcave ? [13:49:26] ja i need more than 5 [13:55:41] Analytics: Transform to XML-->JSON in sorted file format - https://phabricator.wikimedia.org/T108684#1527274 (Halfak) NEW [13:56:52] FYI joal, https://phabricator.wikimedia.org/T108684 [13:57:10] ottomata: cp1008 gets a tiny number of requests, the diff is like 4 events, 6 events 16 events, etc. [13:58:02] halfak: Thanks ! [13:58:20] Will move that in our boards :) [13:58:23] halfak: --^ [13:58:31] :D [13:58:46] Analytics, Analytics-Backlog: Transform to XML-->JSON in sorted file format - https://phabricator.wikimedia.org/T108684#1527285 (JAllemandou) a:JAllemandou [13:59:20] halfak: I move the old one to done [13:59:24] With a comment [13:59:38] ok ready for batcave! [14:03:04] Analytics-Cluster, Analytics-Kanban: Read wiki dumps in Spark {hawk} - https://phabricator.wikimedia.org/T92385#1527305 (Halfak) I just talked to @JAllemandou and it looks like he's gone as far as he can with spark. We're able to read and extract JSON from XML dumps at high speed, but we're not able to u... [14:17:09] joal: http://grafana.wikimedia.org/#/dashboard/db/kafkatest [15:01:51] Ironholds: releng, generally. Mukunda (twentyafterfour ) and andre specifically [15:02:07] (andre isn't part of releng, but whatevs) [15:02:11] greg-g, thanks! It got resolved :) [15:31:42] halfak: no update on my side, except if you need me, I'll skip the altiscale meeting [15:32:00] joal, could you respond to Soam [15:32:05] 's email re. use of spark? [15:32:10] I will ! [15:32:17] Otherwise, I'll kill the meeting. [15:49:51] (CR) Milimetric: "I think the idea with Sunday was that data would be ready to look at on Monday when people got back to work. 
That seems like a good enoug" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/230649 (https://phabricator.wikimedia.org/T108593) (owner: Mforns) [16:00:23] (PS2) Mforns: Add support for weekly frequency and granularity [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/230649 (https://phabricator.wikimedia.org/T108593) [16:13:45] halfak: https://twitter.com/CompSciFact/status/631129335471190016 [16:13:55] halfak: makes me think of our work :) [16:14:14] :D [16:14:16] +1 [16:18:11] joal: still going to lunch soon, but you are totally right about kafka log file sizes [16:18:20] 2015-08-10T08 53151892928 [16:18:20] 2015-08-10T09 55836248198 [16:18:20] 2015-08-10T10 57984054557 [16:18:20] 2015-08-10T11 63353197416 [16:18:20] 2015-08-10T12 68184938548 [16:18:21] 2015-08-10T13 69259218741 [16:18:21] 2015-08-10T14 79567698089 [16:18:22] 2015-08-10T15 133643184876 [16:18:22] 2015-08-10T16 168515916825 [16:18:23] 2015-08-10T17 181394338213 [16:18:31] thats' summed logs sizes grouped by hour [16:21:09] ottomata: let me know when you're back from lunch [16:25:40] Analytics-EventLogging, Analytics-Kanban: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1527778 (mforns) a:mforns [16:44:14] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [16:46:23] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [17:01:26] joal: see the message about snappy? Do you know what version we're using? [17:04:07] aha! https://github.com/wikimedia/operations-debs-kafka/blob/a63cb8805fbdc80a053344d599e302e17eefa9e0/debian/source/include-binaries#L24 [17:04:16] so that's the problem then, maybe!! :) [17:06:00] we need to update to 1.1.1.7 and magic should happen and we should all be happy and drink ginger beer [17:06:05] https://issues.apache.org/jira/browse/KAFKA-2189 [17:19:27] milimetric: That looks to the point indeed ! [17:20:28] We need ottomata to actually havethe magic happen, but it seems nailed down :) [17:33:58] joal: meeting? [17:34:07] oops, yeah arriving [17:38:13] ottomata: see the snappy upgrade issue? [17:38:16] https://issues.apache.org/jira/browse/KAFKA-2189 [17:38:20] we need 1.1.1.7 [17:38:21] no tyet [17:38:22] not yet [17:38:26] just signing on [17:38:28] it looks very promising [17:38:32] someone replied to your message [17:39:18] We found that this is caused by messages being seemingly recompressed individually whoa [17:40:33] yeah :) [17:41:17] intersting, ok. the compression is done on the producers though, hm. [17:41:23] not with snappy java [17:41:31] we did upgrade snappy versions on the brokers [17:41:39] but only to 1.1.1.6 right? [17:41:42] yes [17:41:58] the one guy mentioned they saw the improvement from 1.1.1.6 yo 1.1.1.7 [17:41:59] i don't have a full understanding of how all that works, so its worth a try on one kafka broker to see if log sizes change [17:42:01] right [17:42:03] but [17:42:07] he's using the java clients [17:42:11] to compress the messages [17:42:23] the snappy compression is done by librdkafka via varnishkafka [17:42:28] before the messages get to the brokers [17:42:32] i *think* [17:42:33] pretty srue. [17:42:46] hm... and they don't decompress them all the way to hdfs? 
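The per-hour totals ottomata pastes above (2015-08-10T08 53151892928, and so on) are Kafka log segment sizes summed by hour. A hypothetical reconstruction of that check, assuming segments are grouped by their modification hour and that LOG_DIR points at the broker's log.dirs:

import os
from collections import defaultdict
from datetime import datetime

LOG_DIR = '/var/spool/kafka'  # assumption; use the broker's actual log.dirs

sizes = defaultdict(int)
for root, _, files in os.walk(LOG_DIR):
    for name in files:
        if not name.endswith('.log'):  # only segment data files, skip indexes
            continue
        st = os.stat(os.path.join(root, name))
        hour = datetime.utcfromtimestamp(st.st_mtime).strftime('%Y-%m-%dT%H')
        sizes[hour] += st.st_size

for hour in sorted(sizes):
    print(hour, sizes[hour])

If the snappy recompression bug is the culprit, the per-hour totals should jump around the time of the broker upgrade and shrink again once a fixed snappy-java is in place.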
[17:42:53] consumers decompress [17:42:59] it should be stored as compressed, i think [17:43:01] but i'm not 100% [17:43:14] and its pretty easy to try on a singel broker and see [17:43:20] can drop in the jar and restart [17:43:25] cool, yeah [17:43:58] but maybe we need to upgrade the snappy varnishkafka uses too? [17:46:36] Analytics-EventLogging, Analytics-Kanban: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1528116 (mforns) **Current hypothesis:** The metric's poor sync creates the false alerts. //The raw vs validate... [17:47:55] hm, unlikely milimetric, libsnappy hasn't changed there [17:52:57] ok, i'm going to turn camus off again to let teh ISRS be good before I try to restart a broker [18:06:08] ottomata: cave ? [18:08:40] joal: sure my internet is kinda slow at this cafe [18:08:47] ottomata: np, later [18:08:53] I'd like to follow on kafka [18:09:02] need to take of the baby though for a moment [18:09:06] will be back soon [18:10:04] ok [18:12:06] Analytics-EventLogging, Analytics-Kanban: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1528316 (mforns) I think the simplest option is to **modify the percentage of data points needed to trigger the... [18:12:48] oof, milimetric, joal, replcas are not catching up, even with camus off now. [18:12:52] i think traffic is too high for them. [18:13:19] i can't turn a broker off now beacuse each partition in text/upload only has one leader [18:13:32] wanted to restart one with new snappy lib [18:13:34] hm. [18:13:51] could possibly bring new brokers into the mix, and move partitions over? [18:13:55] with the new snappy version. [18:14:04] maybe just add one, and try to move a single partition? [18:16:09] haha, i want to do this. someone come tell me it isn't insane! [18:17:34] ottomata: um... why not just turn camus back on? [18:17:44] i just turned it off again [18:17:45] you mean? [18:17:46] i did. [18:18:02] turning on camus makes things lag more [18:18:03] i'm confused, cave? [18:18:06] ja [18:18:30] brt [18:19:30] dan, am going to switch to phone internet, cafe internet too slow [18:20:17] k [18:20:43] :) hi joal, we're about to hang out [18:20:47] :) [18:21:41] yargh phone is really slow too! [18:21:47] 4G! where's my LTE?! [18:21:47] gahhh [18:23:07] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: EventLogging Icinga Alerts should look at a longer period of time to prevent false positives {stag} [5 pts] - https://phabricator.wikimedia.org/T108339#1528355 (mforns) If we merge that, we should test that the alert is still working by for ex... [18:31:56] so if the replicas were less busy serving camus [18:32:02] then they'd have more time to catch up [18:32:05] and become back in sync [18:32:45] so possible idea: turn camus off for like 10 minutes, see if the replicas are doing better, and then restart a broker with the new snappy [18:36:55] ottomata2, if we wanted maps data in HDFS (it's streaming through kafka but apparently not in Hadoop?) who would we tag on that ticket? [18:37:10] like, is this for the maps engineers, analytics engineers, ops generally, analytics ops specifically, maps ops specifically..? 
[18:37:36] Ironholds: me, or joseph, but i wouldn't count on it at the moment, we are havin gserious kafka issues after this upgrade, trying to fix things [18:37:40] i have camus turned off at the moment, trying to figure this out [18:38:00] i made a patch to do it, but fixin gthis is #1 priority [18:38:18] ottomata2, eep. Makes sense! [18:38:40] good luck! [18:38:48] Ironholds: you can tag Analytics-Backlog and it'll always come to our attention, we groom weekly [18:39:12] yurik: ^^^ [18:39:20] milimetric, great :) [18:40:01] thanks! [18:41:27] Analytics-Backlog: Stream maps cluster requests into HDFS - https://phabricator.wikimedia.org/T108717#1528428 (Ironholds) NEW [18:41:39] Analytics-General-or-Unknown: Statistics for Wikidata API usage - https://phabricator.wikimedia.org/T64873#1528435 (Addshore) a:Addshore>None [18:42:41] (PS2) Yurik: Make camus import webrequest_maps from new maps varnish cluster [analytics/refinery] - https://gerrit.wikimedia.org/r/230535 (https://phabricator.wikimedia.org/T105076) (owner: Ottomata) [18:46:51] Analytics-Backlog, Discovery-Maps-Sprint, Patch-For-Review: Stream maps cluster requests into HDFS - https://phabricator.wikimedia.org/T108717#1528457 (Tfinc) [18:53:27] ottomata2: we can just talk in IRC, phone's a bit hard to hear anyway [18:53:52] yeah [18:53:53] k [18:54:05] so if you refresh http://grafana.wikimedia.org/#/dashboard/db/kafkatest [18:54:09] i added a text-7 log size graph on the lower right [18:54:21] hmm, is that mesages or bytes? [18:54:24] lemme see.. [18:55:44] hm not sure [18:56:06] if it was bytes, then if this bug is the culprit, we will see an22's size for this partition go down over time [18:56:13] maybe for all of them thouhg, since an22 is the leader [18:56:22] no idea how this snappy stuff would be relevant relaly though [18:56:24] i'm going to reply to that emamil [19:00:12] well, ottomata the messages in per second and bytes out per second dropped by a similar factor on an22 [19:00:20] so we can keep comparing that as it catches up [19:00:31] and what we're looking for is the network dropping and the messages rising back to normal, right? [19:01:56] yes, that and smaller log size [19:02:26] I am looking at disk read/write for 1022 [19:02:34] see if it changes [19:03:23] yep, we gotta give it some time for all those things, while it catches up [19:05:57] ha, actually, snappy support is built in with librdkafka, doesn't rely on external c lib [19:06:01] https://github.com/edenhill/librdkafka/blob/master/src/snappy.c [19:06:21] hum [19:06:47] i was double checking snappy stuff there, and libsnappy1 isn't installed, and iwas liek UhhHHh how did this ever work. but, thats how! :)_ [19:06:58] right [19:07:08] line 6 makes me fear :) [19:08:58] sorry, my read/write graphs aren't good in that dash joal [19:09:02] they are for only one broker [19:09:11] ottomata: I added my own :0 [19:09:14] will add other brokers, i think i can sum them [19:09:15] ok nice! [19:09:37] no drop in disk write [19:15:31] seems like a small improvement to me [19:15:46] because network out has gone down by about maybe 60% [19:16:00] while messages in is only down around 35% [19:16:30] Guys, need to get diner ! [19:16:48] Will be back after [19:17:09] bon apetit [19:17:21] ottomata: unless you think this change was harmful, which it doesn't seem to have been, I'm for doing it with the other brokers [19:17:36] are the partition leaders roughly balanced now? 
[19:18:04] no, but they are very very slightly better [19:18:10] but that could be jsut because we aren't runnign camus [19:18:19] network read is going to go down by a lot because we aren't running camus [19:18:36] i want to see a change in log files sizes [19:18:46] that's the only thin gthat will really convince me, because all the other stats ahve lots of stuff flowing through them [19:19:22] milimetric: i think the stas you are looking at are down because when we restarted an22, it lost leadership for lot sof partitions [19:20:03] you can see that manifested in the messages in per second at the top of the dash [19:20:05] i'm looking at the proportion drop in messages in compared to network out [19:20:11] an22 is handling a smaller proportion [19:20:15] ok [19:20:18] so what i'm saying is messages in went down by around 35% [19:20:25] but network out went down by a higher percentage [19:20:32] meaning maybe it's a good sign [19:22:08] hm ok [19:22:14] weird that the log size on 1018 went down when you bounced 1022... [19:22:18] welllllll but network out also includes replication from 2 other brokers [19:22:56] ah, cool, so then the small difference in relative drops is even better [19:22:58] milimetric: i think that's because of the truncation [19:23:03] 1018 was also following 1022 [19:23:03] oh ok [19:23:08] makes sense [19:23:13] looks like 12 became the leader [19:23:15] and it was behind [19:26:26] milimetric: i think that truncation was a lot of data [19:26:37] few hours worth for those partitions [19:26:41] oh, wow [19:27:06] but most of that was already ingested via camus right? [19:27:13] except the last 20 minutes when it was not running? [19:27:35] not sure, camus was lagging too because we had turned it off for a bit [19:28:28] before you restarted 1022, jo said camus had finished running [19:28:48] can it finish before ingesting everything? Does it have a max ingestion size or something? [19:29:53] Hey ottomata, milimetric! If u have a sec, got some more quick questions here... Did I understand correctly yesterday that it's possible to send data to Kafka directly from the browser? (Just looking in the logs of our chat, but I couldn't find it...) [19:29:54] Analytics, Reading-Web: make MobileWebUIClickTracking schema usable - https://phabricator.wikimedia.org/T108723#1528745 (Jdlrobson) [19:30:00] Analytics, Reading-Web: make MobileWebUIClickTracking schema usable - https://phabricator.wikimedia.org/T108723#1528729 (Jdlrobson) We already sample. We can sample more but this is going to be a lot of data regardless of what we do. The main issue you are having with sampling is purging is not happening... [19:30:33] milimetric: i think it has a max run time [19:30:41] and will quit and let a new instance start up [19:30:44] AndyRussG: yes, it's possible, but right now we're still messing with some prod issues. I'll side-chat you :) [19:31:03] # Max minutes for each mapper to pull messages (-1 means no limit) [19:31:03] # Let each mapper run for no more than 55 minutes. [19:31:03] # Camus creates hourly directories, and we don't want a single [19:31:03] # long running mapper keep other Camus jobs from being launched. [19:31:03] kafka.max.pull.minutes.per.task=55 [19:31:09] k, hm [19:31:21] AndyRussG: not from the browser, no, but maybe! [19:31:24] so we'll see that in the webrequest stats [19:31:56] you won't be able to 'send directly to kafka' from the browser, but the new event system will support something like this. probably in the context of eventlogging, but maybe not. 
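To put numbers on the comparison milimetric makes above: if an22's messages in dropped about 35% (mostly from losing partition leadership on restart) while its network bytes out dropped about 60%, then bytes out per message fell to roughly 0.62x of its previous value. A back-of-the-envelope check, illustrative only since network out also carries replication fetch traffic:

# Percentages quoted in the channel; the dashboards are the source of truth.
messages_in_drop = 0.35   # relative drop in an22 messages in per second
network_out_drop = 0.60   # relative drop in an22 network bytes out

bytes_per_message_ratio = (1 - network_out_drop) / (1 - messages_in_drop)
print(f'bytes out per message ~{bytes_per_message_ratio:.2f}x previous')  # ~0.62x

A ratio below 1 is consistent with the new snappy compressing better, but it is only a hint; the log-size comparison that follows is the more convincing measurement.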
[19:32:30] ottomata: what about posting something and having varnish-kafka send it to kafka? [19:32:59] AndyRussG: that is possible, and how eventlogging on kafka will work soon. [19:33:04] not POST though, since eventlogging is query data [19:33:22] haven't tried post with varnishkafka, dunno if/how that is available to the varnishlog api, but i would think it would be [19:34:04] milimetric: it is still yet hard to know for sure, but i thikn logs are smaller. [19:36:12] i think we'll have to wait until this hour is out for a good comparison. [19:36:15] Analytics-Kanban: Check and potentially timebox limn-flow-data reports {tick} [5 pts] - https://phabricator.wikimedia.org/T107502#1528764 (mforns) a:mforns [19:37:18] yeah, ottomata, the messages in per second recovered on 22 and the network out stayed low [19:37:43] hm also, milimetric, the partitions for which 22 is the leader have a full ISR now [19:37:46] that is a good sign [19:37:46] that makes me fairly optimistic about trying this with other brokers. Though what you said about the data loss is sad [19:37:53] cool [19:41:32] milimetric: i'm fairly certain this is working. [19:41:50] i'm comparing 2 upload partitions [19:42:00] one has 22 as an in sync replica, but leader is 18 [19:42:06] the other has 22 as the leader [19:42:40] the one where 22 is the leader is 1/4 the size of the one where 18 is hte leader [19:42:50] which, is the compression ratio i expect from snappy [19:43:09] snappy compresses our json about 25% of orignal size [19:43:16] makes sense to me [19:43:24] at least, it did when i checked that out years ago :) [19:43:32] that's the same thing I'd guess from this relative percent calculation i'm doing [19:43:37] so um, ok. the right thing to do is to actually build a new kafka package with this version [19:43:58] yes [19:44:02] i think i want to do that rather than manually apply this again [19:44:15] because broker restarts hurt righ tnow, and I'd rather the restart we do just have it bring in the right thing [19:44:19] without having to do a restart later... [19:44:20] hm [19:44:28] although, we are planning on upgrading these to jessie [19:44:32] one everything settles [19:46:26] so, i could just replace the jar now and restart brokers and keep fingers crossed. [19:46:29] naw, lemme rebuild... :) [19:47:07] yeah, more sanity is best here I think [19:47:29] we can retrace our steps easier this way too [19:48:54] key part from the email response ottomata: "load generator" [19:49:00] they have a load generator!! :) [19:49:10] lucky dogs [19:49:14] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [19:49:27] its not just load, milimetric. i think you are right that having an awesome staging env for kafka and hadoop will be very useful [19:49:42] i just think it will take a lot of work to build something that will be a useful comparison to prod [19:49:47] not somehting we can jjust spin up in labs in an afternoon righ tnow [19:49:51] yes [19:50:07] but worthwhile, this environment is complex enough that we'll have these issues again [19:51:00] Analytics, Reading-Web: make MobileWebUIClickTracking schema usable - https://phabricator.wikimedia.org/T108723#1528810 (bmansurov) I agree, we should aggregate the old data and then purge it. 
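A hypothetical version of the two-partition size comparison ottomata describes above, assuming partitions live at <log.dir>/<topic>-<partition>/ and that the two partitions compared carry similar traffic (the paths and partition numbers below are made up):

import os

def partition_size(path):
    """Total bytes of all files in one topic-partition directory."""
    return sum(os.path.getsize(os.path.join(path, f)) for f in os.listdir(path))

# One partition whose leader is the upgraded broker (an22), one whose leader
# still runs the old snappy-java; both paths hypothetical.
upgraded = partition_size('/var/spool/kafka/webrequest_upload-3')
old = partition_size('/var/spool/kafka/webrequest_upload-7')
print(f'upgraded-leader partition is {upgraded / old:.2f}x the old-leader one')
# Roughly 0.25x would match the expected snappy ratio on webrequest JSON.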
[19:51:34] yeah [19:53:24] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [20:13:23] milimetric: this morning, did you say, you wanted me to remove the [WIP] tag from the task before you can review it? [20:13:49] madhuvishy: yeah, I merged it though, you removed it, right? [20:14:22] milimetric: yes i removed it. but i thought you had comments, and i hadn't addressed Marcel's comments on the previous patchset either [20:14:25] man ok, internet too slow to do this, heading home, milimetric you around for a bit, right? might want sanity partner shortly [20:14:33] (Abandoned) Milimetric: Disable 19 queries from the scheduler [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/210364 (https://phabricator.wikimedia.org/T98979) (owner: Milimetric) [20:14:41] oh wait, phone looks stronger, gonna try that real quick [20:14:56] ottomata: yes, i'm here [20:15:09] madhuvishy: sorry! [20:15:18] I thought once you removed it you meant it was ready :) [20:15:32] mforns: you gotta -1 things if you want to block the merge [20:16:00] madhuvishy: feel free to submit another change [20:16:08] milimetric: no problem. ya that's where i got confused. I thought you were saying, you would review only if i removed WIP [20:16:30] ok, this is better! [20:16:34] it's true, but if someone else reviewed and found a problem, they need to -1, otherwise others reviewing don't know [20:16:43] ottomata1: good stuff [20:17:29] milimetric: ahh, that's why. that's okay. mforns suggested we name it queuedForInsertion rather than insertAttempted [20:17:48] ok, milimetric in order to do this i will have to restart brokers, and the same thing is going to happen [20:17:58] madhuvishy: i'm ok with either of those names [20:18:00] we will lose dat ain kafka, and this time, we aren't running camus, so its gonna disappera [20:18:13] ottomata1: can we run camus for a bit, let it catch up? [20:18:24] hm, i don't think so. [20:18:28] i mean. hm [20:18:30] i'm not sure [20:18:51] but i don't think so, as it just causes more load. and i don't know how long it was lagging either, hm, lemme look at something [20:19:03] right, so we'd lose a few hours [20:19:42] ja, last hour we have data for right now is 17 [20:19:48] we are into 20 now [20:19:57] milimetric: i'm going to do a manual camus run and see how it goes [20:19:57] milimetric: okay, me too, i will leave it as is then. [20:20:11] it'll likely run for the full 55 minutes it is allotted, we'll see how much it imports during that time [20:20:39] ok, ottomata1 then we can calculate how long we'd have to wait for it to catch up fully and weigh the data loss against the possible risk of running as is [20:21:40] interesting! [20:21:46] 15/08/11 20:21:27 ERROR kafka.CamusJob: The current offset was found to be more than the latest offset [20:21:46] 15/08/11 20:21:27 ERROR kafka.CamusJob: Moving to the earliest offset available [20:21:58] that means that camus did import some of the stuff we lost from kafka when we restarted an22 [20:22:34] ah, cool [20:22:40] ottomata: Fun :) [20:23:02] ottomata, milimetric : I have backlogged, but what is current status ? [20:23:25] we're seeing how long it takes camus to catch up [20:23:36] :) [20:23:41] to consider when / how we upgrade all the other kafka brokers with the new snappy [20:23:41] joal: current status, looks like the snappy update fixes the problem [20:23:45] After having restarted all the beasts ? 
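On milimetric's "calculate how long we'd have to wait for it to catch up" point above, the catch-up time is the backlog divided by how much faster camus can consume than varnishkafka produces. All rates below are made up for illustration; the real ones would come from the Kafka dashboards and past camus run logs:

# Illustrative numbers only; substitute measured rates.
produce_rate = 150_000   # msgs/sec into the webrequest topics (assumed)
consume_rate = 400_000   # msgs/sec camus mappers can pull when healthy (assumed)
backlog_hours = 3        # how long camus has been off

backlog_msgs = produce_rate * backlog_hours * 3600
catch_up_hours = backlog_msgs / (consume_rate - produce_rate) / 3600
print(f'~{catch_up_hours:.1f} hours of camus runs to drain the backlog')
# If consume_rate is not comfortably above produce_rate, the backlog never drains.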
[20:23:48] log sizes for leader partitions on an22 are much smaller [20:23:51] no [20:23:53] no other restarts [20:23:56] ottomata: That is ggod news :) [20:24:05] ok cool [20:24:08] we are seeing if we can avoid losing data by running camus before we do a restart [20:24:14] hopefully we can get the data into hdfs from the leaders [20:24:19] dunno though, cause that will cause more IO on kafka [20:24:25] hmm .. [20:24:29] we are trying a manual camus run to see how it does [20:24:39] ok [20:24:45] For how long as been off ? [20:24:54] a while, several hours now [20:24:59] 3h roughly ? [20:25:14] yes [20:25:43] 2h actually [20:26:41] naw, this is no good really [20:27:25] hm [20:27:26] not sure [20:27:28] so, when camus runs [20:27:32] I suggest we restart the brokers before trying to kafka [20:27:45] network bytes in drop[s [20:27:57] which i guess could just be the replicas not fetching [20:28:11] but, coudl also be producers failling [20:28:13] not sure though. [20:28:18] I think having less data (sizewise) to sync + to send to HDFS is better [20:28:24] Even with a little bit more lag [20:28:32] i'll be in a meeting for a bit, btw [20:28:42] k milimetric [20:29:16] joal: huh? [20:29:19] you mean trying to camus? [20:29:30] the problem is, we can't restart brokers without truncating data [20:29:41] since many partitions only have one ISR - the leader [20:29:55] i that leader is restarted, another broker is promoted to leader, and it is far behind [20:30:01] possibly hours at this point [20:30:03] I mean I'd rather restart the brokers, ensure kafka cluster get's back into (kindda) stability, then restart camus [20:30:24] ottomata: true [20:30:36] joal: me too, but that would mean likley dropping several hours of data [20:30:58] Now it's poker game: either we wait, and expect the traffic downtime to allow the cluster to catchup and let us restart without too much dataloss [20:31:18] Either we restart now, because we think that thing will never catch up [20:31:31] I make my poker face on that call :) [20:32:41] seems like a big drop in network bytes in 5/10 minutes agon [20:32:45] normal ottomata ? [20:33:21] joal: that is me starting camus [20:33:26] wow [20:33:30] i *think* that is from replicas lagging [20:33:30] ok ! [20:33:44] few, big drop ! [20:33:55] it could also be from produce reqs failing from vks [20:33:55] but [20:34:00] i'm looking at them now, and i don't think so [20:34:28] ok [20:34:31] so no data lost [20:34:32] +1 to restart brokers guys, seems like camus is not likely to get back to normal with the current problems [20:34:34] cool [20:34:56] lets let it finish this run, it will only run for another 30 mins [20:35:03] and then lets see how much it imported in hadoop [20:35:22] hm, if we keep camus stopped, and wait for downtime, is there any chance kafka would catch up in the actual statte [20:35:24] if we think we can get it to import most of the last hours over a few runs, then lets do it [20:35:25] ? [20:35:26] ottomata: --^ [20:35:32] joal: possible, yes. 
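The reason a restart truncates data, as ottomata explains above, is that many partitions are down to a single in-sync replica, the leader itself. A rough sketch for listing those partitions from the describe output; it assumes the usual "Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ..." line format, and the `kafka topic --describe` wrapper mentioned below should print the same lines as upstream kafka-topics.sh:

import re
import subprocess

# Assumption: run on a broker; the zookeeper address is a placeholder.
out = subprocess.run(
    ['kafka-topics.sh', '--describe', '--zookeeper', 'localhost:2181'],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    m = re.search(r'Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+).*Isr:\s*([\d,]*)', line)
    if not m:
        continue
    topic, partition, leader, isr = m.groups()
    in_sync = [r for r in isr.split(',') if r]
    if len(in_sync) <= 1:
        # Restarting this partition's leader promotes a far-behind follower,
        # truncating whatever it has not replicated yet.
        print(f'{topic}-{partition}: leader {leader}, ISR {in_sync}')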
[20:35:46] i tend to think it will, because it was fine most of last night, until high load time now [20:35:53] so, yeah that is an idea too, wait until morning [20:36:05] In that case, wait, then restart when back in stability, then camus (without any other prod job in the cluster [20:36:20] possible that in morning, all ISRS will be fine, and we can restart brokers without losing data, and then run camus without hurting things [20:36:24] it will be a lot more for camus to import [20:36:34] yup [20:36:47] So it will mean tight cluster job management [20:36:53] yeha [20:36:59] lots of backed up jobs [20:36:59] only concern: [20:37:26] If it fails (no ISRS back in track tomorrow), then we loose that much of data in addition to what has already been lost now [20:37:32] yes [20:37:36] true. [20:37:42] * joal puts the poker face [20:37:52] milimetric: thoughts? [20:38:14] well, i mean, we will at least wait for this current camus run to finish [20:38:18] right [20:38:22] maybe 25 more mins [20:38:22] let's wait for that run [20:38:40] ok, i'm going to run home then while this runs [20:38:40] ottomata: what defines a camus run length ? [20:38:52] we'll discuss that later :) [20:38:57] you can get home ;) [20:39:00] https://github.com/wikimedia/analytics-refinery/blob/master/camus/camus.webrequest.properties#L63 [20:39:13] joal: also fine if you want to sign off for the eve. its getting late there [20:39:22] hm, seems like you answered that question not so long ago ! [20:39:26] :) [20:39:57] is set that limit so that a single slow map task wouldn't keep other new camus jobs from being launched [20:40:02] like, say text takes a long time [20:40:04] but mobile doesn't [20:40:10] i still want mobile to be imported regularly [20:40:15] makes sense [20:40:17] don't want it to block on a long text job [20:40:24] so starting a new camus job starts new mappers for all partitions [20:40:49] haha, joal! i need to add this to deployment plan. change camus properties kafka.brokers! :) [20:41:07] WOoooW ! Good call Sir ~! [20:42:22] i mean, we would have nnoticed and it wouldn't ahve hurt [20:42:35] we woulda been like "wheres that data?? OH WHOOPSIE!" [20:42:39] and then fixed but ja [20:42:45] yup [20:43:03] ok, running home, back shortly [20:52:16] milimetric, madhuvishy, sorry I was in a meeting with Jon Katz [20:52:56] sorry for the confusion, milimetric I know I should -1 to block changes, but it was a superficial comment, not critial [20:53:08] I'm also ok with the initial name [20:53:22] :] [20:57:54] mforns: no problem :) thanks [20:58:22] o/ [21:05:41] weird, network bytes out spiked again for an22 [21:05:55] i'll be back in a bit, gotta go grab some stuff [21:13:34] ottomata is back ! [21:13:36] heyo [21:13:54] camus still going? [21:14:03] yup [21:14:12] should be done soon i guess [21:14:14] but load is less than [21:14:18] bafore [21:14:30] network bytes in getting on the uptrend [21:14:49] Also, seems we have past the traffix top point [21:14:57] Should only get down from now on [21:18:22] Hadoop jobs are not too bad: only upload is late [21:18:26] (two hours) [21:18:36] ah ja ok camus done [21:18:44] cool [21:19:24] hm, yeah but 19 and 20 are much smaller than they should be [21:19:25] joal [21:19:26] for tet [21:19:28] text [21:19:32] 30G ./17 [21:19:32] 23G ./18 [21:19:32] 14G ./19 [21:19:32] 948M ./20 [21:19:37] 18 too i think [21:19:57] I don't get it [21:20:02] hm, bits imported [21:20:18] haha, no data in bits though [21:20:24] true ! [21:20:43] ok welp [21:20:45] now what to do. 
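The "30G ./17 ... 948M ./20" listing above is the check that shows hours 19 and 20 came in short. A hypothetical equivalent against the raw webrequest data in HDFS; the base path and directory layout are assumptions, adjust to the real camus output location:

import subprocess

BASE = '/wmf/data/raw/webrequest/webrequest_text/hourly/2015/08/11'  # assumed layout

def hdfs_du(path):
    out = subprocess.run(['hdfs', 'dfs', '-du', '-s', path],
                         capture_output=True, text=True, check=True).stdout
    return int(out.split()[0])  # first column is the size in bytes

sizes = {hour: hdfs_du(f'{BASE}/{hour:02d}') for hour in range(17, 21)}
biggest = max(sizes.values())
for hour, size in sorted(sizes.items()):
    flag = '  <- suspiciously small' if size < 0.5 * biggest else ''
    print(f'{hour:02d}: {size / 2**30:.1f} GiB{flag}')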
[21:21:00] What's the status of ISRS ? [21:21:00] run camus again? [21:21:19] btw joal you can check that too! [21:21:21] log into a broker [21:21:25] then [21:21:27] kafka topic --describe [21:21:54] joal: only upload and text partitions for which 22 is the leader have full ISRs [21:22:01] Cool, I didn't think I had the right to log onto those machines :) [21:22:41] some mobile partitions even only have one leader [21:22:48] probably the camus run did not help with replication though [21:23:29] I think you are right :) [21:23:30] milimetric: joal, i think i want to just restart brokers and deal with lost data. [21:23:41] mobile, text and upload are behind for ISRs [21:24:08] hm ... [21:24:23] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL 20.00% of data above the critical threshold [30.0] [21:24:58] ottomata: Your call, I wonder if letting things stabilise (no camus) wouldn't let us restart in a better state [21:25:18] it's hard to tell [21:25:20] But I'll follow you if you prefer to go for a restart now :) [21:25:55] Is there a way to judge how far behind the replicas are ? [21:26:01] ottomata: --^ [21:26:25] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK Less than 15.00% above the threshold [20.0] [21:26:38] yeah hm i think so.. [21:30:31] i can estimate worst case based on max lag on a replica [21:30:34] ... [21:31:27] joal: on analytics1012 [21:31:31] i think 2.5 hours [21:31:37] worst case partition replica loss [21:31:43] that is based on [21:31:52] wait no. [21:32:07] 1012 is the one having the biggest maxlag [21:32:19] no [21:32:21] correction [21:32:23] 12 minutes? [21:32:25] haha [21:32:26] ok [21:32:27] so [21:32:32] i'm taking max lag for an12 [21:32:45] ~90 M [21:32:49] yup [21:33:01] messages in per sec for webrequest text on an12 is about 10K [21:33:15] so per partition (since max lag is for a partition), that's 10000/12 [21:33:19] so [21:34:13] (90060226 msgs / (10000/12) msgs / sec ) = 750.501883333 secs [21:34:23] 12+mins? [21:34:56] does that sound right to you? [21:35:19] 10000 / 12 ~ 1000 [21:35:48] 90M / 1000 --> 90000 secs ? [21:35:54] wrong in my case [21:36:28] ah oops ja [21:36:29] hm [21:36:33] you are right, missed a paren in my calc thing [21:36:53] but it's wrong, that would mean 25h lag [21:36:55] yeah [21:36:59] haah [21:37:01] hm [21:37:14] wellll [21:37:26] Analytics, MediaWiki-extensions-ImageMetrics, Multimedia, Patch-For-Review: Measure how many users have CORS-hostile proxies - https://phabricator.wikimedia.org/T507#1529416 (Tgr) [21:37:28] that might be right, let's look at a different broker. i think something was weird with the lag all yesterday [21:37:34] not sure why, all the ISRs were up [21:38:01] Things are slowly catching up in ISRS [21:38:06] Better with text for instance [21:38:54] ja barely though, hm. [21:39:08] hey, if that other 18 broker leader gets a replica [21:39:12] i think we should restart 18 [21:39:19] oh and in mobile too [21:39:54] there are only 2 text/mobile partitions that have 18 as a leader without an in-sync replica [21:40:08] yup [21:40:20] Wait for a sync, or restart now ? [21:41:20] lets wait a bit i think? [21:41:24] good for me [21:41:35] Max lag chart is stabilised [21:41:59] Traffic is going slowly down --> Hopefully things will catch up and we won't lose more data ! [21:42:36] * joal switches from chart to ISRS and back to charts in circles [21:44:10] whoa max lag on 18 is going down!
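Spelling out the max-lag arithmetic above: the 750-second figure comes from dropping the parentheses (dividing by 10000 and then by 12), and joal's 25 h comes from rounding 10000/12 up to 1000; evaluated exactly, the replica on an12 is missing roughly 30 hours' worth of messages for that partition:

max_lag_msgs = 90_060_226          # worst-case replica max lag quoted for an12
per_partition_rate = 10_000 / 12   # ~833 msgs/sec for one webrequest_text partition

print(max_lag_msgs / per_partition_rate / 3600)   # ~30.0 hours behind
print(max_lag_msgs / 10_000 / 12)                 # ~750.5 s, the missing-paren version
print(max_lag_msgs / 1_000 / 3600)                # ~25 h, joal's 10000/12 ~ 1000 shortcut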
[21:44:14] obvisouly, the partitions that takes longer are theo nes with 12 and 21, which are more behing in max lag [21:44:22] yeah [21:44:24] HUURAYYY ! [21:44:51] Big downslope in messages per second in though [21:45:06] cool the mobile one is there, now just one 18 leader partition to get areplica [21:45:42] one for tect, one for uploads [21:46:04] uploads i'm less worried about [21:46:09] ok ottomata [21:47:12] hey guys, I just caught up [21:47:26] Hi milimetric :) [21:48:22] joal: where do you see the downturn in messages per second, the kafkatest dashboard seems ok [21:49:08] i'd expect a little downturn due to traffic slowing [21:49:17] Looking at kafkatest, 24h span - last hour is getting down [21:50:06] Just changed the dashboard (charts order) [21:50:09] you can reload [21:50:27] We loose more messages than varnishkafka drops it seems [21:51:01] Too difficult to say [21:51:12] maybe not [21:52:55] mmm, yeah, maybe a little sharper than vkafka [21:53:04] hard to say though, yea [21:54:05] joal: what's that little 10 minute dip in all messages in around 21:08? [21:54:12] (all brokers have it) [21:54:34] Don't know milimetric [21:54:52] did something happen 44 minutes ago? [21:55:30] hm, nothing more than camus I think [21:56:04] and camus had been running for a while at that point, no? [21:56:09] yup [21:56:38] camus finsihed about that long ago [21:57:00] the drop happens a while before I think [21:57:42] Maybe not actually [21:58:13] Weird ... Why camus stoppping would lead to less messages being received ? [21:58:40] yeah, weird [21:58:44] ok, so how's 18 looking now? [21:58:54] stil behind on text [21:59:06] 1 partition [21:59:11] when that's caught up you're gonna restart it? [21:59:21] That's hte plan [21:59:24] cool [22:00:07] And actually, 18 is not behind, 12 and 21 are, and we are waiting for them to replicate over 18 :) [22:00:32] so they can be proper leaders when 18 leaves, that's what I thought you were doing [22:00:45] Correct [22:06:03] unfortunetly we are waiting for 21 and 12 to replicate, and they have more partitions in general [22:07:43] ottomata: Upload is good, only text left for 18 [22:07:54] ottomata: some reading along the way : http://siliconangle.com/blog/2015/08/11/etsy-going-all-in-with-kafka-as-dataflow-pipeline-hpbigdata15/ [22:09:06] Also: http://mahout.apache.org/ [22:09:57] joal: we have mahout installed [22:10:02] ellery has used it [22:10:10] I have seen that [22:10:17] Now mahout runs on Spark :) [22:11:56] heh, people love their vertica at Etsy, interesting [22:13:42] milimetric: Having used it a bit, vertica is quite impressive [22:13:50] milimetric: But far too expensive [22:14:30] i mean more interesting from the analyst interface point of view. I was convinced but am becoming certain that SQL is never going to die [22:14:57] I can't give you wormg :) [22:14:58] Analytics-EventLogging, Need-volunteer: Add sanitized User-Agent to default fields logged by EventLogging - https://phabricator.wikimedia.org/T54295#1529669 (Tgr) This happened a while ago, except the "sanitized" part (I think). Can this task be closed, or should it be refocused on the sanitization part? [22:16:20] cmooooon last partition [22:16:26] dunnOoooo [22:16:34] :D [22:17:07] i want to say that i've been following this whole thing on irc and it's super interesting (although super stressful for yougaiz) [22:17:21] :) [22:17:34] ok well, max lag is decreasing on all brokers i think. so, maybe we should just leave it until morning? [22:17:36] without camus runnign? 
[22:17:37] :hug: madhuvishy :hug: [22:18:34] * madhuvishy sends hugs to everyone :) [22:19:06] :) [22:19:22] ottomata: I see nothing wrong with that except camus is not running [22:19:27] what would happen if we enabled that? [22:19:57] i think max lag would not go down :) [22:20:16] I think you are right ottomata [22:20:41] Is there a way to manually change the assigned partitions ? [22:21:01] I think we could relieve a bit 1012 by giving a few partitions to others [22:21:07] Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529694 (Tgr) NEW [22:21:24] or maybe just preventing it to replicate [22:21:32] so how much data could possibly back up before camus can no longer catch up if we keep it off for a day? [22:21:35] btw, renaming main dashboard to kafka [22:21:36] http://grafana.wikimedia.org/#/dashboard/db/kafka [22:21:38] i'll brb [22:21:49] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529703 (madhuvishy) [22:22:04] ottomata: This dashboard is really good :) [22:22:20] milimetric: i don't know, but i think if kafka is normal, then camus should be able to catch up [22:22:24] i think it can read faster than we produce [22:23:06] joal, i think not, because the only way to move the partitions is to have them replicate elsewhere. [22:23:08] ottomata: I think you are right, bit it will be messy day on the cluster :) [22:23:17] yeah, true. [22:23:27] ok understood [22:23:27] hadoop will be backed up with lots of jobs milimetric [22:23:33] and we'll have to babysit them [22:24:19] ottomata: i'm happy to help with that if you decide to go that route [22:24:49] thanks! [22:25:02] which is cooler?! light or dark theme? [22:25:11] i think dark might be hard to read... [22:25:17] but looks cooler for sure! [22:25:24] agreed for both ! [22:25:30] i like the light one [22:28:07] ok, well, hm. [22:28:12] so we wait until morning then, joal milimetric? [22:28:26] if so, i guess we should send an email to analytics list saying data will be late and lossy? :( [22:39:39] back [22:39:51] yes, wait for morning is ok if we're fairly confident about camus [22:39:59] i guess it made it overnight with this config and with a worse an22 [22:40:07] so it should be fine now too [22:40:31] Ok let's go for that [22:40:48] it's not a great choice - lose data now vs. lose possibly less possibly more data later [22:40:53] I think the odds are in our favor if we wait [22:41:02] I would have liked to be able to restart the brokers sooner, but waiting is good as well :) [22:41:33] so if the partitions all catch up, we lose less data if we wait [22:41:45] if they don't catch up, we lose more data because less gets consumed into camus [22:41:52] correct [22:41:53] by camus, sorry [22:42:07] and as far as we can tell right now, partitions are on their way to getting more replicas caught up [22:42:11] so ... we wait, makes sense [22:42:17] so i found the individual partition lag [22:42:21] sounds good [22:42:27] I am gonna get some sleep :) [22:42:44] it is kinda hard to say, but i think 18 will be ready for a restart in a couple of hours [22:42:59] joal: ok, thanks or your help, sleep well [22:43:00] ttyt [22:43:05] i will check on this tonight [22:43:06] BYe team ! [22:43:14] joal, good nite! 
[22:43:46] i'll be around ottomata if you wanna double check before you restart [22:44:13] milimetric: i have a question on the endpoints [22:44:32] https://github.com/milimetric/restbase/blob/test_projectview/mods/pageviews.js#L309 [22:44:51] is this the part that specifies what endpoints we expose? [22:45:19] oh, it just says what resource to respond with [22:45:41] the yaml file defines end points [22:46:17] hmmm, okay then, we don't want to expose the insert fake data endpoints, correct? [22:46:21] madhuvishy: that more defines what resources this module needs in order to work [22:46:32] it's declaring what tables restbase should create for it [22:46:43] milimetric: yup got it [22:46:50] the pageview.yaml defines the endpoints made available [22:47:06] and both of them together I think define the module (what's available and how to respond) [22:47:14] right [22:47:38] madhuvishy: I'm assuming you wanted to change those insert endpoints to insert fake data, right? [22:47:44] if we dont want to expose insert endpoints, can we directly call the insert methods? [22:47:50] we could expose them via pageviews.yaml [22:47:56] and just not expose them via analytics.yaml [22:48:01] but then expose them via test.yaml [22:48:07] milimetric: hmmm i am a little lost [22:48:16] sure, so there are levels of configuration [22:48:36] the top level is defining the modules that get loaded. Let's take test runs as an example [22:48:43] so that'll be ./config.test.yaml [22:48:49] right [22:48:55] that then uses ./specs/test.yaml [22:48:58] yup [22:49:11] and that points to ./mods/pageviews.yaml [22:49:28] so we can leave the insert endpoints down that config path [22:49:55] but not in the analytics one [22:50:01] now, we want to add ./specs/some-folder/analytics.yaml [22:50:13] (because it doesn't make sense in mediawiki i don't think) [22:50:21] right [22:50:22] but yea, in that one we wouldn't include the insert endpoints, right [22:50:33] i see there's an analytics.yaml file existing [22:50:43] right, we made it just for testing, you can ignore I think [22:50:48] okay makes sense. [22:50:58] our analytics.yaml would only configure the pageviews module [22:51:07] basically a copy of the section about pageviews from specs/test.yaml [22:51:16] plus all the other meta config stuff [22:51:18] right [22:51:30] less lost? [22:51:38] so what i've to do first is change the tests so they actually do an insert and check for the data being inserted [22:51:51] right [22:52:05] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529774 (Deskana) As a stakeholder of the EventLogging service provided by Analytics, I request that they decline this task. By definition any co... [22:52:09] and that i can do by just doing get requests to the test insertion endpoints [22:52:14] team, I'll too sign off. see you tomorrow! [22:52:18] madhuvishy: yea [22:52:20] nite mforns [22:52:23] good night mforns [22:52:27] good night! [22:52:33] okay now it's clearer [22:52:36] thanks milimetric [22:52:55] np, hope it works the way I understand things :) [22:53:07] lemme know if not [22:53:12] milimetric: ha ha okay :) [23:22:00] Analytics-Backlog, Analytics-EventLogging, Privacy: Opt-out from logging some of the default EventLogging fields - https://phabricator.wikimedia.org/T108757#1529860 (Krenair) I think @tgr is talking about per-schema opt-out in the software, rather than a user choice. If I understood him correctly, I su...