[06:28:43] morningggg [06:29:15] so the new retention.bytes for kafka big partitions works like a charm, and I am happy, but varnishkafka is no [06:29:18] not [06:29:19] :@ [06:29:27] Quarry: Queries running for more than 4 hours and not killed - https://phabricator.wikimedia.org/T137517#2370649 (Dvorapa) [06:30:04] Quarry: err - https://phabricator.wikimedia.org/T137518#2370652 (Dvorapa) [06:30:19] Quarry: err - https://phabricator.wikimedia.org/T137518#2370664 (Dvorapa) Open>Invalid [06:32:04] Quarry: Queries running for more than 4 hours and not killed - https://phabricator.wikimedia.org/T137517#2370670 (Dvorapa) [07:34:44] Quarry: Queries running for more than 4 hours and not killed - https://phabricator.wikimedia.org/T137517#2370631 (Krenair) It doesn't appear to be running anything at the moment: ```root@quarry-main-01:~# mysql -h enwiki.labsdb -u u2029 -p enwiki_p -e "show processlist" Enter password: +----------+-------+-... [07:36:14] indeed elukey, still some error warnings [07:41:10] joal: if we check only from XX:01 to XX:59? Not sure if you tried [07:41:29] for a given hour of course [07:41:41] elukey: I actually realised that it's not a valid check (after having tried, abviously) [07:42:23] elukey: Events are mis-aligned on sequence/timestamp scales - this wouldn't change even if we change boundaries :) [07:44:02] mmmm [07:47:03] ah because the holes will move on different boundaries [07:47:10] ah yes [07:47:15] :/ [07:47:41] elukey: indeed, misalignment applies whatever the boundaries :) [07:48:12] By sideeffects, you might find for one sample a boundary that reduce teh misalignement, but it's only a side effect and only works for on sample [07:50:31] yeah... [07:50:51] joal: as FYI, https://www.varnish-cache.org/docs/trunk/reference/vsl.html - The Timestamp section is the one that we care about [07:51:45] so the %{format}t timestamp in vk uses "Start" by default [07:51:58] and if you put "end:" it uses "Resp" [07:53:02] but between Resp and the seq number generation vk has to process all the req tags [07:53:07] to compose the log [07:53:33] so the next step that Andrew proposed was to add something like "seq_timestamp" [07:54:03] generated by vk and not read from Varnish tags [07:54:09] elukey: What about using Process? [07:54:39] I was thinking the same but I feel that we are trying to bend a problem to resolve another one [07:55:42] I mean, imho we should use Start for our data, and then seq_timestamp for our consistency check [07:55:43] hm [07:56:20] elukey: I think having different timestamps is very error prone [07:56:28] We should try to stick to one [07:57:14] I don't think it matters a lot if we have start, end, process, or seq_timestamp as a time marker for a row: it'll be so close that it seems neglectable [07:57:26] However for consitency check, it matters [07:57:56] How difficult is it to generate the seq_timestamp? [08:01:31] it shouldn't be super difficult, especially now that I know a bit better how the parsing stuff work [08:01:44] (before that it was hic sunt leones) [08:02:30] the important part is that we are now sure that we are not loosing data :) [08:02:51] elukey: sure, but we have lost our control check :) [08:06:55] joal: yeah it is bad, but if you see all the code reviews and code diffs from vk3 to vk4 (without tests!!) you would be amazed that this is the only issue :P [08:07:23] elukey: I am already very impressed by the changes :) [10:12:16] * elukey lunch! [10:45:30] Quarry: Queries running for more than 4 hours and not killed - https://phabricator.wikimedia.org/T137517#2371169 (Dvorapa) Therefore it looks like there is an error with status of the query. Please see screenshot: {F4149692} [11:43:56] joal: o/ [11:44:19] heya [11:44:29] do you have 5 minutes for my usual data ramblings? [11:44:36] Please please :) [11:44:41] thanksss [11:44:58] so I took a look to the last email from oozie about data loss [11:45:04] and ran my script [11:45:56] so the script is https://phabricator.wikimedia.org/T136314#2334115 [11:46:31] I realized that it is too simple because it gives me also all the hosts with timestamp XX:59 :D [11:46:50] that have of course their next sequence in the next hour [11:47:07] so I used https://phabricator.wikimedia.org/T136314#2334115 anyway to check timestamps [11:47:16] and they are nicely oredered as we expect [11:47:43] a couple of hosts did have timestamps not around XX:59 [11:47:45] so I checked them [11:48:09] eqi stat1004 [11:48:12] oops :) [11:48:15] two of them were legitimately having the next request served in the first second of the next hour [11:48:20] so good [11:48:25] one didn't have the timestamp :/ [11:48:27] buuut [11:48:39] hostname uri_path http_status response_size dt sequence [11:48:42] cp1058.eqiad.wmnet /mediawiki/1.26/mediawiki-1.26.3.tar.gz 200 0 - 1152911 [11:49:01] elukey: I think this means varnish restart [11:49:02] that looks like something happened while flushing data to the client (resp_size 0??) [11:49:14] PLUS [11:49:17] oops, actually no, you're right [11:49:20] I didn't see any email from this morning [11:49:33] till sorry [11:50:35] so there might be the chance that these emails were legitimate stuff not related to the issue? [11:51:12] elukey: I think we need more checks before assuming that [11:51:19] ah sure [11:51:25] that was only a thought :) [11:52:07] I am saying that because I tried to make the change for the sequence timestamp and it is not that easy :P :P [11:53:03] k [11:54:12] Currently trying to check if there is a lot of unordered requests for a given host [12:21:42] elukey: I think I have good news,but let me triple check before [12:35:25] elukey: I got some results [12:35:28] elukey: batcave? [12:37:22] joal: sure! 2 mins that I need to finish a thing with mw hosts [13:49:26] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2371668 (JAllemandou) There still are warnings from webrequest sequence checks. **TL;DR** Luca's code solved the ordering issue, bu... [13:51:54] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2371687 (JAllemandou) Proposed solution: If @elukey confirms that missing timestamps come from error in Varnish (timeout, connection... [13:53:59] elukey: please review my comments --^ [13:55:39] sure! [14:02:01] wow huh joal reading! [14:02:19] joal: just to be sure I understand: some records dont' have dts [14:02:26] correct sir [14:02:26] but they do have sequences [14:02:40] and they are getting put in an hour, basically out of sequence [14:02:43] ja? [14:02:48] and that is throwing of calcs [14:02:56] you have it [14:02:57] i'm pretty sure that camus defaults to current timestamp [14:02:59] if it can't find one [14:03:07] ok, then it makes sense :) [14:03:10] ja [14:03:11] hm [14:03:22] can we reject these? why do we have these rows anyway? [14:03:38] if an error is read at the beginning of hour, it might have occur at the previous hour, and is flagged with current [14:03:59] aye, but what are these? why is vk sending these at all? [14:04:39] elukey is onto this ottomata [14:05:53] :) [14:06:04] o/ [14:06:13] ? /socket.io/ [14:06:19] ottomata: But at least now the problem is less of a real one ;) [14:06:22] the one you pointed out in the ticket is etherpad socket io? [14:06:22] heh [14:06:45] that sounds like a real request! [14:08:55] one of the errors was [14:08:58] ottomata: the requests we have seen look real, but they might ahve ended in error [14:09:03] hostname uri_path http_status response_size dt sequence [14:09:10] cp1058.eqiad.wmnet /mediawiki/1.26/mediawiki-1.26.3.tar.gz 200 0 - 1152911 [14:09:39] hm [14:09:40] I suspect that when Varnish doesn't flush all the response then it doesn't create the SLT_Timestamp Resp [14:09:49] and vk puts the default, that is - [14:10:13] i wonder if there is some pattern of long running responses here [14:10:22] socket.io and a tar.gz file [14:10:35] allthough 200 is not a big response size [14:10:37] oh [14:10:38] sorry [14:10:39] that is http status [14:10:46] response size 0? [14:10:54] ottomata: default? [14:11:19] hmmm [14:11:27] maybe there's a special config for varnish and long responses [14:11:32] that tells it to pass through in some way [14:11:45] elukey: what's cache status for that? [14:12:34] hm, empty too i bet [14:13:05] not sure, IIRC joal mentioned pass? [14:13:07] I can check [14:13:15] depends on requests [14:13:51] aye [14:13:57] the etherpad one i'm looking at is empty cache_status [15:32:35] mmmm not able to repro for the moment the issue [15:59:07] a-team: I am dealing with a lot of servers just reimaged in codfw, we'd need to clear the alarms before the weekend.. anything against me skipping the standup and sending a escrum? [15:59:21] no prob for me elukey [16:02:25] ottomata: standup? [16:02:43] a-team: standup? [16:05:25] ahhH! [16:05:59] trying to join [16:06:48] I'm sleeping [16:07:31] ah snap but probably we'd need to talk about Berlin [16:07:39] maybe I'll join quickly [16:37:41] milimetric: lets meet at 3 outside the apartment at Friedschstr and then ottomata will reach around then too? [16:44:12] sure. I'll just ride the subways and sleep randomly until then :) [16:44:32] woah I just got that ping, madhuvishy ^ I have a 7 minute delay?! [16:45:03] milimetric: aah [16:47:22] milimetric: do you like zoos :) http://www.zoo-berlin.de/en/zoo-experience/feedings-and-shows [16:48:29] aw, the lions don't get fed on Monday or Thursday?! [16:48:41] yes, I like zoos [16:48:56] I like getting lost in cities I've never been to though, so I might do that for a while [16:51:43] cool [16:56:01] milimetric: the neighborhood of the apartment is a good one to wander in [16:56:11] also the holocaust museum is pretty amazing for wandering [16:56:25] which i think is close to the apt too [16:56:46] how big does Berlin feel? closer to Philly or New York or..? [16:57:24] from what I saw on the map, flat is 500m from Checkpoint Charlie [16:58:18] wow, close, but joal we're going together right? [16:58:50] milimetric: I think we sure will, but it's not an issue if pass by wandering :) [17:19:23] Analytics-Kanban, Patch-For-Review: Update mediwiki hooks to generate data for new event-bus schemas - https://phabricator.wikimedia.org/T137287#2372389 (Ottomata) Hokay! Implementing `__clone` was not necessary, PHP does the right thing. This works! [17:24:49] hm milimetric not like ny [17:24:53] not really like philly either [17:24:53] hm [17:25:00] but closer to philly than ny i guess [17:25:11] i'd say its maybe more like chicago, but cleaner? [17:25:50] ha milimetric maybe more like zurich :p [17:25:59] but a little greener [17:27:20] cleaner than chicago? not possible! :) [17:27:28] ok, thanks, sounds manageable in size then [17:27:36] with public transit anyway [17:27:43] ja def [17:28:16] ha i lived on western ave on the blue line in chicago, so I don't htink of it being very clean :p [18:13:54] * elukey goes afk! [18:13:57] byeeee o/ [18:14:01] see you in Berlin :) [19:38:22] joal: still around? [23:31:13] Analytics, Operations, Traffic: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2373208 (ori)