[06:55:16] In ~5min I'm going to reconfigure one of eqiad's Transit links, and in ~1h5min I'm going to start upgrading eqiad's routers.
[06:56:12] <_joe_> XioNoX: what's the risk?
[06:56:43] <_joe_> as in - is there risk of local network disruption inside the DC?
[06:56:55] _joe_: no impact expected, we did it several times in the past. But let me know if there is anything ongoing for example
[06:57:13] <_joe_> XioNoX: we're still running stuff in eqiad
[06:57:20] <_joe_> for example, the etcd masters
[06:57:30] <_joe_> but it's ok, we can live with losing them for a few minutes
[06:58:41] _joe_: we did the same upgrades with eqiad being the primary DC so I don't worry much about it
[06:59:11] <_joe_> XioNoX: sure sure I wasn't screaming PLEASE NOOO
[06:59:17] <_joe_> just trying to understand :)
[06:59:37] yep, giving context :)
[07:04:54] I also tend not do trust network engineers
[07:05:01] *to trust
[07:05:07] :D
[07:17:23] as usual there are dumps in eqiad and these are long running jobs now
[07:54:25] the telia work is running later than expected, so the router upgrades will be delayed a bit
[08:22:59] Telia work is done, gonna take a quick break then start the routers upgrade in ~30min
[09:49:43] all kafka jumbo hosts on buster, I think that other kafkas should be able to move anytime
[09:49:59] (one host at a time, preserving the /srv partition)
[09:50:22] nice!
[09:52:11] <_joe_> elukey: have you written down a procedure?
[09:52:53] _joe_: is there a task about the Kafka issues (re: save timing), and why it might be taking a second to respond from eventbus for 1/3rd or more of submissions?
[09:53:31] <_joe_> Krinkle: I'm not aware of it, no, but do you have evidence of eventgate-main being slow?
[09:53:54] (Also, afaik these are post send so can't explain save time increase unless we've broken support for post send itself as a concept somehow)
[09:54:00] _joe_ nope but I have everything in the task, it is a regular reimage (just preserving data to avoid all the brokers having to pull data from other replicas etc.)
[09:55:01] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=eventgate-main says the current p99 for eventgate-main is between 2 and 5 seconds
[09:55:05] _joe_: xhgui profiles for edits where we spend 2 seconds in curl with eventbus from codfw eventbus
[09:55:16] Codfw mwdebug*
[09:55:51] <_joe_> Krinkle: ok now I see something here
[09:55:53] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=eventgate-main&from=1598875223768&to=1598980853443
[09:56:03] wall time but not cpu time, suggests I/O
[09:56:05] <_joe_> you can see the spike since we've moved to codfw
[09:56:15] <_joe_> so ok, we need a task
[09:56:40] <_joe_> Krinkle: actually, this would square pretty well with the increase in save timing
[09:58:22] yeah. Save timing is up 100ms at the p10, 1s at median, 2-3s or more at p75.
[09:58:38] But that means post send is also broken?
[09:58:48] <_joe_> not broken, just slower
[09:59:07] I mean, these metrics are only measuring ttfb
[09:59:16] Does not and cannot capture post send
[09:59:26] Which suggests we're not actually doing post send
[10:00:09] <_joe_> Krinkle: I guess instead we're sending some events before post-send
[10:00:14] <_joe_> maybe the purges?
[10:00:43] I couldn't find any in my profile that were pre send
[10:00:55] <_joe_> what was the task for the save timing?
[10:01:18] But that was based only on time outliers, maybe I captured one where the presend is quick
[10:01:34] Will take a close look later today
[10:02:35] _joe_: https://phabricator.wikimedia.org/T261763
[10:03:11] Most details are in back scroll here in -sre, will summarise as well
[10:03:45] <_joe_> yeah I'm opening this task, let's see how things go
[10:04:59] Also - useless nag now, but, if this is what it seems, that's a good one to have alerts for which would have fired Monday shortly after the switch
[10:08:38] <_joe_> https://phabricator.wikimedia.org/T261846
[10:08:50] <_joe_> elukey: ^^ I did set it to UBN!
[10:09:09] <_joe_> I think we can wait for andrew to take a look
[10:09:18] <_joe_> but it should be very high on your priority list
[10:10:12] <_joe_> this doesn't look good https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=41&orgId=1&from=now-2d&to=now&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All
[10:12:03] it doesn't indeed, 600ms to kafka2003 is a lot
[10:12:17] <_joe_> yep
[10:12:45] <_joe_> I thought it was a matter of underprovisioning eventgate-main
[10:13:17] <_joe_> but that's not the case
[10:14:00] https://grafana.wikimedia.org/d/000000027/kafka?viewPanel=19&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&from=now-2d&to=now is interesting
[10:14:24] 2003 is working way more than others, surely due to a consumer that pulls a lot of data from it
[10:16:01] <_joe_> elukey: I have one hypothesis
[10:16:06] <_joe_> let me test it
[10:16:27] <_joe_> my idea is that the issue is we're funneling all the purges into a single unpartitioned topic
[10:16:41] <_joe_> while before they were spread across 2 datacenters
[10:17:04] there is a dashboard for topic volumes
[10:17:36] <_joe_> do you have it handy?
[10:17:51] <_joe_> anyways my idea was to repool eventgate-main in eqiad
[10:18:02] <_joe_> and to move restbase-async to eqiad
[10:18:13] <_joe_> that's what generates the purges from restbase
[10:18:28] I was checking https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-2d&to=now&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.mediawiki.job.cdnPurge
[10:18:34] but not sure if it is the right one
[10:18:38] <_joe_> this is ofc just a way to stop the bleeding and confirm that's the problem
[10:19:59] <_joe_> elukey: nope, this one
[10:20:01] <_joe_> https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-2d&to=now&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.resource-purge
[10:20:08] <_joe_> and guess where that is?
[10:20:31] <_joe_> akosiaris: can I get a +1 to my plan above to test this hypothesis?
[10:20:35] bingo
[10:20:43] the out bw also matches afaics
[10:20:45] <_joe_> the plan is to temporarily failover restbase-async to eqiad
[10:21:04] <_joe_> and also to repool eventgate-main in eqiad
[10:21:12] <_joe_> that would move the rb purges to eqiad
[10:21:23] <_joe_> while we partition the topic appropriately :)
[10:21:36] it can be done with one command IIRC
[10:21:47] <_joe_> jayme and everyone else too
[10:22:00] <_joe_> elukey: I'd wait to have confirmation of the issue, then discuss with andrew and petr
[10:22:17] <_joe_> I think it's ok if we stay a few hours with something active in eqiad
[10:23:05] so in eqiad we have one partition as well
[10:23:05] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-2d&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.resource-purge
[10:23:28] <_joe_> yes
[10:23:34] * jayme looking
[10:23:47] <_joe_> but the purges are usually distributed across datacenters
[10:23:57] <_joe_> as restbase purges are always in codfw
[10:24:04] <_joe_> and the mw purges follow mediawiki
[10:28:36] I feel like I don't know enough about kafka for a "useful" +1 to your idea _joe_ - but from my limited understanding it sounds fine to re-distribute the purges as verification
[10:29:03] <_joe_> ok
[10:29:16] <_joe_> I'll wait another couple minutes and execute
[10:30:10] yep it seems good to me too
[10:32:14] <_joe_> ok, proceeding
[10:35:30] <_joe_> done
[10:39:53] <_joe_> Krinkle: if my hypothesis is correct, you should see the backend save timing go down in the next ~ 20 minutes or so
[10:40:21] ack
[10:41:46] purges are flowing to eqiad, nice
[10:41:57] <_joe_> yes
[10:42:08] <_joe_> and I suspect we already needed to partition the topic before
[10:43:25] yep it seems a good candidate for 3 partitions
[10:44:09] <_joe_> yep
[10:44:19] <_joe_> we said so months ago, then it went nowhere
[10:44:26] <_joe_> ETOOMUCHSTUFF
[10:45:12] oook running errand for a while, will read later!
[10:52:05] <_joe_> doesn't look like it solved the issue for now, but I suspect we had quite a backlog of stuff in mediawiki at this point :/
[10:54:02] _joe_: that would be pending purges then I guess?
[10:55:11] How does that interfere with the save timing? I would assume a purge event should be sent *after* the write had happened
[10:56:53] <_joe_> yes but my hypothesis is it gets counted as mediawiki submits it to eventgate
[11:00:01] I see
[11:00:16] <_joe_> so something is not adding up
[11:01:11] kafka messages by topic are expected to go down now, no?
[11:01:25] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-2d&to=now&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.resource-purge
[11:01:42] <_joe_> that is, I think, an artifact of the fact we were maxing out on purge rate
[11:01:49] <_joe_> but at the same time I don't understand this
[11:02:24] <_joe_> https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=54&orgId=1&from=now-1h&to=now&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=codfw_resource-purge&var-kafka_broker=All&var-kafka_producer_type=All
[11:03:07] <_joe_> maybe the topic is hardcoded to be codfw.resource_purge somewhere?
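(A rough sketch of the "one command" traffic moves discussed above, for readers following along. DNS discovery pooling is normally driven through confctl; the object names and selector syntax below are assumptions for illustration, not the exact commands that were run.)

    # move restbase-async (usually active in a single site) and repool eventgate-main in eqiad
    confctl --object-type discovery select 'dnsdisc=restbase-async,name=eqiad' set/pooled=true
    confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' set/pooled=false
    confctl --object-type discovery select 'dnsdisc=eventgate-main,name=eqiad' set/pooled=true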
[11:03:22] <_joe_> now the messages go to kafka in eqiad
[11:05:54] <_joe_> or most probably there is some error in grafana dashboards
[11:08:26] <_joe_> https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&from=now-1h&to=now this confirms the view we have from kafka
[11:10:11] <_joe_> let's see what happens in the next few hours in codfw, then when otto and petr are awake, we can proceed with the task at hand
[11:11:40] all done, lunch time, ping me if there are any issues
[11:12:48] ah..okay - that's the "consuming-side" then
[11:13:06] <_joe_> uhm I still see a lot of restbase messages in the codfw.resource_purge topic
[11:13:15] _joe_: there are at times steep drops in save timing as if the issue never existed. For 40min or so at a time. Not sure if that helps :)
[11:13:32] <_joe_> Krinkle: not completely :P
[11:14:01] Today around 7 UTC for example
[11:14:02] k
[11:14:18] <_joe_> Krinkle: the save timing has gone down corresponding to my change
[11:14:35] <_joe_> see the p75 after 10:35
[11:16:25] <_joe_> so I think we're on the right path
[11:16:35] <_joe_> but I'll revisit after lunch
[11:16:50] <_joe_> anyone: call me on the phone if anything needs attention
[11:17:58] <_joe_> purges go out correctly btw
[12:37:08] I checked the logs on kafka-main2003 and I see a lot of recurrent
[12:37:09] [2020-09-02 12:36:44,920] WARN Attempting to send response via channel for which there is no open connection, connection id 10.192.32.136:9093-10.192.64.98:36916-274299 (kafka.network.Processor)
[12:37:20] that IIRC I've never seen
[12:40:39] seems it's been there since ages ago, weird
[12:43:01] and the other end always seems to be a k8s pod
[12:43:41] and it is only on the 2003 broker
[12:43:59] <_joe_> akosiaris: ^^
[12:44:13] <_joe_> that's probably eventgate-main farting
[12:44:25] ?
[12:44:54] lemme see which pod that is
[12:45:07] <_joe_> we just had a spike of latencies btw elukey
[12:45:57] I didn't do anything, I was checking topics :D
[12:46:00] <_joe_> I still need to have a proper lunch break
[12:46:15] it's eventstreams elukey, not eventgate
[12:46:24] ahhhh interesting
[12:46:50] <_joe_> uhm
[12:47:33] I am tailing the logs and if I filter those warnings I don't see anything on 2003
[12:49:49] another thing that could be worth trying is to force a rebalance of topic partition leaders via "kafka preferred-replica-election"
[12:50:19] I would kill that specific pod
[12:50:22] see if that helps
[12:50:26] I could*
[12:50:32] kafka usually does a good job of doing it asynchronously, but sometimes we have to force a manual one
[12:51:13] I see some logs filtering the warnings on 2003, it took a bit but something is moving
[12:53:19] akosiaris: I checked in kafka main eqiad, I see those logs in there too, so it is probably a red herring
[12:53:32] something to check later on if we have time
[12:54:08] elukey: IIRC andrew had to change a bit how eventstreams connections were working because of a memory leak
[12:54:35] o/
[12:54:45] not sure if they end up being closed or not, but it looks like kafka is trying to send to a pod that has already closed the connection
[12:54:48] ottomata: hi!
[12:54:51] trying to follow backscroll and ticket
[12:55:14] this eventstreams thing you are talking about is separate from the resource_purge partitioning stuff also talked about above?
[12:55:15] or same?
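(For the "lemme see which pod that is" step above: one way to map the peer IP from the broker warning back to a pod is to list pods with their IPs. A minimal sketch, assuming kubectl access to the relevant services cluster; it was identified another way in the log, so this is only illustrative.)

    # find which pod owns the client IP seen in the Kafka network warning
    kubectl get pods --all-namespaces -o wide | grep 10.192.64.98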
[12:55:33] * akosiaris not sure
[12:55:43] I've picked this up at
[12:55:46] (03:37:08 μμ) elukey: [2020-09-02 12:36:44,920] WARN Attempting to send response via channel for which there is no open connection, connection id 10.192.32.136:9093-10.192.64.98:36916-274299 (kafka.network.Processor)
[12:56:08] it could very well be unrelated and a red herring
[12:56:18] hm, so eventstreams should be good at closing kafka connections, but maybe it isn't?
[12:56:28] eventstreams http client connections are timed out every 15 minutes
[12:56:36] so there are regular interruptions.
[12:56:38] clients should reconnect
[12:56:58] but the particular eventstreams instance (and kafka consumer) that was serving the disconnected client should be disconnected and discarded
[12:57:05] perhaps it isn't always being disconnected properly?
[12:57:45] <_joe_> let's focus on the purges stuff
[12:57:50] but, ya that seems probably unrelated to the resource_purge partitioning; eventstreams is active-active usually,
[12:57:50] yeah
[12:57:52] <_joe_> which is creating a serious issue in production
[12:58:28] _joe_: i'm pretty sure we can just create the partitions, even without depooling anything. i think we asked petr if there would be a problem on the consumer side and he said no
[12:58:51] yes sorry I reported the warning since it was only on 2003 and it seemed off, but it is also present in eqiad so not related
[12:58:53] producers won't care, they'll just start roundrobining or randoming (i forget which) to choose the partition to produce to
[12:59:00] consumers hopefully should auto rebalance
[12:59:37] ottomata: is it worth checking if anything changes with a preferred replica election? Probably nothing but I'd try it before the increase in partitions
[13:00:04] that wouldn't hurt anything, but i think we can tell if it would change anything
[13:01:04] if some leaders get shifted to 2001/2002 decreasing the load on 2003 it could give some relief, in theory
[13:01:07] https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&from=1598965262746&to=1599051662746
[13:01:15] right, but if there is only one partition
[13:01:18] for this high volume topic
[13:01:27] ALL of the load would be shifted to e.g. 2001
[13:01:58] yes ok I agree, but I see a similar load in eqiad before the switchover, and that didn't cause high latencies
[13:02:02] also
[13:02:03] Topic: codfw.resource-purge Partition: 0 Leader: 2003 Replicas: 2003,2001,2002 Isr: 2003,2002,2001
[13:02:12] so it is already on its preferred leader
[13:02:18] hm, that's true elukey
[13:02:29] well
[13:02:34] elukey: the preferred leader is kinda random
[13:02:47] and doesn't know anything about topic volume
[13:02:59] it's possible in codfw there are more high volume preferred leaders on 2003
[13:03:02] <_joe_> sorry I need to step afk for ~ 1 hour
[13:03:24] ottomata: yes that is what I am thinking, maybe not really related but it is an easy command to test
[13:03:25] _joe_: k when you come back I can help add partitions if you like
[13:03:31] elukey: it can't hurt
[13:03:41] I just had the wonderful experience of my whole building losing power.
[13:03:45] let's save the output of kafka topics --describe before and after
[13:03:56] akosiaris: ok if I proceed?
[13:03:57] so we can compare in case something actually changes
[13:03:59] ottomata: sure
[13:04:13] elukey: fine by me
[13:04:17] all right
[13:04:47] Turns out two 63A fuses had tripped. Switched everything back on. If I strangely fall offline for extended periods, I might have died in an electrical fire. Or something broke more seriously. Elecchicken has been notified.
[13:05:14] ⚡🐔
[13:05:30] done
[13:09:20] nothing changed, traffic volumes are mostly the same, so no help
[13:09:22] klausman: i don't know what elecchicken is, but maybe I live with it?
[13:09:31] i cannot run my microwave and my toaster at the same time.
[13:09:58] elukey: aye
[13:11:02] it is weird that even the hardware seems the same (at least, vcores and ram size) in eqiad and codfw
[13:11:10] ottomata: elec-chicken is a mock version of electrician
[13:11:45] haha oh i get it
[13:11:50] It's one of the words one of my fave Youtubers (AvE, aka Chris aka The Emperor of Dirt) uses.
[13:12:27] haha
[13:12:39] Along with Canaderp, chooch-factor (how powerful something is), "skookum as frig", etc etc
[13:19:51] ottomata: to alter the partitions, we'd need something like: kafka topics --alter --topic codfw.resource-purge --partitions 3 ?
[13:22:15] ya how many brokers do we have? 3?
[13:22:16] 3 is good
[13:22:27] ya 3
[13:22:33] we should do the same to eqiad
[13:22:36] and do it in both DCs
[13:23:53] and also in jumbo too i think
[13:23:55] just to keep the partitions the same
[13:24:06] yep
[13:24:15] weird that we didn't see the same latency in eqiad
[13:32:43] oh elukey we can just check this graph to see if election will have any effect
[13:32:43] https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&from=1598967146773&to=1599053546773
[13:33:42] which one is it? I get the main page of the kafka dashboard :(
[13:34:00] oh
[13:34:11] https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&from=1599010448162&to=1599053648162&viewPanel=52
[13:34:14] does that work?
[13:34:25] ah elukey i bet this is why:
[13:34:29] yep it does now
[13:34:33] diff in eqiad vs codfw
[13:34:45] Topic: codfw.resource-purge Partition: 0 Leader: 2003
[13:34:45] Topic: codfw.resource_change Partition: 0 Leader: 2003
[13:35:27] sure sure, I was almost sure it didn't do much but I wanted to try before the alter
[13:35:44] it puzzles me that there is this difference, it feels like something is mising
[13:35:47] *missing
[13:35:58] Topic: eqiad.resource-purge Partition: 0 Leader: 1001
[13:35:58] Topic: eqiad.resource_change Partition: 0 Leader: 1002
[13:35:58] but adding more partitions for that topic is good anyway
[13:36:13] elukey: i think in codfw 2003 is just leader for more high volume topics
[13:36:33] so the volume load is not balanced, even if the preferred leadership of # partitions per node is
[13:36:42] resource_change is also very big
[13:37:19] oh 2003 is also leader for
[13:37:20] Topic: eqiad.resource_change Partition: 0 Leader: 2003
[13:37:28] so it's getting ALL of codfw and eqiad resource_change
[13:37:49] hm i guess that doesn't matter, since only one DC is producing to resource_change or resource_purge at a time?
[13:38:04] so all (most?) of the messages for that stream will be in only one of the two eqiad or codfw topics
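(The leader-election experiment above, spelled out as a sketch using the same `kafka` wrapper commands seen in this log, run on one of the kafka-main brokers; the output paths are just examples.)

    # snapshot partition leadership, trigger a preferred-replica election, and diff the result
    kafka topics --describe > /tmp/topics-before.txt
    kafka preferred-replica-election
    kafka topics --describe > /tmp/topics-after.txt
    diff /tmp/topics-before.txt /tmp/topics-after.txt   # here: no change, 2003 was already the preferred leader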
[13:38:50] if those topics are also high volume and single partitioned it might be worth altering them as well
[13:38:58] one topic at a time of course :)
[13:40:54] ya
[13:41:56] ottomata akosiaris shall we proceed with the first topic alter?
[13:42:33] huh honestly their volume is not that high. around 1K per second or less.
[13:42:43] partitioning is still good and will probably help
[13:42:44] hm
[13:43:16] there are other topics too we might want to consider
[13:43:21] codfw.change-prop.transcludes.resource-change-0
[13:43:33] actually maybe just those 3
[13:43:47] resource_change, resource-purge, and change-prop.transcludes.resource-change
[13:45:16] resource-purge seems to have a big volume in bytes out
[13:46:10] elukey: I'll defer to ottomata and you about any kafka stuff. I am definitely way less proficient
[13:47:12] elukey: if it is not 100% urgent urgent let's wait for jo e and/or petr to be around
[13:47:32] pretty sure all will be fine, but if i dunno, cp or whatever consumes resource-purge doesn't rebalance properly we'll start missing purges
[13:48:01] ack ak
[13:48:22] some things are back in eqiad right now i think yes, so the problem can wait a little bit?
[13:50:26] yep I think it can
[14:03:59] <_joe_> yes, but we're still under stress, and we need to free up eqiad
[14:06:14] <_joe_> ottomata: every consumer is its own consumer group btw
[14:06:52] ok
[14:07:08] good to know, each consumer will still need to notice the new partitions and start consuming
[14:07:10] that should just happen
[14:07:20] <_joe_> worst case scenario we just restart purged
[14:07:21] _joe_: what kafka client (language?) is this consumer?
[14:07:23] aye
[14:07:24] is purged?
[14:07:37] golang
[14:07:38] <_joe_> golang
[14:07:41] should be go, using a librdkafka client
[14:07:47] nice i know nothing about it, hopefully it's great! :p
[14:07:48] i'm sure it is
[14:07:50] <_joe_> and it uses librdkafka, yes
[14:07:52] cool
[14:07:56] * elukey cdanis is always first
[14:08:08] <3
[14:08:11] <3
[14:08:16] ok, if yall are ready then let's go, elukey you want to push the buttons?
[14:08:43] sure!
[14:09:31] <_joe_> did we understand why this was so much better in eqiad?
[14:10:02] no this is the part that still puzzles me
[14:10:24] _joe_: my guess is that there are more high volume topic partitions on a single broker in codfw 2003
[14:10:39] 2003 is leader for both codfw.resource-purge and codfw.resource_change
[14:10:41] <_joe_> can we force-rebalance them?
[14:10:48] it's the preferred leader
[14:10:56] you can change that yes, but it isn't automatic
[14:11:09] kafka-main2003 doesn't exactly look saturated on any resource, fwiw
[14:11:13] yeah
[14:11:26] these volumes aren't really that high overall though
[14:11:27] hmmm
[14:11:38] it could be that there are so many consumers of resource-purge?
[14:11:38] cdanis: it is serving a big chunk of traffic (kafka bytes out) though
[14:11:41] ~10% CPU used, network used is about 37MByte/sec =~ 300Mbit
[14:12:35] changing preferred leaders is the same process as moving partitions
[14:12:36] https://kafka.apache.org/documentation.html#basic_ops_automigrate
[14:12:56] except the partition layout wouldn't actually move data, it would just reorder the partition list so that a different broker is first
[14:13:14] it's kinda annoying though, have to generate and then edit some json, submit it to kafka, wait and let it do its thing
[14:13:42] _joe_: i'm only catching up here on the actual problem, what is the symptom we are fixing? high latency produce requests to eventgate from MW?
[14:13:52] <_joe_> yes
[14:13:58] <_joe_> huge latency compared to eqiad
[14:14:20] _joe_: since there was a lot of scrollback -- the produce latency is what we think is causing the huge regression in edit save time?
[14:14:28] cdanis: so from https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=41&orgId=1&from=now-2d&to=now&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All (where everything started) it seems that eventgate's producer latency to 2003 is higher than
[14:14:34] what it should be, and the host started to serve a lot more traffic right after the switchover
[14:14:46] <_joe_> cdanis: yes, and we kinda-have proof
[14:15:00] <_joe_> after I moved back restbase-async, things got quite better on both fronts
[14:16:15] wow that's crazy
[14:16:16] <_joe_> cdanis: https://phabricator.wikimedia.org/T261846
[14:16:57] well, adding the partitions won't hurt, it's something we wanted to do anyway
[14:16:59] so it's worth a try
[14:18:03] so I have a dumb question about Kafka -- additional partitions are pretty lightweight, right? is there ever a reason to limit a topic to just one partition?
[14:18:15] elukey: _joe_ ok to do that? elukey we'll increase partitions to 3 for eqiad|codfw.resource-purge in all kafka clusters, ya?
[14:18:27] cdanis: ya they are lightweight
[14:18:30] the main reason to have one
[14:18:32] is strict ordering
[14:18:38] ah
[14:18:41] messages are only strictly ordered within a single partition
[14:18:54] do we have services that rely on strict ordering?
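(For reference, the "generate and then edit some json" approach described above for changing a preferred leader without moving data would look roughly like the sketch below. The reassignment JSON format is the upstream one for kafka-reassign-partitions.sh; whether the local `kafka` wrapper exposes it as `kafka reassign-partitions` is an assumption, and nothing like this was actually run here.)

    # reassign.json: same replica set, reordered so 2001 (not 2003) is listed first and
    # therefore becomes the preferred leader of codfw.resource-purge partition 0:
    #   {"version":1,"partitions":[
    #     {"topic":"codfw.resource-purge","partition":0,"replicas":[2001,2003,2002]}
    #   ]}
    kafka reassign-partitions --reassignment-json-file reassign.json --execute
    kafka reassign-partitions --reassignment-json-file reassign.json --verify
    # a preferred-replica election afterwards moves leadership to the new first replica
    kafka preferred-replica-election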
[14:18:59] <_joe_> which isn't a problem for purges
[14:19:20] i think there are some
[14:19:24] can't remember off the top of my head ho
[14:19:25] tho
[14:19:33] in some newer version of kafka
[14:19:46] they have implemented (or are implementing) reads across all replicas
[14:19:51] from my perspective I'd rather have the default be the more-scalable way, and have documented exceptions for the things that need better consistency guarantees at the expense of scale
[14:19:51] which will help with this a lot
[14:19:55] but, shrug, this is all a tangent
[14:20:02] it'll allow consumers to read from readonly replicas too, not just the leader
[14:20:13] right now all writes and reads for a single partition all come from the same broker
[14:20:25] there's no reason why reads couldn't be balanced between all the replicas
[14:20:35] cdanis: yeah, but we can't decrease the # of partitions
[14:20:40] and topics are auto created
[14:20:46] ack
[14:21:00] we can set the default # of partitions for a new topic
[14:21:11] but we can never go back
[14:21:26] to solve the ordering problem with multi partitions, people partition by key
[14:21:37] so that things that need to be in order are always in the same partition
[14:21:53] and so that the same consumer instance will always process the messages with the same keys
[14:22:01] so
[14:22:03] kafka topics --alter --topic codfw.resource-purge --partitions 3
[14:22:08] kafka topics --alter --topic eqiad.resource-purge --partitions 3
[14:22:20] in main-codfw, main-eqiad and jumbo
[14:22:34] <_joe_> jumbo?
[14:22:36] <_joe_> why?
[14:22:42] we mirror those topics
[14:22:57] looks right to me
[14:23:11] yeah it probably wouldn't matter in jumbo, but it would be good to keep the # of partitions in sync between all clusters
[14:23:20] newer versions of kafka mirror maker do this automatically
[14:23:21] yep +1
[14:23:27] we have to do it manually
[14:23:32] <_joe_> I'm running sudo cumin 'R:package =purged' 'curl -s localhost:2112/metrics | fgrep rdkafka_consumer_topics_partitions_rxmsgs | cut -d \ -f 1'
[14:23:39] <_joe_> to see if purged picks up the new partitions
[14:23:43] cool
[14:23:53] ok proceeding
[14:24:03] _joe_: maybe also tail logs of one of the processes, if rdkafka is logging there, it should say something about rebalancing
[14:24:15] or, resubscribing to pick up new partitions
[14:24:44] codfw done
[14:24:54] <_joe_> elukey: wait now
[14:25:06] <_joe_> I still don't see the partition used
[14:25:18] <_joe_> ok now I see it
[14:25:19] elukey@kafka-main2001:~$ kafka topics --describe | grep codfw.resource-purge
[14:25:22] Topic:codfw.resource-purge PartitionCount:3 ReplicationFactor:3 Configs: Topic: codfw.resource-purge Partition: 0 Leader: 2003 Replicas: 2003,2001,2002 Isr: 2003,2002,2001 Topic: codfw.resource-purge Partition: 1 Leader: 2001 Replicas: 2001,2002,2003 Isr: 2001,2002,2003
[14:25:23] <_joe_> on 1 host
[14:25:27] Topic: codfw.resource-purge Partition: 2 Leader: 2002 Replicas: 2002,2003,2001 Isr: 2002,2003,2001
[14:25:28] <_joe_> let's wait a bit
[14:25:38] sure
[14:25:52] <_joe_> for now just ulsfo servers picked up the change
[14:26:24] <_joe_> it should take ~ 2 minutes for it to show up in the metrics
[14:26:34] <_joe_> let's also see what's the effect
[14:28:01] <_joe_> elukey: uhm it only shows up in the servers served by codfw
[14:28:03] <_joe_> makes sense
[14:28:15] <_joe_> you need to do the same for the same topic replicated in eqiad
[14:28:20] yep
[14:28:25] can I proceed?
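(To illustrate the "partition by key" point raised earlier in this exchange: with a non-null key, librdkafka's default partitioner hashes the key, so all messages sharing a key land in, and stay ordered within, one partition. A hypothetical sketch with kafkacat against a made-up test topic and broker; nothing like this was run against the production purge topics.)

    # produce two messages with the same key; they will hash to the same partition
    echo 'page_123:purge-1' | kafkacat -P -b kafka-test1001.example.net:9092 -t test.keyed -K:
    echo 'page_123:purge-2' | kafkacat -P -b kafka-test1001.example.net:9092 -t test.keyed -K:
    # consume and print which partition each key ended up in
    kafkacat -C -b kafka-test1001.example.net:9092 -t test.keyed -K: -f 'partition %p key %k: %s\n' -e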
[14:28:31] <_joe_> yes
[14:29:47] done
[14:30:07] <_joe_> can you please check the producer/kafka side of things?
[14:30:35] <_joe_> wait elukey
[14:30:58] <_joe_> you changed the partitions for codfw.resource-purge in codfw and eqiad.resource-purge in eqiad?
[14:31:13] <_joe_> you should do both in both dcs
[14:31:19] I did both in both dcs
[14:31:27] <_joe_> then something is not right
[14:31:40] FYI in eventgate-main logs i see things like
[14:31:42] Topic codfw.resource-purge partition count changed from 1 to 3
[14:31:51] <_joe_> in eqiad?
[14:32:04] <_joe_> anyways, let me verify
[14:32:11] that is eqiad
[14:32:17] no sorry
[14:32:17] codfw
[14:32:19] checking eqiad
[14:32:26] https://grafana.wikimedia.org/d/000000027/kafka?viewPanel=19&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&from=now-1h&to=now looks good
[14:32:33] ya eqiad:
[14:32:34] Topic eqiad.resource-purge partition count changed from 1 to 3
[14:33:29] the graph with eventgate latency seems to be improving btw
[14:33:49] last datapoints are ~150ms
[14:34:00] i don't see any messages being produced to the other partitions in eqiad though
[14:34:50] OH
[14:34:51] i do in codfw
[14:35:04] <_joe_> ok all purged converged
[14:35:13] huh
[14:35:39] <_joe_> and yes, only codfw partitions are creating new messages
[14:35:46] <_joe_> the eqiad new partitions are empty
[14:35:52] <_joe_> rdkafka_consumer_topics_partitions_rxmsgs{client_id="purged",partitions="1",topics="eqiad.resource-purge"} 0
[14:35:57] yeah that is strange
[14:36:03] plenty of messages in eqiad
[14:36:19] eventgate-main is the only thing that produces to resource-purge, right?
[14:36:26] cp and restbase don't, right?
[14:36:32] (or do they?)
[14:36:41] <_joe_> ottomata: restbase does I think?
[14:37:39] I'd expect it to produce to eventgate-main though
[14:37:45] not directly to the topic
[14:38:07] not required, cp/restbase are nodejs and can produce directly
[14:38:33] yes sure but they skip validation etc.. right?
[14:38:42] (like I was doing with netflow)
[14:38:45] yes
[14:39:28] i'm going to try to restart an eventgate-main pod in eqiad and see what happens
[14:39:59] oh i see a couple of messages in eqiad partition 1 and 2
[14:40:00] haha
[14:40:04] all wikitech???
[14:40:15] ???
[14:40:27] <_joe_> that's not surprising
[14:40:32] <_joe_> wikitech is still in eqiad
[14:40:36] <_joe_> with eventgate-main repooled
[14:40:40] <_joe_> it's producing there
[14:40:49] HMMMMMMM ok can we tell if the messages in eqiad are all from restbase / cp then?
[14:41:11] how do we tell, are the URLs the api/rest_v1/ ones?
[14:41:29] i really only see api/rest_v1/ URLs in eqiad.resource-purge
[14:42:13] <_joe_> yes
[14:42:14] (bbiab)
[14:42:21] <_joe_> and there is a tag [restbase]
[14:45:39] <_joe_> ok so
[14:45:48] <_joe_> save timings are down below last week's average
[14:46:05] <_joe_> Krinkle: can you confirm things are ok in the last ~ 15 minutes?
[14:46:13] <_joe_> I would like to undo my changes from this morning
[14:46:19] <_joe_> and re-failover everything to codfw
[14:52:22] _joe_: ish... I can confirm that we are in one of several dips I've seen over the past 48 hours
[14:52:23] https://grafana.wikimedia.org/d/000000085/save-timing?orgId=1&from=1598826341481&to=1599058322480
[14:52:49] <_joe_> which corresponds in timing with the reduction in latency in eventgate
[14:52:52] <_joe_> like the ones before
[14:53:04] <_joe_> only this time it's because of an infrastructural change we just made
[15:23:44] <_joe_> elukey, ottomata the kafka numbers don't look that great though https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=41&orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=codfw_resource-purge&var-kafka_broker=All&var-kafka_producer_type=$__all
[15:25:41] <_joe_> https://grafana.wikimedia.org/d/000000362/save-timing-alerts?orgId=1&from=now-1h&to=now has recovered though
[15:28:11] (we are in meetings so there will be some lag :)
[15:30:04] elukey: BTW we should do resource_change too i think
[15:30:11] yep I agree
[15:30:26] even if metrics for main-codfw are way better (bytes out)
[15:30:48] but 2003 is still handling more traffic
[16:13:43] just to add joy, we had a big mcrouter tko event today
[16:13:43] https://grafana.wikimedia.org/d/000000549/mcrouter?viewPanel=9&orgId=1&var-source=codfw%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&from=1599048014710&to=1599049141932
[16:14:17] it seems related to 4 memcached shards, but I didn't see the usual nic saturation
[16:14:33] nothing to do now, it was a one off, but let's monitor those metrics
[16:14:35] oh exciting
[16:15:16] (the gutter pool handled it all correctly)
[16:15:56] yeah I was just admiring the appserver 95p latency graph, very smooth recovery
[16:16:41] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&from=1599048014710&to=1599049141932 for the same timestamps
[16:16:57] and I'm sure not used to switching to codfw when I open that :D
[16:19:52] we should probably update the default variable values :)
[16:21:48] yeah I was just looking into that
[16:22:12] when you 'save dashboard' there's tickyboxes to also save the time range or variable values
[16:22:39] oh is THAT where that is, thanks
[16:22:47] I was rummaging around helplessly in the settings
[16:23:25] done I think! thanks
[16:23:46] lgtm
[16:54:22] I don't have time to look right now, am busy for the next hour or so, but surfacing https://phabricator.wikimedia.org/T260067 here for possible switchover latency issue
[16:54:27] rzl ^
[16:55:36] looking
[16:57:45] yeah I had the same error. sorry for not reporting it since it went away
[17:47:19] _joe_: save timing is now at pre-incident levels for 1h straight for the first time since 3 days ago.
[17:47:26] previously only 40min at most or so.
[17:47:52] is the measure being toggled on/off or is it something that keeps re-regressing?
[17:49:22] Krinkle: restbase-async is back to eqiad and eventgate-main is back to active/active
[17:49:28] as of about 15:55 today
[17:50:39] hm.. right that's when the last recovery started, and fully realised at ~ 16:40 UTC
[17:51:07] so, we've worked around it but we still don't understand the problem
[17:51:15] ack
[17:51:28] Was something similar done around 14:30 today?
[17:51:52] and 11:00?
[17:52:19] same traffic changes at 10:30 yeah
[17:52:24] if not, that might help in narrowing what also/underneath might be happening.
[17:52:44] anyway, will let you know if the situation changes. and let me know if you need any additional data or interpretation on save timing :)
[17:52:45] 14:30 no but I see some kafka-related SAL entries from elukey right then
[17:53:19] i don't know the timeline of restbase-async dc routing changes
[17:53:19] https://grafana.wikimedia.org/d/000000085/save-timing?orgId=1&from=1598837447840&to=1599069190273&var-group=group1
[17:53:36] see the brief dips in the top graph
[17:53:54] but elukey's logged changes were supposed to help with this problem too, that should have reduced the load on kafka-main2003 and spread the load around the cluster a little better
[17:53:56] restbase-async routing timeline: https://sal.toolforge.org/production?p=0&q=restbase-async&d=
[17:53:58] 08:00, 11:00, and (finally) 16:00
[17:54:20] j oe was saying that it only sorta helped, but not enough, so he repooled back to eqiad
[17:55:22] I was about to log off but if anybody needs to blame me I can take it :D
[17:55:37] haha
[17:55:43] haha just the opposite, we're blaming you for things improving
[17:55:56] don't let it happen again
[17:57:23] ahhh okok perfect, have a good rest of the day folks :)
[18:28:50] I'm working on uploading a package via `reprepro` on `install1003`. However it looks like `reprepro` isn't installed - is that expected?
[18:30:11] ryankemper: the right place to upload packages would be either apt1001 or releases1002, depending on which one it is. but usually the apt*
[18:30:37] Thanks! Had just noticed that https://wikitech.wikimedia.org/wiki/Install_server mentions that
[18:30:40] if docs still say install* then it's because they are outdated and we should fix it
[18:30:55] I think apt is the right server here
[18:31:00] ok, will take a look at the wiki page. it's because we split these roles/servers into separate ones
[18:31:03] I'm working off some old notes of gehel's so that would be why mutante
[18:31:21] yea, it used to be correct. but nowadays APT repo and tftp/dhcp are separate servers
[18:31:47] Makes sense. Looking through my local shell history I ended up using apt last time, had forgotten
[18:32:02] ack
[18:53:19] ryankemper: the right place for those notes is probably in the readme in the plugins project. With links to other documentation instead of hardcoding the server names.
[18:53:27] thanks for fixing the documentation bugs!
[20:15:57] this is possibly something I am overthinking but I will ask anyway: does Puppet (or our implementation of it) have some concept of a pre-merge hook (on the agent)?
[20:16:15] I cannot do this in the CI as it will be difficult to emulate the environment on the actual host
[20:16:45] specifically, I am looking for some way to trigger an external command before I run agent -tv on the host to apply the change, and if that command fails (returns an incorrect code), then Puppet can complain accordingly
[20:17:07] so far I am doing this manually (the command I run checks if the config is correct) but I was wondering if I could automate this
[20:17:21] sukhe: do you mean for installing a new configuration file?
[20:17:48] https://puppet.com/docs/puppet/5.5/types/file.html#file-attribute-validate_cmd
[20:17:54] cdanis: for making changes to a configuration file and then checking through an external command (dnsdist --check-config) if the config file structure is correct
[20:18:05] yeah, you want validate_cmd
[20:18:16] ah yes! indeed, thank you
[20:18:45] np :)
[20:23:17] ottomata: are you mid-puppet-merge?
[20:24:40] ah, nevermind, lock is now released
[20:28:15] sorry
[20:28:22] was tutorialing with razzi :)
[20:28:34] o/
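(For context on what validate_cmd buys here: Puppet stages the new file content in a temp file, substitutes its path for the '%' token in the validate_cmd string, and only installs the file if that command exits 0. Roughly, for the dnsdist case, the check it would run amounts to the sketch below; the path is made up and the use of -C to point dnsdist at an alternate config file is an assumption to verify against the dnsdist documentation.)

    # what ends up being run for each candidate config before it is swapped into place
    /usr/bin/dnsdist --check-config -C /tmp/puppet-staged-dnsdist.conf \
      && echo "valid: the new file will be installed" \
      || echo "invalid: the old file is kept and the resource fails"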