[00:06:36] going to restart some purged as well
[00:20:52] I was able to mask kafka and kafka mirror maker on kafka-main2003 while it was up, so any further flap should be ok for purged instances
[00:21:03] (there might be some purged instances failing now to fix)
[06:58:20] shdubsh: thanks for the fix of my systemctl mask attempt!
[07:02:10] just added some downtime to all the hosts in codfw c7
[07:04:15] shdubsh, herron - do we have plans for https://phabricator.wikimedia.org/T225005 ? having only 3 brokers is a bit scary :(
[07:31:45] <_joe_> indeed
[07:32:33] <_joe_> elukey: what was up with purged?
[07:34:02] _joe_ buongiorno :) it is something I have already seen with varnishkafka IIRC: librdkafka gets stuck if a kafka broker goes down all of a sudden (not gracefully, I mean)
[07:34:14] <_joe_> ok
[07:34:20] the consumer keeps timing out until you restart it
[07:34:21] <_joe_> so we have one broker down?
[07:34:37] we do yes, it is in the codfw c7 rack
[07:34:41] kafka-main2003
[07:34:50] (currently masked so it doesn't flap for consumers)
[07:35:01] <_joe_> what do you mean masked?
[07:35:10] <_joe_> you masked kafka??
[07:35:48] I tried before going to bed but I forgot to disable puppet (of course), and Cole re-did it properly
[07:36:16] <_joe_> so kafka is just down there; doesn't it make more sense to instead change clients not to use it for now,
[07:36:30] <_joe_> but still leave it able to replicate topics?
[07:36:30] the main issue is that mgmt is not available, so it must be done during the one minute in which the host has connectivity (when for some reason the switch comes back to life again)
[07:36:50] <_joe_> oh ok, so the whole switch is dead? because I don't see it in icinga
[07:37:20] yes, I downtimed all the hosts for a couple of days; everything is tracked in https://phabricator.wikimedia.org/T267865
[07:37:39] <_joe_> I don't even see them as downtimed
[07:37:50] <_joe_> and I can easily access kafka-main2003
[07:37:52] for kafka clients it is fine to have a broker down; the main problem is if it flaps
[07:37:54] <_joe_> that's why I am asking
[07:38:55] yes, the switch is working now; there were recoveries some minutes ago, but it will surely fail again (it has been doing that for a day, basically)
[07:39:23] and I think I downtimed all the hosts in the rack this morning, but I may not have done it properly
[07:41:03] <_joe_> anyways, I was advocating for moving purged to consume from eqiad, if we need to
[07:41:26] ahhhh
[07:41:41] okok, then I have no idea what is best; sorry, didn't get it at first
[12:42:43] klausman: good news! prometheus-amd-rocm is sending love letters again (stat1008)
[12:43:14] ah, looks like you already addressed that :)
[12:45:59] Yeah, on updates there is always a window of opportunity when the old package is gone but the new one isn't ready yet. We *could* disable the cronjob, update, and re-enable it, but I'd rather bother you pointlessly with cron messages :)
[12:46:17] that figures :)
[13:50:32] elukey: hey, want to meet and sync up about it?
[13:56:00] herron: hello! is later on ok? Nothing major, just to have moar brokers, that's it :)
[13:56:08] (IIUC we have two ready to go, right?)
[13:57:42] elukey: ok sounds good, yup, mostly a matter of making a plan of attack to rebalance topics across them afaik
[14:00:10] ah yes yes! For that I think we can sync with razzi, he's working on it for Jumbo :)
[14:19:54] herron: so we are tracking the work to understand how to move partitions in https://phabricator.wikimedia.org/T255973
[14:20:17] but I think that in the meantime we can add the extra two kafka-main[12]00[45] nodes to both clusters
[14:20:39] once we figure out a good way to move partitions, we do it
[14:20:57] I've done it before, but it has been many years
[14:21:06] in theory it's easy; in practice it's a little hard to reason about
[23:23:11] Heads up, rzl and I are running the updateCollation script on mwmaint1002; we expect enwiki to take up to a week
[23:23:26] and we are running one script per db shard
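The failure mode described at 07:34 (a librdkafka-based consumer timing out indefinitely after a broker dies non-gracefully, until the process is restarted) can be sketched as a small watchdog around the poll loop. This is a minimal illustration, not purged's actual code; `poll_fn`, `restart_fn`, and the 60-second stall threshold are all assumptions:

```python
import time

# Sketch of the workaround discussed above: if polls keep timing out
# for longer than stall_timeout, assume the consumer is wedged and
# recreate it. All names here are illustrative, not purged's code.
class ConsumerWatchdog:
    def __init__(self, poll_fn, restart_fn, stall_timeout=60.0, clock=time.monotonic):
        self.poll_fn = poll_fn          # returns a message, or None on timeout
        self.restart_fn = restart_fn    # tears down and recreates the consumer
        self.stall_timeout = stall_timeout
        self.clock = clock
        self.last_ok = clock()          # time of the last successful poll

    def step(self):
        """Poll once; restart the consumer if it has stalled for too long."""
        msg = self.poll_fn()
        now = self.clock()
        if msg is not None:
            self.last_ok = now
            return msg
        if now - self.last_ok > self.stall_timeout:
            self.restart_fn()
            self.last_ok = now          # give the new consumer a fresh window
        return None
```

The injectable `clock` is only there to make the stall logic easy to exercise without real waiting.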
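On the rebalancing discussed at 13:57 and in T255973: Kafka's stock `kafka-reassign-partitions.sh` tool accepts a JSON file mapping each partition to its target replica list, so one way to spread load onto newly added brokers is to generate that file programmatically. A minimal sketch assuming a naive round-robin placement; the topic name, partition count, and broker ids below are made up for illustration and are not the actual plan:

```python
import json

# Build the JSON document that kafka-reassign-partitions.sh accepts,
# spreading each partition's replicas round-robin across the full
# (expanded) broker list. Illustrative only, not the T255973 plan.
def build_reassignment(topic, num_partitions, brokers, replication_factor=3):
    partitions = []
    for p in range(num_partitions):
        # Rotate the starting broker per partition so leaders spread out.
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}

# Hypothetical topic and broker ids, purely for demonstration.
plan = build_reassignment("example-topic", num_partitions=6,
                          brokers=[1001, 1002, 1003, 1004, 1005])
print(json.dumps(plan, indent=2))
```

In practice a real plan also has to throttle replication traffic and avoid moving too many partitions at once, which is the "hard to reason about" part mentioned at 14:21.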