[08:52:53] headsup, I've started a rebalancing of kafka-logging. I'll intercept any alert in alertmanager and will silence them for ~1 day
[08:53:39] ack thanks for the headsup
[11:22:39] {{done}}
[13:44:28] I'm getting timeouts trying to ssh into deploy2002. Is it just me?
[13:45:06] works for me
[13:45:07] Connection timed out during banner exchange
[13:45:07] Connection to UNKNOWN port 65535 timed out
[13:45:14] That looks wrong...
[13:45:28] what could cause that?
[13:46:27] seems to affect all hosts
[13:46:44] I'd say double check your ssh config and try with ssh -v. Also try to ssh directly to a bastion first
[13:46:50] easier to debug without the jump
[13:47:39] it worked a couple of hours ago, i didn't change my ssh config since then. i'll try bastion
[13:48:54] huh. ssh is telling me that the fingerprint for bastion has changed.
[13:49:06] I'm seeing SHA256:tUl4qI8SiYtw5ZU6KBgyS2nhcuqjwfbHFYpsuka/LHM.
[13:50:44] do you use wmf-laptop package?
[13:50:54] no hold on, that's for labs.
[13:51:19] you can re-run the wmf-update-known-hosts-production to get the known hosts updated
[13:51:24] do... i don't even know what that is.... my ssh conf has just accumulated cruft over the last 15 years :P
[13:51:47] https://wikitech.wikimedia.org/wiki/Wmf-laptop
[13:52:24] ok, i'll have a look at that
[13:52:40] :)
[13:53:49] ssh bast1003.wikimedia.org works. ssh deploy2002.codfw.wmnet still doesn't.
[13:53:55] which bastion do you use?
[13:56:02] duesen: can you run the failing ssh command with `ssh -v` and paste the output?
[14:00:22] taavi: https://phabricator.wikimedia.org/P85519
[14:08:43] same thing with bast2003. I can log into the bastion, but I can't get anywhere from there.
[14:12:51] taavi, volans: any ideas?
[14:17:26] sorry, a bit busy debugging something right now, can't look in depth
[15:00:40] duesen: OOI, does it work if you use the codfw bastion rather than the eqiad one?
[15:01:11] Emperor: which bastion would that be?
[15:01:28] bast2003.wikimedia.org
[15:01:58] I note that my ssh config (which comes from wmf-laptop) always uses the dc-local bastion host for ssh connections
[15:02:20] Emperor: can you share your config?
[15:02:31] Emperor: bast2003 doesn't work either
[15:02:59] two ticks
[15:03:28] hm?
[15:03:50] duesen: https://phabricator.wikimedia.org/P85525
[15:09:18] Thanks. I don't see any suspicious differences.
[15:09:36] I used my config this morning, with the same host. Then it broke....
[15:09:56] Maybe i should just reboot before digging in more. Could just be cosmic rays.
[16:21:56] hi folks - FYI, some time after 18:15 UTC today, there will be a brief (seconds) disruption in network connectivity to conf1009 as it migrates to a new ToR switch.
[16:21:56] * read and write operations will continue to succeed against the 2 other nodes (i.e., quorum is maintained).
[16:21:56] * that said, clients (e.g., conftool, spicerack) may see transient errors communicating with conf1009 specifically (though these may be retried transparently by the client in most cases).
[16:21:56] I'll follow up here when the work is done.
[16:41:39] taavi, Emperor: for the record: rebooting fixed it. Looks like ssh just choked in an odd way...
[16:42:45] "yay"
[17:04:54] !oncall
[17:35:00] !oncall-now
[17:35:00] Oncall now for team SRE, rotation 247_policy:
[17:35:01] t.opranks, c.laime
[17:46:11] inflatador: you asked about reimaging; you aren't having Grub issues by chance? If so, consider T407586
[17:46:11] T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586
[17:50:18] andrewbogott interesting, I was focused on UEFI but I am trying to reimage to trixie
[17:50:53] probably unrelated, but that one should be fixed now
[17:52:14] probably a good idea to try different OS anyway, thanks for sharing