[07:08:12] happy new year folks :)
[07:08:53] happy new year to you too!
[07:31:52] XioNoX: when you have a moment can you check https://phabricator.wikimedia.org/T271041 and let me know what you think?
[07:32:03] I am a little confused about why the host behaves in that way
[07:32:08] happy new year everyone :-)
[07:38:26] <_joe_> "happy"
[07:38:33] <_joe_> I love your collective optimism
[07:54:24] it is a wish for the new year to be happy, not an expectation! :-P
[07:57:28] I'll join that with another "happy new year" o/
[08:13:40] elukey: sure, looking
[08:14:02] * volans here too
[08:21:32] elukey: how can I ssh to the host?
[08:22:26] XioNoX: you can ssh to the host, it works, it may be intermittent
[08:22:30] but it worked yesterday
[08:23:33] elukey: ok, from the graphs looks like something happened at 10am UTC on the 30th
[08:24:28] yes exactly, I see errors from that moment for the interface, but I have found no real clue about what's wrong
[08:28:49] switch logs don't have anything useful
[08:29:55] well, some cryptic logs about 1h before
[08:35:20] XioNoX: is my understanding right that the default gw for ipv6 comes from RA?
[08:35:39] elukey: yep
[08:35:56] I can't ssh to the host
[08:36:03] I can :(
[08:37:45] fwiw it was also listed in T250367 XioNoX
[08:37:46] T250367: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367
[08:38:11] elukey: bast2002:~$ ping ms-be2050.codfw.wmnet doesn't work
[08:39:11] but works from bast3004
[08:40:24] very weird
[08:40:46] could it be that this is due to the interface errors that we are seeing?
[08:41:00] (I also don't have any idea what those are about)
[08:44:12] what was this task about cloud hosts with broken NIC firmware?
[08:44:19] greetings, and happy new year !
[08:45:10] elukey: thanks for taking a look at ms-be2050 !
[08:45:21] ciao godog!
[08:45:28] elukey: https://phabricator.wikimedia.org/T269313
[08:46:55] wow
[08:48:57] looks like we're running 'firmware_version': '20.8.163/1.8.4 pkg 20.08.04.03', so not the same version
[08:51:12] and the dropped packets on the switch side are weird too, but the symptoms are similar
[08:52:02] Can a broken fiber between the 10g port of ms-be2050 and asw-d cause the drops?
[08:55:49] elukey: looks unlikely
[08:56:01] trying to figure out more on why they are discarded
[08:56:42] ack ack
[08:57:48] I'm seeing the multicast/broadcast TX counters increase
[08:59:00] elukey: can I bounce the switch port?
[08:59:45] XioNoX: from my point of view I think so, but let's ask godog to see if this is ok or not (in theory yes since the swift backends can be rebooted anytime)
[09:01:25] XioNoX: I think that we can proceed, it will be quick
[09:01:27] elukey XioNoX yes please
[09:01:32] good to go any time
[09:01:32] ah there you go :D
[09:01:35] <3
[09:03:26] LMK if I can help with ms-be2050
[09:03:49] done, I can ssh now
[09:04:50] it's still dropping multicast packets though
[09:06:12] XioNoX: but now I can do "telnet -4 puppet.eqiad.wmnet 8140"
[09:06:30] it worked also briefly after the reboot
[09:09:25] * arturo waves to everyone
[09:09:39] o/
[09:12:02] elukey: so yeah I'd say, try a different patch, then different switch port
[09:13:55] elukey: and upgrade NIC firmware
[09:22:09] XioNoX: /me ignorant, what do you mean with "patch" ?
[09:22:20] elukey: I mean DAC
[09:22:24] patch cable
[09:22:34] ah yes yes
[09:22:42] true, patch means so many things in this world :)
[09:23:26] thanks a lot for checking, I'll ask Papaul (if he is in today) to check the cable..
[09:25:01] <_joe_> elukey/XioNoX can you set a priority on that task, please?
[09:25:28] yep done
[09:29:15] <_joe_> <3
[19:24:12] herron: interested in a walk down memory lane? I just set up acme-chief in the cloudinfra project; we need to change the mail exchanges to use that rather than letsencrypt::cert::integrated
[19:24:35] (art.uro has already offered to work on it but it might be super quick for you if you remember how that's put together)
[19:32:39] andrewbogott: nice, should be pretty straightforward tbh. essentially including the acme_chief::cert in that profile, describing the cert in the acme chief hiera yaml for that environment, and pointing the exim config to the live cert with a path based on the acme chief name e.g. /etc/acmecerts/mx/live/rsa-2048.chained.crt. happy to try and answer questions just lmk
[19:33:01] great, we'll see how far I get
[19:42:34] andrewbogott: I am not sure if this makes sense but I thought I should share in case it is helpful: git show 7ee80b5aab a551c82d7c 340e8f085
[19:42:52] (basically what herron said, but with the commits. I remembered I had done this so shared)
[19:43:28] thanks! I think I understand most of that
[19:46:09] I don't think I know what to use for key_group.
[19:46:51] oh nm, I can probably just leave it as it was
[19:48:28] the default is 'root'
[19:51:39] oh, it's just 'who owns this key'?
[19:53:08] yep, the group that owns the key
[20:03:14] sukhe and/or herron, maybe it's just this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/654295
[20:06:09] sorry, what's $cert_name here?
[20:06:17] andrewbogott: that, plus an entry in the cloudinfra equivalent of role/common/acme_chief.yaml and possibly just defining a static name instead of the cert_name variable
[20:06:24] hieradata/role/common/acme_chief.yaml should match it
[20:06:42] sukhe: cert_name is just $hostname in the cases I care about
[20:07:04] here's my hiera:
[20:07:06] https://www.irccloud.com/pastebin/1PCfryaV/
[20:09:15] yep, I think this is about it
[20:10:14] yes agreed, although the authorized_regexes are a bit permissive considering the cert is meant for one host only
[20:11:54] true, it's a secure project so I don't think it matters.
[20:12:30] the next thing is: I don't immediately know how to test whether it's working or not. herron, do you have any interest in merging/applying/validating the change?
[20:16:07] sure, although now that I'm thinking about deploying this, a safer way to go would be to do it in two steps. first patch adds acme_chief::cert, then we run puppet and inspect the cert, and a follow-up patch updates exim to use it once all looks good
[20:17:15] ah, good idea
[20:17:17] I'll break it up
[20:17:54] ok sounds good thx
[20:21:07] andrewbogott: ok patch looks good, and that hiera snippet is live in the horizon project hiera?
[20:21:23] yep. I'm merging the first patch that creates the acme certs
[20:21:27] kk
[20:22:17] be warned that puppet is super noisy since letsencrypt::cert::integrated doesn't run anymore
[20:25:55] hm, looks like my chief is broken somehow
[20:31:22] yeah from first glance seems TLS validation is failing to cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud:8140
[20:32:05] sorry I have to go afk for a bit shortly, but can have another look later on tonight or tomorrow
[20:33:05] ok, thank you!
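
The 19:32:39 message gives one example of where exim should find the live cert: /etc/acmecerts/mx/live/rsa-2048.chained.crt. A minimal Python sketch of that path convention follows, assuming the layout generalizes as base directory, cert name, `live`, then key type plus `.chained.crt`. The helper name `live_cert_path` and its defaults are hypothetical and come from that single example, not from acme-chief itself.

```python
from pathlib import PurePosixPath


def live_cert_path(cert_name: str,
                   key_type: str = "rsa-2048",
                   base: str = "/etc/acmecerts") -> str:
    """Build the chained-cert path for a given cert name.

    Assumption: the layout is <base>/<cert_name>/live/<key_type>.chained.crt,
    inferred only from the example path quoted in the log.
    """
    return str(PurePosixPath(base) / cert_name / "live" / f"{key_type}.chained.crt")


if __name__ == "__main__":
    # Reproduces the example path mentioned in the conversation.
    print(live_cert_path("mx"))
```

With a helper like this, the "run puppet and inspect the cert" step suggested at 20:16:07 could start by checking that the file at the returned path exists before the follow-up patch points exim at it.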