[08:34:09] <_joe_>	 good morning
[08:38:20] <_joe_>	 I'm about to add the requestctl web interface to the alerting hosts. I will disable puppet on the active alerting host and reenable it only when everything works fine on the other side
[08:39:27] <_joe_>	 alternatively, is there a Pontoon instance where I can test a change to the alerting hosts already set up?
[08:43:08] <tappof>	 Hi _joe_ ... Yes, there is an instance on Pontoon (phi-alert-01.o11y.eqiad1.wikimedia.cloud), but I’m not sure of the current status.
[08:52:24] <_joe_>	 tappof: ok thanks, maybe it's a good occasion to test setting up a stack myself :)
[09:00:35] <_joe_>	 (which is failing, so I guess I'll go the old way)
[09:10:18] <tappof>	 yes _joe_, it's not straightforward to do that in just a few minutes... Anyway, none of us are working on the alert hosts, so feel free to proceed with your activities.
[09:10:48] <_joe_>	 ack, the one thing I expect to fail wouldn't even impact the rest of the stuff hosted there
[09:11:05] <_joe_>	 (the deployment of the software, and I'll blame volans for that)
[09:12:35] * volans hides
[09:54:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: hiddenparma.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:24] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: hiddenparma.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:39] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: hiddenparma.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:22] <_joe_>	 that ^^ is resolved, actually
[11:02:54] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: hiddenparma.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:57:31] <jayme>	 Hi o/ can someone help with https://phabricator.wikimedia.org/T377132 short term? Unfortunately I've uncovered this a bit late so it's blocking and I have near zero knowledge of how the logstash side of things works
[13:07:12] <tappof>	 I can take a look jayme 
[13:08:32] <jayme>	 tappof: cool <3 - feel free to point me somewhere as well...
[13:11:39] <jayme>	 tappof: I see the log format on disk changed :/
[13:12:57] <jayme>	 updated the task
[13:13:24] <tappof>	 I think it's a problem with the regex
[13:13:57] <tappof>	 the containerd entries start with something like `2024-10-14T13:05:05.129274021Z stdout F {`
[13:14:35] <tappof>	 while the regex wants lines starting with {
[13:14:49] <jayme>	 yeah, docker seems to wrap everything into a json {"log": ...} while containerd prints json after the prefix you just sent
[13:14:55] <tappof>	 to parse and apply the tags k8s_docker...
[13:15:05] <jayme>	 I'll go check if I can change that behaviour in containerd
[13:16:49] <tappof>	 ^^ Yeah, otherwise we can try to strip the 'prefix' via Logstash, but I think using containerd is better.
[13:20:08] <jayme>	 tappof: actually the new format is the more "correct" and we'd need to bend containerd to behave like docker had
[13:22:15] <jayme>	 https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-cri-logging.md
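(Editor's sketch, not part of the conversation.) The line layout described in the linked proposal, and visible in the sample tappof quoted, is roughly `<timestamp> <stream> <tag> <content>`. A minimal Python sketch of splitting that prefix from the JSON payload might look like the following. The function name `parse_cri_line`, the regex, and the sample payload are illustrative assumptions, not the Logstash or rsyslog configuration actually in use.

```python
import json
import re

# Assumed layout, based on the sample quoted in the log:
#   <RFC3339Nano timestamp> <stream> <tag> <content>
# where stream is stdout/stderr and tag is P (partial) or F (full).
CRI_LINE = re.compile(
    r"^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<tag>[PF]) (?P<content>.*)$"
)


def parse_cri_line(line):
    """Return a dict with time, stream, tag and content, or None if no match."""
    match = CRI_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.groupdict()


# Hypothetical payload; only the prefix mirrors the example from the chat.
sample = '2024-10-14T13:05:05.129274021Z stdout F {"level": "info", "msg": "hello"}'
entry = parse_cri_line(sample)
payload = json.loads(entry["content"])
```

A Docker-style line that starts directly with `{` does not match this pattern and returns `None`, so a consumer could fall back to the existing handling in that case.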
[13:37:16] <tappof>	 Hmm, maybe I didn't understand. If the new containerd format is more correct, why do we need to make it behave like Docker? Or do you mean that we have to configure rsyslog/Logstash to parse the new containerd log format as well, like we already do for Docker? jayme 
[13:38:42] <jayme>	 that's what I meant, sorry. I think we should add support to logstash to parse it (or rsyslog...I'm always unsure whose job it is :))
[13:42:13] <jayme>	 tappof: from what I understand, rsyslog does not really deal with the log contents in this case. It adds metadata via the mmkubernetes plugin and forwards the logs via omkafka
[13:42:44] <jayme>	 so mayyyybe logstash filters is the right place to handle this?
[13:44:38] <tappof>	 ok ok, sounds clear. Yes, using a Logstash filter could be a good option... but perhaps stripping the prefix on the rsyslog side before pushing it to the Kafka topic could be more scalable. But ... I need to study it a bit.
[13:45:32] <tappof>	 It's not my only concern... I'm also worried about the P and F flags. jayme 
[13:45:45] <jayme>	 stripping the prefix on the rsyslog side would mean we'd also need to process the additional metadata there, stram type and tags ...yes, exactly 
[13:46:03] <jayme>	 *stream type
[13:50:07] <tappof>	 Yes, clear. You're right. Do we already have multiline entries in the Docker format as well?
[13:50:18] <jayme>	 I don't think so
[13:50:42] <tappof>	 eh ... it could be "fun"
[13:50:49] <jayme>	 yeah...
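(Editor's sketch, not part of the conversation.) The multiline concern comes from the P and F tags: a long line may be written as several P (partial) chunks followed by a closing F chunk, and the JSON is only valid once the chunks are joined. A minimal Python sketch of that reassembly, assuming entries have already been split into (stream, tag, content) tuples, could be the following. `join_partial` is a hypothetical helper and says nothing about how rsyslog or Logstash would implement this.

```python
def join_partial(entries):
    """Yield (stream, content) for each complete line.

    entries: iterable of (stream, tag, content) tuples in file order.
    Chunks tagged "P" are buffered per stream until an "F" chunk closes them.
    """
    pending = {}
    for stream, tag, content in entries:
        pending.setdefault(stream, []).append(content)
        if tag == "F":
            yield stream, "".join(pending.pop(stream))


chunks = [
    ("stdout", "P", '{"a": '),
    ("stderr", "F", "err"),
    ("stdout", "F", "1}"),
]
joined = list(join_partial(chunks))
```

Buffering per stream keeps interleaved stdout and stderr chunks apart. A trailing P chunk with no closing F is never emitted here, which is one way "invalid" JSON could end up silently dropped.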
[13:52:25] <jayme>	 we've also had our fair share of rsyslog issues in the past (https://phabricator.wikimedia.org/T357616) and still ongoing ... maybe it's about time to bite the bullet and move to something else
[13:53:17] <jayme>	 but last time I brought this up it was received rather sceptically, as o11y had no resources to help and serviceops had less of a clue of everything that happens after stuff leaves our worker nodes :D
[14:49:02] <tappof>	 jayme: I put a patch on Gerrit to remove the prefix from the containerd logs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080047. Maybe it could be a starting point. I'll wait until tomorrow to check with cwhite. Doing a grep on kubestage1003 for 'std.*P \{' didn't return anything. It means nothing for the future, but it was just a quick statistic.
[14:50:28] <jayme>	 tappof: thanks! The move to containerd is pretty recent, so it's quite possible that we've not seen a multiline log yet
[14:51:28] <tappof>	 Yes, I can imagine jayme ... as I said ... It means nothing :)
[14:53:25] <tappof>	 jayme: For the other stuff, like considering alternatives to bring the logs to OpenSearch without using rsyslog, I need to discuss it with the team.
[14:54:51] <jayme>	 tappof: sure...I'd prefer a quick solution anyways tbh. so we don't have to stop and postpone the containerd migration
[14:55:20] <jayme>	 I've checked the other containerd nodes as well FWIW - no partial lines (sudo cumin P:containerd "egrep -r 'std.*P \{' /var/log/pods/ |wc -l")
[14:57:49] <tappof>	 cool jayme ... If it’s okay for Cole as well, maybe we can try this patch, keeping in mind that I don't think this solution handles containerd multiline entries.
[15:00:24] <jayme>	 might be okay for staging for a while but I would not like to carry this to prod tbh.
[15:01:05] <jayme>	 we'd probably be silently dropping "invalid" json etc. :/
[15:02:00] <jayme>	 many thanks for taking the time and a first step (on such short notice)!
[15:21:40] <tappof>	 You're welcome. Anyway, I agree with you... the management of multiline entries in this case is my biggest concern. I'll talk with cwhite and the team about possible solutions.