[08:34:09] <_joe_> good morning
[08:38:20] <_joe_> I'm about to add the requestctl web interface to the alerting hosts. I will disable puppet on the active alerting host and reenable it only when everything works fine on the other side
[08:39:27] <_joe_> alternatively, is there a Pontoon instance where I can test a change to the alerting hosts already set up?
[08:43:08] Hi _joe_ ... Yes, there is an instance on Pontoon (phi-alert-01.o11y.eqiad1.wikimedia.cloud), but I’m not sure of the current status.
[08:52:24] <_joe_> tappof: ok thanks, maybe it's a good occasion to test setting up a stack myself :)
[09:00:35] <_joe_> (which is failing, so I guess I'll go the old way)
[09:10:18] yes _joe_, it's not straightforward to do that in just a few minutes... Anyway, none of us are working on the alert hosts, so feel free to proceed with your activities.
[09:10:48] <_joe_> ack, the one thing I expect to fail wouldn't even impact the rest of the stuff hosted there
[09:11:05] <_joe_> (the deployment of the software, and I'll blame volans for that)
[09:12:35] * volans hides
[09:54:24] FIRING: SystemdUnitFailed: hiddenparma.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:24] RESOLVED: SystemdUnitFailed: hiddenparma.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:39] FIRING: [2x] SystemdUnitFailed: hiddenparma.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:22] <_joe_> that ^^ is resolved, actually
[11:02:54] RESOLVED: SystemdUnitFailed: hiddenparma.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:57:31] Hi o/ can someone help with https://phabricator.wikimedia.org/T377132 short term? Unfortunately I've uncovered this a bit late so it's blocking and I have near zero knowledge of how the logstash side of things works
[13:07:12] I can take a look jayme
[13:08:32] tappof: cool <3 - feel free to point me somewhere as well...
[13:11:39] tappof: I see the log format on disk changed :/
[13:12:57] updated the task
[13:13:24] I think it's a problem with the regex
[13:13:57] the containerd entries start with something like `2024-10-14T13:05:05.129274021Z stdout F {`
[13:14:35] while the regex wants lines starting with {
[13:14:49] yeah, docker seems to wrap everything into a json {"log": ...} while containerd prints json after the prefix you just sent
[13:14:55] to parse and apply the tags k8s_docker...
[13:15:05] I'll go check if I can change that behaviour in containerd
[13:16:49] ^^ Yeah, otherwise we can try to strip the 'prefix' via Logstash, but I think using containerd is better.
[13:20:08] tappof: actually the new format is the more "correct" one and we'd need to bend containerd to behave like docker did
[13:22:15] https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-cri-logging.md
[13:37:16] Hmm, maybe I didn't understand. If the new containerd format is more correct, why do we need to make it behave like Docker? Or do you mean that we have to configure rsyslog/Logstash to parse the new containerd log format as well, like we already do for Docker? jayme
[13:38:42] that's what I meant, sorry. I think we should add support to logstash to parse it (or rsyslog...I'm always unsure whose job it is :))
[13:42:13] tappof: from what I understand, rsyslog does not really deal with the log contents in this case. It adds metadata via the mmkubernetes plugin and forwards the logs via omkafka
[13:42:44] so mayyyybe logstash filters is the right place to handle this?
[13:44:38] okok sounds clear Yes, using a Logstash filter could be a good option... but perhaps stripping the prefix on the rsyslog side before pushing it to the Kafka topic could be more scalable. But ... I need to study it a bit.
[13:45:32] It's not my only concern... I'm also worried about the P and F flags. jayme
[13:45:45] stripping the prefix on the rsyslog side would mean we'd also need to process the additional metadata there, stram type and tags ...yes, exactly
[13:46:03] *stream type
[13:50:07] Yes, clear. You're right. Do we already have multiline entries in the Docker format as well?
[13:50:18] I don't think so
[13:50:42] eh ... it could be "fun"
[13:50:49] yeah...
[13:52:25] we've also had our fair share of rsyslog issues in the past (https://phabricator.wikimedia.org/T357616) and still ongoing ... maybe it's about time to bite the bullet and move to something else
[13:53:17] but last time I brought this up it was received rather sceptically, as o11y had no resources to help and serviceops had less of a clue of everything that happens after stuff leaves our worker nodes :D
[14:49:02] jayme: I put a patch on Gerrit to remove the prefix from the containerd logs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080047. Maybe it could be a starting point. I'll wait until tomorrow to check with cwhite. Doing a grep on kubestage1003 for 'std.*P \{' didn't return anything. It means nothing for the future, but it was just a quick statistic.
[14:50:28] tappof: thanks! The move to containerd is pretty recent, so it's quite possible that we've not seen a multiline log yet
[14:51:28] Yes, I can imagine jayme ... as I said ... It means nothing :)
[14:53:25] jayme: For the other stuff, like considering alternatives to bring the logs to OpenSearch without using rsyslog, I need to discuss it with the team.
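(Editor's note) The prefix discussed above follows the CRI logging layout from the linked design proposal: timestamp, stream, a P/F tag, then the payload. The following is a minimal Python sketch of splitting such a line, not the Gerrit patch and not the actual rsyslog/Logstash configuration. The names `CRI_LINE`, `CriEntry` and `parse_cri_line` are hypothetical and exist only for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, per the CRI logging design proposal:
#   <timestamp> <stdout|stderr> <P|F> <content>
CRI_LINE = re.compile(
    r"^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<tag>[PF]) (?P<content>.*)$"
)


class CriEntry(NamedTuple):
    time: str     # RFC 3339 timestamp written by the runtime
    stream: str   # "stdout" or "stderr"
    tag: str      # "F" = full line, "P" = partial line
    content: str  # the payload the container actually wrote


def parse_cri_line(line: str) -> Optional[CriEntry]:
    """Split one on-disk log line into prefix fields and payload.

    Returns None when the line does not carry the CRI prefix,
    for example a Docker-style line that starts with "{".
    """
    match = CRI_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return CriEntry(**match.groupdict())


entry = parse_cri_line('2024-10-14T13:05:05.129274021Z stdout F {"msg": "hello"}\n')
```

Here `entry.content` holds only the JSON payload, which is what a pattern expecting lines that start with `{` would need, while `entry.stream` and `entry.tag` remain available as metadata instead of being discarded.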
[14:54:51] tappof: sure...I'd prefer a quick solution anyways tbh. so we don't have to stop and postpone the containerd migration
[14:55:20] I've checked the other containerd nodes as well FWIW - no partial lines (sudo cumin P:containerd "egrep -r 'std.*P \{' /var/log/pods/ |wc -l")
[14:57:49] cool jayme ... If it’s okay for Cole as well, maybe we can try this patch, keeping in mind that I don't think this solution handles containerd multiline entries.
[15:00:24] might be okay for staging for a while but I would not like to carry this to prod tbh.
[15:01:05] we'd probably be silently dropping "invalid" json etc. :/
[15:02:00] many thanks for taking the time and a first step (on such short notice)!
[15:21:40] You're welcome. Anyway, I agree with you... the management of multiline entries in this case is my biggest concern. I'll talk with cwhite and the team about possible solutions.
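(Editor's note) The multiline concern comes from the P tag: a long message may be written as several P chunks closed by one F chunk, so stripping the prefix line by line would leave JSON fragments that fail to parse. Below is a rough Python sketch of joining the chunks per stream before parsing. It assumes the CRI layout from the design proposal linked in the log; `reassemble` is a hypothetical helper, not something taken from the patch or from rsyslog/Logstash.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed layout: <timestamp> <stdout|stderr> <P|F> <content>
CRI_LINE = re.compile(r"^(\S+) (stdout|stderr) ([PF]) (.*)$")


def reassemble(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (stream, message) pairs, joining P chunks up to the closing F."""
    pending: Dict[str, List[str]] = {}  # chunks buffered per stream
    for line in lines:
        match = CRI_LINE.match(line.rstrip("\n"))
        if match is None:
            # Not a CRI line. A real pipeline should count or log these
            # rather than drop them silently.
            continue
        _time, stream, tag, content = match.groups()
        pending.setdefault(stream, []).append(content)
        if tag == "F":
            # F closes the message for this stream: emit the joined chunks.
            yield stream, "".join(pending.pop(stream))


sample = [
    '2024-10-14T13:05:05.1Z stdout P {"msg": "he',
    '2024-10-14T13:05:05.2Z stderr F oops',
    '2024-10-14T13:05:05.3Z stdout F llo"}',
]
messages = list(reassemble(sample))
```

Buffering per stream keeps interleaved stdout and stderr chunks apart. In this sample the stdout message only becomes valid JSON once both chunks are joined, which is the case a plain prefix strip would turn into two "invalid" entries.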