[06:20:54] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:07] any taker for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295389 ?
[08:26:17] and for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295397 ?
[08:30:13] looking
[08:31:47] thanks
[09:11:27] The Debian Lottery on how we name this drive finally allowed me to reimage a server
[09:13:26] I would like to join the Linux Random Drive Naming Syndrome Group, Emperor, if you accept my membership 🙏
[09:14:47] if we press hard enough, we may get more intuitive and stable names like sdaenp0esp7cylinder87part4, like the network devices
[09:19:06] haha
[09:19:33] I've ended up doing a chunk of /dev/disk/by-path on swift, which is only moderately horrible
[09:20:52] I ended up creating a script to set up the new partitions I created automatically by UUID
[09:21:02] the SM nodes are the worst for this, you end up with fstab entries like /dev/disk/by-path/pci-0000:98:00.0-sas-exp0x500304801ffa743f-phy0-lun-0-part1
[09:21:43] (but it does mean when a disk fails and you replace it, puppet can Just Do It, and that the paths are all repeatable across reboots reliably)
[09:21:54] Yeah, my entries now look like: UUID="8b4fe740-85b9-472e-8f3f-bd7826681f9b" /srv/objectstorage00 ext4 noatime,nodiratime,data=writeback,commit=60,nofail 0 2
[09:22:12] but that is only a fix while live; on reimage, the installer does what it wants
[09:22:50] Mm, the installer only touches the SSDs on the swift nodes, puppet does the rest. But you lose SAN points every time you have to touch swift_disks.rb :)
[09:36:00] yeah, it is a different case, for me the issue is the changes to virtual disks
[09:39:13] jynus: the random drive naming group has been renamed recently
[09:40:43] Down with People's Front of Judea.
I only support Judean People's Front!
[10:20:54] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:50] looking
[10:35:44] it's a multiinstance host with 4 /etc/default/prometheus-mysqld-exporter@m1 /etc/default/prometheus-mysqld-exporter@m2 ... reading the right file each /var/lib/prometheus/.my.m1.cnf ...
[10:36:48] the "plain" prometheus-mysqld-exporter.service should probably not be needed, has anybody seen this issue before?
[10:38:15] federico3: not sure what you mean?
[10:38:30] is this a multisource host?
[10:38:38] it is, and it's on Trixie
[10:38:45] if yes, then disable it once and forget
[10:39:05] each /etc/default/prometheus-mysqld-exporter@m reads the related /var/lib/prometheus/.my.m.cnf
[10:39:18] but the "plain" prometheus-mysqld-exporter.service also tries to start
[10:39:50] the reason puppet doesn't know about it is that technically one could have multiinstance and single instance combined, and the package enables it by default, so there is no logic to detect it automatically
[10:40:21] but there could be some advanced logic with exports to handle it automatically
[10:40:43] or changing the package to not be enabled by default, and only enable the ones set up
[10:41:14] so systemctl disable X && systemctl reset-failed
[10:41:17] for now I disabled it by hand
[10:41:40] with sudo systemctl disable --now prometheus-mysqld-exporter.service - but is puppet not going to try to start it again?
[10:42:04] no, because it is not used by puppet on multiinstance; puppet on multiinstance uses the @section ones
[10:42:18] it is set up by the package automatically, which I think is a mistake in the package
[10:42:38] or as I said, a bit more high-level puppet understanding of what services are enabled
[10:43:41] the instance config puts up its own monitoring for itself, but doesn't disable anything because it cannot know if it is being used or not
[10:43:55] RESOLVED: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:23] I guess it could be disabled on a multiinstance common.pp
[10:44:29] for the profile
[10:45:11] but not on the module itself
[10:46:44] I think it is this doubt of "should this be automated on post deb or on config management?" and I never got a clear delineation, plus people outside sre do some weird stuff with mariadb
[12:32:51] despite all the setbackups, I am now seeing the light at the end of the tunnel for all backup-related things
[12:34:07] I think it will take me a week or more to renew the bacula es ro backups though
[12:34:55] more like 2 weeks, calculations say 48 hours per section, and we have 7 of those
[12:37:34] have a nice weekend!
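
Note on the 10:35-10:46 exporter thread: the manual fix discussed there (disable the "plain" unit, then reset-failed) could be sketched roughly as below. This is an illustration only, not something puppet or the package ships. The helper names `has_instance_configs` and `plan_cleanup` and the `DEFAULTS_DIR` variable are hypothetical. The check assumes that the presence of per-instance defaults files means the plain unit is unused, which, as noted in the discussion, is not guaranteed because multiinstance and single instance can be combined. For that reason the sketch only prints the commands for review instead of running them.

```shell
#!/bin/sh
# Sketch only: DEFAULTS_DIR and both helper names are illustrative assumptions.
DEFAULTS_DIR="${DEFAULTS_DIR:-/etc/default}"
PLAIN_UNIT="prometheus-mysqld-exporter.service"

# Succeeds if at least one per-instance defaults file
# (prometheus-mysqld-exporter@<section>) exists in the given directory.
has_instance_configs() {
    for f in "$1"/prometheus-mysqld-exporter@*; do
        # An unmatched glob stays literal, so test that the path really exists.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Prints the cleanup commands from the discussion rather than executing them.
# Prints nothing when no per-instance defaults files are found.
plan_cleanup() {
    if has_instance_configs "$1"; then
        echo "systemctl disable --now $PLAIN_UNIT"
        echo "systemctl reset-failed $PLAIN_UNIT"
    fi
}

plan_cleanup "$DEFAULTS_DIR"
```

On a host that also runs a single instance alongside the @section ones, the printed commands should not be applied.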