[00:41:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@m1.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:41:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@m1.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:29] ^ Fixed those, they come from yesterday's maintenance
[08:01:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@m1.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:32] jynus: I think it is very likely m1 backups failed after the maintenance
[09:26:41] As I forgot to add the user there
[09:26:50] Do you want me to add the user and re-run m1 manually?
[09:27:17] let me see
[09:27:37] Indeed "Last job for this section: dump.m1.2025-01-14--00-00-02 failed!"
[09:27:45] I can add it
[09:28:07] what's the host?
[09:28:12] db2160:3321
[09:28:29] ok, I will take care of it if you have finished working on it
[09:29:29] Yes, it is all done
[09:29:31] Thank you so much
[09:29:33] And sorry!
[09:43:14] Amir1: I am going to depool pc4 entirely
[09:46:26] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2234:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:11] pc4 repooled
[10:15:50] is there a ticket for the m1 maintenance?
[10:17:25] jynus: there is this: https://phabricator.wikimedia.org/T373579
[10:18:27] thanks
[10:26:27] I am testing a fix for more granular grants at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1111182
[10:26:41] for 10.6
[10:27:00] using the m1 backup as a test. Will merge if it works, or tune it if it has issues.
[10:27:08] thank you jynus
[10:27:54] can I get a sanity check at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1111185 ?
[10:28:08] let me see
[10:30:34] marostegui: all good
[10:30:36] thank you jaime!
[10:33:31] I will do db2239 now, after finishing with dbprov2*, as that was a requirement
[10:34:27] nice! thanks
[10:41:38] you are working with m5, right?
[10:41:54] ah, I saw your message on operations-
[10:41:55] now
[10:42:01] yeah :)
[11:59:28] Amir1: you still working with pc5?
[11:59:41] marostegui: I just stopped it
[11:59:52] Great, so I can depool it?
[12:00:04] go ahead!
[12:02:45] ok starting my work there
[12:08:11] Repooled
[12:08:38] <3
[12:23:21] Replacing the hosts with "not-as-warm" hosts didn't have much impact https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-24h&to=now
[12:23:23] So we are good
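A minimal sketch of the depool/repool steps mentioned above, assuming the standard dbctl workflow; the host name pc2014 and the commit messages are illustrative, not taken from the log:

    sudo dbctl instance pc2014 depool                                # take the parsercache host out of rotation
    sudo dbctl config commit -m "Depool pc2014 for maintenance"      # push the change to the live config
    # ... maintenance work on the host ...
    sudo dbctl instance pc2014 pool                                  # put it back in rotation
    sudo dbctl config commit -m "Repool pc2014 after maintenance"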
[13:03:58] there are 11 DB hosts left on Bullseye, I'm currently fixing all uses of bullseye-backports (which will soon get archived), none of these will ever get reimaged with bullseye again, right? then I would simplify the apt pinning for prometheus-mysqld-exporter? all 11 hosts will continue to have the current 0.13.0-1~bpo11+1 running
[13:04:28] the alternative would be to copy prometheus-mysqld-exporter 0.13.0-1~bpo11+1 to some internal repo component
[13:05:49] or I could also import 0.13.0-1~bpo11+1 to the "main" section of apt.wikimedia.org for bullseye-wikimedia, nothing appears to use the 0.12.1+ds-3 version in standard Bullseye
[13:11:05] I am taking care of reimaging backup-sources to bookworm, that's 9 of those
[13:11:22] the others are probably about to be decommissioned or are some other team's
[13:11:55] so not worth handling (this is happening immediately)
[13:11:58] *probably
[13:30:17] marostegui: The sister keys helped a lot
[13:30:36] moritzm: you have the list?
[13:30:41] of the databases that is
[13:33:21] you mean the servers running prometheus-0.13.0-1~bpo11+1?
[13:33:22] https://paste.debian.net/hidden/aa00f362/
[13:37:14] Emperor: o/ to double check, what are the ms-be nodes that are available for testing? ms-be[12]88?
[13:38:59] thanks moritzm that is what I wanted yes
[13:39:20] ah no, ms-be1091 and ms-be2088 IIUC from profile::swift::storagehosts
[13:40:35] moritzm: All those will get addressed by jaime, yes
[13:43:58] elukey: let me just check
[13:45:32] elukey: yes, ms-be2088 and ms-be1091
[13:47:11] okok! I am going to test some storcli commands there
[13:48:58] "0 Failed - Controller does not support JBOD"
[13:49:00] very nice
[13:58:43] interesting. Sorry I failed to test for jbod when I did my checks, I was mostly thinking about the impact for future dbs
[14:00:18] nono please, not your fault, I think we assumed something was working on that controller, apparently not
[14:00:29] I'll follow up with our Supermicro rep to get some info
[14:01:04] TY :)
[14:07:09] marostegui: ack, th
[14:07:10] marostegui: ack, thx
[14:08:36] something that just came to mind - most of the hadoop workers have 2x2.5" SSDs in hw raid 1 mode, and 12 disks set up as RAID 0 volumes since IIRC the controller didn't allow JBOD
[14:08:48] and indeed the controller is a relative of the ms-be one:
[14:08:49] Broadcom / LSI MegaRAID SAS-3 3108
[14:09:29] we have a cookbook for Data Platform that uses megacli to set various things, but this may be a similar use case
[14:10:23] https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk
[14:10:51] at the time I didn't know why we used single-disk RAID 0 volumes, since the cluster was already there, and I never really investigated it
[14:11:33] so I suspect that even for ms-be, if we want to have a "proper" hot swap, the option is to use raid0 single disks
[14:11:57] I am not suggesting to do it, just thinking out loud :)
[14:12:05] I'll follow up with Supermicro to confirm
[14:12:07] marostegui: m1 backup failed, but it was probably the new grants
[14:12:21] jynus: Do you want me to copy the old ones?
[14:12:33] no, I did
[14:12:59] it is just that I also restricted them a bit, as I thought they were wider than they should be due to the 10.6 upgrade
[14:13:20] I will give it a look, and if it is unclear, return to the original ones and investigate later
[14:14:43] another thing I've been really wanting is more automated grant handling, you know how much I've pestered you about that!
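A rough sketch of the kind of storcli probing referred to above, assuming the storcli64 binary and controller /c0; the enclosure:slot IDs are illustrative, and the set jbod=on call is presumably what produced the "Controller does not support JBOD" output quoted earlier:

    sudo storcli64 /c0 show all | grep -i jbod        # report controller capabilities and current JBOD setting
    sudo storcli64 /c0 set jbod=on                    # enable JBOD mode; fails on controllers without JBOD support
    # the fallback mentioned for the hadoop workers: one RAID-0 virtual drive per physical disk
    sudo storcli64 /c0 add vd type=raid0 drives=252:3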
[14:16:04] [ERROR] - Couldn't acquire global lock, snapshots will not be consistent: Access denied; you need (at least one of) the RELOAD privilege(s) for this operation
[14:17:02] elukey: we moved away from single-disk-RAID-0 for ms-be hosts on older kit, I wouldn't want to go back
[14:17:34] (it made for confusing failure modes, and a harder understanding of which disks were where in what state)
[14:18:07] Emperor: ack, good to know
[14:31:19] jynus: is that RELOAD requirement new?
[14:34:11] no, I removed some grants that seemed too broad
[14:34:22] mariadb added more fine-grained ones
[14:34:37] for 10.6, but on upgrade it just added all the equivalent ones
[14:34:55] so it needs to be redone, I am testing it at the same time
[14:35:13] that way I can remove SUPER from the dump user
[14:35:25] this is actually a nice feature I asked for
[14:35:55] but docs for 3rd party tools are probably not well updated
[14:36:01] <3
[15:20:31] jynus: for m2, db2160:3322 needs to be moved under the new master (db2233). I can do it, but asking in case you want to do it yourself
[15:23:36] please do it, treat those as normal dbs
[15:24:42] Wilco
[16:15:42] greetings, data-persistence friends - if we're able to get the new conftool release out today for part 1 of T383324, would it be alright if I go ahead and update the `flavor` property on parsercache sections to "parsercache" as planned?
[16:15:42] as discussed in the task, this will be a noop for the generated config (i.e., no diff to commit)
[16:15:42] T383324: Prevent too many parsercache sections from being depooled - https://phabricator.wikimedia.org/T383324
[16:17:40] also happy to wait until tomorrow, of course :)
[16:37:51] +1 from me fwiw
[22:04:25] thanks, all! FYI, I did not move ahead with updating the parsercache sections today, since I realized it would make a conftool rollback (albeit unlikely) quite complicated :)
[22:05:25] I'll wait until my morning tomorrow before moving forward
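Going back to the grant discussion above: a hypothetical sketch of replacing SUPER with the finer-grained MariaDB 10.6 privileges for the dump user; the account name, host pattern and exact privilege list are made up for illustration, the real set is whatever the puppet change settles on:

    sudo mysql -e "
      REVOKE SUPER ON *.* FROM 'dump'@'10.%';
      -- RELOAD is the privilege the backup error above was missing (needed for the global lock)
      GRANT RELOAD, PROCESS, BINLOG MONITOR, SLAVE MONITOR ON *.* TO 'dump'@'10.%';
      GRANT SELECT, SHOW VIEW, TRIGGER, EVENT, LOCK TABLES ON *.* TO 'dump'@'10.%';
    "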