[02:48:25] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:25] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:51:25] FIRING: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:25] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:25] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:22:37] FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:29:52] Are es read-only hosts being touched? I would like to regenerate backups after 5 years, but only if there is no expected maintenance (it will take multiple hours to complete)
[06:30:49] last time it took 16 hours on each section
[06:31:48] I'm not in a hurry, but it is needed before decommission of the old backup hosts
[07:04:29] marostegui: Amir1: can I start s4 in codfw?
[07:15:02] federico3: I wouldn't do it as it's Friday
[07:16:41] db2151 is spamming its logs with: 2025-06-06T07:16:16.848170+00:00 db2151 mysqld[3415384]: 2025-06-06 7:16:16 3143225 [ERROR] Incorrect definition of table mysql.column_stats: expected column 'hist_type' at position 9 to have type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB','JSON_HB'), found type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB').
[07:16:53] and is about to run out of space
[07:37:00] marostegui: ok
[07:37:39] marostegui, Amir1: when you have a minute can you review the changes on https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3 so we can move forward? thanks
[08:20:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on db2151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:25] FIRING: [2x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on db2151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:41] looking
[09:36:49] I said above what it was
[09:37:38] I'm checking the depooling status
[09:38:02] icinga alarmed but the alarm did not show up here as "ALERT"... why?
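
For context, a minimal triage sketch for the "root filesystem filling up with log spam" situation described above; the paths and commands are generic assumptions, not the exact steps taken on db2151:

  # confirm which filesystem is filling and what is eating it
  df -h /
  sudo du -xsh /var/log/* | sort -h | tail
  # see what is being spammed
  sudo tail -n 5 /var/log/syslog
  # reclaim space without breaking the file handle rsyslog keeps open
  sudo truncate -s 0 /var/log/syslog
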
[09:41:35] ok it's depooled now
[09:43:11] I suggest we decide between 1) we want to investigate the error or 2) we instead want to prioritize cloning the host and putting it back in production
[09:48:18] looking at the load on other databases I don't think we are under heavy load requiring an immediate clone + repool, but in any case we want to clone the host anyway
[10:22:37] FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:25:30] Yeah, DC-Ops didn't actually make any changes to that system yesterday, so it still has the dead disk unswapped and a real disk that got swapped out for a blank, which I've not yet started using in the hope they could put the disk back. I've updated T395990 to hopefully provide clearer instructions
[10:25:31] T395990: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990
[10:26:07] I'll extend the silence until Monday, no point everyone getting bothered over the weekend, and try and sync up with u.random when he's in later to try and get this fixed while I'm OoO
[10:27:55] Thanks Emperor, both your attention and your communication are very useful
[10:28:36] hopefully your oncall week is treating you more or less well
[10:29:46] marostegui: I've opened https://phabricator.wikimedia.org/T396200
[10:30:14] Oh, DBAs, are you expecting problems with db1153 and db2143? I see them both alerting for puppet has failed to generate resources
[10:31:10] Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the name 'mariadb::parsercache::shard'
[10:31:15] ^ Emperor
[10:33:19] maybe fe20621159c21488cb68 ? I am not sure. My guess is manuel will know when back.
[10:33:57] sure, I just thought I'd flag it given I was looking at alerts anyway :)
[10:34:07] you did well
[10:34:24] I think manuel está de lunch
[10:34:35] is on*
[10:34:48] if not, I will file a ticket to follow up
[10:36:39] (sorry, as soon as I switch one word to a different language, my brain cannot go back :-P)
[10:37:33] I solve this problem by being hopeless at all languages other than English ;p
[10:37:55] Emperor: I will show you Toki Pona soon to fix that
[10:40:22] Emperor: I'll check thanks!
[10:40:31] It's related to a change I did yesterday yes
[10:45:32] yeah, I was about to say
[11:47:29] federico3: Can you ack db2151 alerts?
[11:48:10] I created a silence for 7 days on alertmanager but evidently it was not enough :) looking
[11:49:10] ah we probably want to silence the whole host
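
A sketch of silencing everything on the host rather than a single alert, assuming amtool is available; the Alertmanager URL and the matcher are placeholders (in practice the alerts.wikimedia.org UI or a downtime cookbook would be the usual route):

  # silence every alert whose instance label points at db2151, for 7 days
  amtool silence add --alertmanager.url='http://alertmanager.example:9093' \
    --comment='db2151 / filled up, see T396200' --duration=7d \
    'instance=~"db2151.*"'
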
[11:50:17] federico3: What was the cause of / being full?
[11:56:10] marostegui: I think it was a combination of a few things: 1) the incorrect table definition (and I haven't investigated where the table definition came from) 2) it then spammed the logs, filling root via /var/log/syslog 3) there's no limit on syslog filesize growth 4) there was no alert here so it kept growing for hours
[11:57:55] federico3: I think the problem is that mysql_upgrade wasn't run after the upgrade (which I did yesterday)
[11:57:58] I just ran the upgrade
[11:58:04] jynus mentioned doing the upgrade
[11:58:06] let me truncate the log and see if it keeps going
[11:58:39] I truncated syslog.1
[11:58:53] the /srv partition looked ok (as in not filled) and mysql itself did not crash so we *might* get away without cloning, but I understood that we usually clone hosts after incidents
[11:59:19] I think the issue is fixed, but it doesn't take long to clone, so let's do that
[11:59:22] I will update the task
[12:00:01] actually cloning won't work if /srv is preserved, as it is a mysql_upgrade issue BTW
[12:00:17] sorry, I meant reimage
[12:00:24] cloning will solve it
[12:00:37] jynus: That's exactly what I mentioned above
[12:00:38] I hope you understood what I meant
[12:00:48] sorry, I was about to leave
[12:01:07] Updated: https://phabricator.wikimedia.org/T396200#10890676
[12:01:11] remember I will not return until Tuesday
[12:01:25] FYI
[12:01:36] jynus: get some rest
[12:01:41] (I was just filling in the doc to go away)
[12:01:47] :-*
[12:02:38] if any dba has time on monday, please look at a good plan for the long es RO backups
[12:03:01] as that is blocking me (but probably not until Thursday): https://phabricator.wikimedia.org/T387892
[12:08:09] marostegui: I would suggest addressing the syslog size limit and the alerting on FS space in 2 independent tasks. Prometheus should be able to provide predictive alerts well before the disk space gets close to zero
[12:08:58] federico3: I agree, I'd suggest creating a task with o11y about it and at the same time recloning the host on Monday to put it back into production asap
[12:09:06] also I can start the cloning now but pool the host in on Monday to be on the safe side?
[12:09:32] Yeah
[12:18:15] can I clone from 2158? Anybody doing something on it?
[12:19:17] Go for it
[12:56:59] marostegui: regarding https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3#note_144577 the create user / grant cannot be tested in a transaction - where can I test it?
[12:58:13] federico3: db2230
[12:58:55] it could be really useful if the script did the change first on db2230 and then on the other hosts after asking for confirmation, does that make sense?
[12:59:08] I am not sure I am following
[12:59:17] which other host?
[13:01:23] I mean right now the script does a big for loop over all databases - instead we first do the change on db2230, then ask_confirmation("...") and if everything went well, apply the change on the prod hosts
[13:02:06] federico3: Isn't it easier to check on db2230, make sure all looks good (including trying a connection etc) and then we can just run the script everywhere?
[13:05:02] easier I'm not sure... however if we test stuff by copy-pasting queries by hand we are not testing the script itself, but I'm ok with both options
[13:05:45] federico3: Can you modify the script to only iterate on db2230?
[13:05:58] yep it takes 1 minute
[13:06:33] That is what I was implying above
[13:08:41] Amir1: in other scripts e.g. schema change you also added a check by doing a select IIRC, should I do the same?
[13:09:23] sorry I don't follow
[13:11:01] Amir1: in this script https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3/diffs#ba9713e5a86ff08960237852179b99e6e21ec9ea_0_69 after doing GRANT ALL ... I can do a SELECT to check the grant worked well and stop immediately otherwise
[13:11:48] yeah, you should do that, it would make the script idempotent
[13:12:33] :D I meant a check *after* the change, but if we also do it before, then yes
[13:12:57] before is better IMO
[13:13:16] after is meh, we can see it in the next omg report if something is missing
[13:13:58] I'd say both: if the change fails on one host we should not execute the same broken change on all DBs
[13:14:34] I agree, if the change fails, we should pause and see/report why and on which host it failed
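
A rough shell sketch of the canary-then-fleet flow being discussed, written against the mysql client rather than the MR's actual Python; the user, grant and host list are placeholders, not the script's real values:

  CANARY=db2230
  HOSTS='db2151 db2152'            # placeholder production host list
  NEWUSER='rotated_user'           # placeholder user name
  apply_and_verify() {
    local host="$1"
    mysql -h "$host" -e "CREATE USER IF NOT EXISTS '${NEWUSER}'@'%' IDENTIFIED BY 'REDACTED';"
    mysql -h "$host" -e "GRANT ALL PRIVILEGES ON somedb.* TO '${NEWUSER}'@'%';"
    # verify the grant actually landed before touching anything else
    mysql -h "$host" -e "SHOW GRANTS FOR '${NEWUSER}'@'%';" | grep -q 'GRANT ALL' || return 1
  }
  apply_and_verify "$CANARY" || { echo 'canary db2230 failed, aborting'; exit 1; }
  read -rp 'canary looks good, continue with production hosts? [y/N] ' answer
  [ "$answer" = y ] || exit 0
  for h in $HOSTS; do
    apply_and_verify "$h" || { echo "grant check failed on $h, stopping"; exit 1; }
  done
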
[14:28:30] hola
[14:28:39] hola
[14:28:44] hey
[14:29:14] replacing the disk in the second enclosure that is blinking
[14:29:27] i have the drive i pulled before set aside and i'll swap it back in afterwards
[14:29:39] cool, thanks
[14:29:51] wow, my ssh session just dropped
[14:29:58] that can't bode well
[14:33:31] replaced
[14:33:41] so... did it just power cycle?
[14:33:46] not sure why but it looks like it did
[14:34:09] T_T
[14:34:13] yeah
[14:35:29] it's not coming back up.
[14:35:48] it's complaining of offline or missing virtual drives with preserved cache
[14:36:03] that would fit, yes.
[14:36:25] (we'd normally not reboot with it in this state)
[14:37:06] I think it's saying to press any key to continue, or 'C' to load the configuration utility
[14:37:18] I'm attached to the console via the webui
[14:37:30] Cool. Follow your nose, I guess
[14:37:49] so run screaming from the room and never return?
[14:38:21] * urandom sighs
[14:38:58] "any key" (Enter in this case) opened the configuration utility 🙄
[14:43:23] ok, as I expected, the output was a little scrambled by the virtual console... it's any key to enter the configuration utility, and no options otherwise
[14:43:34] so we're expected to resolve this before continuing
[14:43:54] so... I guess we have to clear the preserved cache?
[14:45:47] Did JennH put the previously-swapped-out-but-good drive back?
[14:46:02] not yet
[14:46:10] If not, now might be the time to do that and then powercycle again.
[14:46:18] i will do it now then
[14:46:32] On the basis that then it ___might___ stand a chance of rebuilding it again, whereas once we discard the cache it'll be gone.
[14:48:34] I guess that was 00:02:05
[14:48:51] I saw its state change to "Foreign"
[14:49:04] oh, and also 00:02:04 now
[14:49:17] i'm gonna stop
[14:49:28] i thought i knew which one was the old one
[14:49:32] and i was wrong
[14:49:40] i'm very sorry
[14:50:15] Ah, OK. Yes, we now have 2 ready, 2 foreign, 20 good [plus the 2 OS disks]
[14:50:34] i'm gonna run from the room screaming now
[14:50:54] JennH: don't stress, swift keeps 3x everything and can cope with a system going pop.
[14:51:22] urandom: I suspect you may have to clear the foreign config on those drives; if we end up with them wiped it's not the end of the world
[14:52:05] any idea how to do that?
[14:53:17] Honestly, no. Once booted, I think 'sudo megacli -CfgForeign -Clear -a0'
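
For reference, roughly what that looks like from a booted OS with the megacli tool; the exact flags here are from memory, so treat this as a sketch and check megacli's own help output before running anything destructive:

  # list physical drives and their firmware state (Online, Foreign, Unconfigured, ...)
  sudo megacli -PDList -a0 | grep -E 'Slot|Firmware state'
  # show, then clear, any foreign configuration on the controller
  sudo megacli -CfgForeign -Scan -a0
  sudo megacli -CfgForeign -Clear -a0
  # show and discard preserved (pinned) cache belonging to a missing virtual drive
  sudo megacli -GetPreservedCacheList -a0
  sudo megacli -DiscardPreservedCache -L0 -a0   # -L<n> is the virtual drive id
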
[14:53:43] Force Online?
[14:53:49] Yeah, give that a go
[14:54:00] https://usercontent.irccloud-cdn.com/file/tSJBatbz/image.png
[14:54:15] those are the options
[14:56:37] gah
[14:56:56] I cannot figure out how to activate any of the things in that menu from the console
[14:57:03] keybinding issue maybe?
[14:57:28] Try rebuild? [I'd usually expect up and down arrow and then Return]
[14:57:51] up & down arrow did not work
[14:58:07] you in via web or ssh?
[14:58:10] and now we're powercycling again because all I managed to do was escape
[14:58:22] I was doing it via ssh just now
[14:58:33] tab didn't work either
[14:58:43] nor page up/down
[14:59:15] huh
[15:03:00] no, I can't for the life of me navigate this thing
[15:03:07] want me to have a poke?
[15:03:17] (I may fare no better of course)
[15:03:25] damn
[15:03:36] yeah, go ahead, but it needs to be power cycled again
[15:03:45] I exited it again...
[15:04:26] is it currently rebooting?
[15:04:32] ah, yes, I see it is
[15:04:39] yes because I exited the config utility
[15:04:43] and it forces you to
[15:05:00] I didn't intentionally exit.... but the keyboard bindings are... interesting
[15:05:13] it only takes a second to get back there
[15:05:24] Emperor: are you attaching to the web ui or ssh?
[15:05:35] each is its own unique hell in this regard
[15:09:11] ctrl-n & ctrl-p will move you through the menu items at the top fwiw
[15:09:21] like to PD mgmt
[15:09:54] but you're activating "dialogs" on those items which is more than I was able to do
[15:10:11] there we go... can you navigate that?
[15:10:15] oh you can!
[15:10:21] how are you moving through those!
[15:10:22] ?
[15:15:54] via the web iDRAC, and using the "keyboard" button at the top to bring up the keyboard
[15:17:53] o.O
[15:18:15] OK, eventually found how to clear foreign config (go to Foreign view, move up to be on the controller, F2 for operations)
[15:18:24] Tried importing, but no joy, so tried clearing, now rebooting
[15:18:31] yeah, I was watching
[15:19:01] A wrinkle is you keep having to click on the display again because the keyboard focus ends up on the top buttons quite a bit.
[15:19:29] 🤦‍♂️
[15:19:54] we're back to the preserved cache bit
[15:23:50] OK, found how to clear that (into VD mgmt, to controller, F2, manage preserved cache, Delete, YOLO)
[15:25:10] that looks promising
[15:25:24] insofar as getting it booted anyway
[15:28:12] ok, so the order of everything is really bad
[15:28:16] right, let's try a CfgEachDskRaid0
[15:29:52] my ssh session froze again
[15:29:58] the a/b swap will mean puppet never being happy
[15:30:21] ok, nevermind... back
[15:30:56] marostegui, Amir1: I spoke with godog a bit about enabling disk space predictions: it would have spotted the issue in advance: https://grafana-rw.wikimedia.org/d/419f8741-4177-49bc-939d-6ee002ac9b70/mariadb-free-disk-space-predictions?orgId=1&from=now-2d&to=now&timezone=utc
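
The prediction behind that dashboard is presumably along the lines of Prometheus's predict_linear over the node_exporter filesystem metrics; a hand-rolled version of such a query, with a placeholder Prometheus URL, might look like:

  # flag hosts whose / is predicted (linear fit over the last 6h) to run out within 4 days
  curl -sG 'http://prometheus.example:9090/api/v1/query' \
    --data-urlencode 'query=predict_linear(node_filesystem_avail_bytes{instance="db2151:9100",mountpoint="/"}[6h], 4 * 86400) < 0'
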
[15:31:59] urandom: very sorry, my taxi shows up in 5m, I've got to go. I'd suggest rebooting until a/b are the right way round (the rest matter less), and then trying to make sure you've got each LABEL in fstab on exactly one disk :-/
[15:32:26] [you may have to stop puppet and muck around making the fs yourself as in my email if the drives don't come up in the right order]
[15:32:26] you mean /dev/a & b?
[15:32:59] ok, enjoy your time off!
[15:33:00] yes, if /dev/sda3 isn't /srv/swift-storage/sda3 (likewise a4 b3 b4) then IME puppet will never be happy
[15:33:12] sorry to run off with it still not fixed
[15:36:50] blkid will tell you what LABELs are where (other than {a,b}{3,4} order matters less as long as there's only 1 of each); if puppet does the wrong thing you may need to wipefs and then mkfs - once puppet/you have made all the filesystems, mount -a and run puppet again and it should be GTG
[20:20:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2158:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:11:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:46:48] FIRING: [2x] PuppetFailure: Puppet has failed on thanos-be2006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
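
To close the loop on the 15:36:50 recipe above, a sketch of what that recovery looks like on the host; the device names and label are illustrative assumptions, the real layout comes from fstab:

  # see which LABELs ended up on which partitions after the reshuffle
  sudo blkid /dev/sd{a,b}{3,4}
  grep swift-storage /etc/fstab
  # if a partition carries the wrong or a duplicate label, wipe it and re-make the filesystem
  sudo wipefs -a /dev/sda3                  # double-check the device before wiping
  sudo mkfs.xfs -L swift-sda3 /dev/sda3     # the label name here is an assumption
  # mount everything from fstab and let puppet converge again
  sudo mount -a
  sudo puppet agent --test
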