[02:15:10] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:55] FIRING: [2x] SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:31:17] cezmunsta: do you have a gitlab user? https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/15
[06:00:19] marostegui: can I start the eqiad switchovers?
[06:05:06] I'd just do 1 per day no more
[06:05:09] but yes, go for it
[06:05:15] ok
[06:34:28] the switchover and schema change are done
[06:35:22] great
[06:39:26] marostegui: can I switchover s1 in codfw?
[06:39:38] sure
[07:08:29] marostegui: yes, CWilliams
[07:09:08] cezmunsta: Interesting, it doesn't autocomplete when I am trying to add you in a comment on that link
[07:09:31] https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/15#note_208360 did that notify you?
[07:11:19] Yep
[07:13:07] I wonder why it doesn't autocomplete, maybe because you're not part of the project?
[07:20:52] yesterday's rclone run was rather impacted by the incident, I'll reset the unit
[07:24:55] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:33:23] marostegui: probably. I have used the "request access" on data_persistence, so you could see if that changes things now?
[07:33:34] Just granted it
[07:33:48] cezmunsta: now it works
[07:34:41] marostegui: :)
[08:32:42] I probably forgot, but just to confirm Amir1 x3 is fine with no backups right?
[08:32:50] in eqiad that is
[08:33:00] Ah no, sorry, nevermind, I saw we do have the sources there
[08:33:05] That sounds more normal
[08:37:03] https://phab.wmfusercontent.org/file/data/iuqf2j73ezribxadqxvq/PHID-FILE-44spaa56hpvy65g4d2fu/image.png
[08:37:51] I was also informed to prepare for x4 backups
[08:44:15] marostegui, Amir1: if we are confident with merging https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/13 I can then rebase https://gitlab.wikimedia.org/repos/data_persistence/dbtools/scripts/-/merge_requests/15 over it and implement the fleet-wide "for" as discussed. In the meantime I'm starting the safest reboots for T426633
[08:44:51] federico3: please show cezmunsta the process of the reboots as discussed yesterday
[08:45:54] sure
[12:57:51] @marostegui p.age for db1249
[12:58:15] federico3: let me know if you need help
[12:58:59] it's just a replica luckily
[13:00:26] interesting, can you ssh into root@db1249.mgmt.eqiad.wmnet ?
[13:00:32] let me see
[13:00:57] not even the mgmt console is responding
[13:01:05] yep, ping down there
[13:01:09] maybe a complete power loss?
[13:01:16] maybe, is it depooled?
[13:01:56] the depooling just failed , one sec
[13:02:05] hey
[13:02:15] I'd: depool it, disable notifications (so it doesn't page when it comes back up with mariadb down), create a task and ask dcops to check onsite
[13:02:16] you guys are looking at the same thing I see
[13:02:30] topranks: something going on on the net?
[13:02:32] switch port is down for it anyway, I was gonna try to log on via idrac yep
[13:02:44] the port is enabled, maybe a cable gone wrong but that's rare
[13:02:47] depooled by hand on dbctl
[13:02:52] if it was network it will be up, just lagged
[13:03:06] jynus: how can it be up if the network is down?
:)
[13:03:17] the process will be up, I mean
[13:03:19] federico3: Can you disable notifications for it and create a task for dcops?
[13:03:23] yep
[13:03:24] jynus: yes, but it is not reachable
[13:03:24] in their own black hole
[13:03:39] mgmt is also down
[13:03:58] yes, it is hard down
[13:04:03] federico3: thanks
[13:05:12] odd both would die, asked in dc-ops was anyone working in the rack
[13:05:31] thanks topranks
[13:05:37] in terms of severity it is depooled? is there anything else for on-call to look at?
[13:05:45] topranks: nah, we'll handle it
[13:05:49] it is depooled
[13:05:51] you are good
[13:05:52] ok thanks <3
[13:06:03] topranks: the switch is the TOR?
[13:06:49] dbas let me know if I should pause db maintenance, I was about to reimage a host
[13:06:54] federico3: yes "top of rack switch"
[13:07:04] A:lsw1-d8-eqiad# show interface * brief | grep db1249
[13:07:04] | ethernet-1/4 | enable | down | 1G | GIGE-T | db1249 {#3385} |
[13:07:06] not yet running the onion router
[13:07:17] not YET!
[13:07:20] :)
[13:07:38] the mgmt / idrac is on a different switch / network path, so the fact they are both down suggests power supply loss or similar
[13:07:41] (what a lost opportunity to call them "top high of the rack")
[13:08:10] jynus: you are ok
[13:08:23] marostegui: thanks, will continue with my maintenance
[13:08:26] jynus: bit of a thor point that
[13:08:57] topranks: that's what I suspected. Is the mgmt interface powered by the same PSU?
[13:09:17] yes but the host has dual PSUs, connected to separate A/B power feeds
[13:09:23] so....
this is kind of unusual
[13:10:09] unusual but not unheard of that one PSU failing impacts the other PSU if the latter had some latent issues
[13:10:22] re: losing both interfaces at the same time, the only time this has happened to me is when the motherboard was fried
[13:10:35] indeed yes, dc-ops didn't respond yet but they'll need to take a look
[13:11:00] (assuming the rest of the rack is ok)
[13:11:54] yeah nothing for the rest of the rack, its switch is up and all ports other than this show up
[13:55:16] jynus: db1204 is a backup host, it paged
[13:55:23] it is being reimaged I think
[13:55:33] [15:46:38] <+logmsgbot> !log root@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage
[13:55:42] it paged?
[13:55:57] why did it remove its downtime while being reimaged?
[13:56:53] it alerted in -operations as "mysqld processes #p age on db1204 is CRITICAL: PROCS"
[13:57:08] no, it is a bug on the downtime reimage
[13:57:19] Exception raised while executing cookbook sre.hosts.downtime:
[13:57:28] spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
[13:57:46] some race condition happened and it failed to be downtimed after reimage
[13:58:18] I blame Volans if we were not a blameless culture!
[13:59:59] so sorry, people, but I did everything as usual, but for some reason this time the downtime failed after reimage
[14:11:04] I think it may be that in general people may not want cookbooks to fail if downtiming doesn't work
[14:31:04] I am repooling pc1 with new hw in eqiad
[15:14:25] Emperor: nope, that's ok, I don't want that, and that's ok
[15:14:56] Emperor: it was that there was a first time error I found while reimaging (downtime failing)
[15:32:44] db1249 is alive
[15:33:43] did it crash or was it network in the end?
[15:35:43] D-: https://phabricator.wikimedia.org/T426750 neither
[15:38:55] :spark(le)s:
[15:41:41] I mean, technically, it crashed :-D
[15:41:54] it tends to happen when there is no power
[15:49:52] BTW, there is a replication error in test-s4, in case it was unexpected
[16:05:57] jynus: yes, I'm aware, thanks