[08:17:19] Amir1: are you still running thumbnail-deletions on the ms frontends (per T379942)? Would it be too inconvenient to pause that process over the holiday?
[08:17:20] T379942: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942
[09:11:22] Emperor: I will pause a couple of them in ms-fe2009 as that seems to be the main pain point
[09:11:55] I hoped it would be finished for those containers by now (it should actually finish soon, let me take a look)
[09:12:21] Amir1: thanks :)
[09:13:26] it's just I can't rule out the recent odd (and not-typical) unhappinesses with swift being down to the frontends being a bit over-loaded, so I'd both like to reduce our risk a bit over the holiday, and maybe rule in/out that these thumb-deletions are causing the problem.
[09:13:55] "Stop them all for a couple of weeks, see if it goes away" seemed like a reasonable approach
[09:15:14] a lot of the errors we are seeing are on eqiad too, and we don't run the cleaner on eqiad
[09:15:51] also right now ms-fe2009 has a lot of free cpu according to htop but doing a swift stat on a container is really slow
[09:16:10] it's swapping though (might be related, might not)
[09:16:41] This basically doesn't respond:
[09:16:42] root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.04
[09:16:50] it used to respond in 1 second
[09:17:05] (used to = three days ago)
[09:17:59] yeah, I noticed ms-fe2009 was swapping, which is very unusual
[09:18:23] though so is ms-fe2010
[09:19:56] none of the others are.
[09:21:21] Amir1: would it disrupt you if I rebooted those two frontends? I could depool / swapoff / swapon / repool instead
[09:22:07] I'd have to restart all scripts, it's doable but let me hold the script somewhere
[09:22:22] *the command
[09:25:44] Emperor: actually for ms-fe2010, could you do depool/swapoff/on/repool instead, since I'd have to re-add creds again (puppet wipes them every time I add them)
[09:26:13] for ms-fe2009, I'm taking things off the screen
[09:31:33] ms-fe2009 is ready for reboot. I won't start anything there until Jan
[09:35:26] ack, ta, I'll do that first.
[09:45:27] done, now to try clearing the swap of ms-fe2010 without rebooting
[09:51:12] Amir1: done, though I notice that ms-fe2010 is still running a bunch of thumb deletions...
[09:53:39] yeah, that's why I asked not to reboot them, I stopped ms-fe2009 because restart there is easier (creds)
[09:58:36] Hm, let's see how things are looking later today, but I'm feeling twitchy about that
[11:27:37] trivial log-grepping for the 504s is turning up that they're all (small sample size) thumbnails, which is making me wonder if thumbor is 100% OK
[12:34:09] Hi, we're having issues with ms-be2075 (looks like a bad SATA cable is causing resets to the drive) and we want to take the node out of production. Docs suggest draining first but I hear that takes days. Advice? We're over on -operations
[14:02:12] FIRING: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:17] RESOLVED: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:03:41] Amir1: obviously there is nothing else happening today, but do you have opinions on T382694?
[15:03:42] T382694: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694
[17:02:21] Emperor: ms-fe2009 is unhappy while there is no thumbor deletion script running, it might be a hardware issue?
[17:04:32] I'm trying to take a look at that ticket
[17:04:35] Amir1: there's a bust backend node in codfw - T382707, going to pull it out of the rings a bit later
[17:04:36] T382707: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707
[17:04:58] codfw ms depooled ATM (so only getting writes)
[17:09:24] wrote a respond there
[17:09:32] *response
[17:10:40] Amir1: thanks! OK, if I paste the delete command, are you OK to +1 it before I run it?
[17:12:36] swift delete wikipedia-commons-local-public.88 8/88/Model_4000-First_of_Odakyu_Electric_Railway_2.JPG
[17:15:43] Emperor: lgtm
[17:22:03] ta
[19:17:12] FIRING: SystemdUnitFailed: swift_ring_manager.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:20:08] [non-zero exit from swift-dispersion-report]
[19:22:12] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:34:17] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:52:12] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:54:17] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:07:12] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:17] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:57:12] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed