[05:06:00] depooling pc4 to replace its codfw master
[07:49:01] Morning folks, could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1288441 please? restore 2 nodes to the rings, drain the final(!) two for VLAN and storage move
[07:49:31] done
[07:49:40] thanks :)
[08:25:58] And another, please? This is eqiad, removing two drained nodes for reimage to new VLAN - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1288458
[08:34:43] done
[08:34:55] thanks :)
[09:23:12] marostegui: when removing a host from orchestrator, the doc shows either GUI or CLI, but the CLI has a number of steps... so, if using the GUI would the same steps be followed too?
[09:23:45] cezmunsta: the gui is way easier, just click on the host and "forget"
[09:24:33] so, no manual log to ops for the GUI?
[09:28:41] Hmm, I have RO mode showing, so I will use the CLI for the moment
[09:29:37] Can you add yourself to modules/orchestrator/templates/orchestrator.conf.json.erb?
[09:29:43] So you can have RW on the gui too
[09:31:59] Will do, SSH to dborch1001 failed anyway, so I will look at that afterwards
[09:32:17] Do you want me to remove the host for now?
[09:32:29] cezmunsta: the new host is dborch1002.wikimedia.org
[09:32:33] the org may be out of date
[09:33:16] OK, that responds .. I can do it via the CLI for now, update the doc and then do the patch for GUI access
[09:34:41] thanks
[09:44:54] marostegui: can I do the switchover of s4 in codfw now?
[09:45:15] yep
[10:16:03] Two more (I think the last for today!) - https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/19 to teach the ring manager about eqiad c7, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1288496 to put the new nodes back into the rings and drain the last pair, please...
[10:17:10] looking
[10:21:20] thanks :)
[10:59:07] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:07] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:11] ^-- consequence of rebooting all the things
[11:14:27] there will be similar for 1009 once the cookbook gets there
[11:35:35] marostegui: can I do a switchover of s5 codfw as well?
[11:37:10] federico3: yes
[11:49:07] FIRING: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:53:46] db2213 is lagging
[11:54:06] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:19] federico3: have you checked what's going on?
[11:54:34] just after the switchover, looking
[11:55:03] federico3: my bet is that puppet wasn't run and pt-heartbeat isn't running?
[11:55:33] it's catching up
[11:55:54] the problem is that heartbeat has the old entry
[11:55:58] was the delete run?
[11:56:39] federico3: ^ run the delete
[11:57:39] yes I haven't run the delete yet, that fixed it
[11:58:00] simply follow the checklist from the switchover and you'll be good
[11:59:23] well the checklist does the change in dbctl before fixing the heartbeat, maybe in the helper we could do the deletion as part of the primary switch step
[11:59:45] federico3: yes, because that's avoiding downtime
[11:59:54] federico3: let's fix everything and we can discuss things later
[11:59:59] but please finish the switchover first entirely
[12:14:07] FIRING: [3x] SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:07] FIRING: [3x] SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:17] systemctl status on ms-fe1009 says "ok" to me
[12:24:07] RESOLVED: [3x] SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:44:07] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:49:06] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:10] Amir1 federico3 cezmunsta https://gerrit.wikimedia.org/r/c/operations/puppet/+/1288845
[14:19:50] Amir1: ms-fe* reboots all done, thanks for your patience
[14:22:11] Thanks
[14:24:07] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:07] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:32:04] ^-- should be fixed
[14:34:07] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:24:07] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:07] FIRING: [2x] SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:14:09] FIRING: [2x] SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:55] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed