[07:53:07] no worries
[07:53:09] and greetings
[08:18:31] looking into what replica lag alert for clouddb1013 is about
[08:19:25] godog: see -data-persistence
[08:20:15] taavi: what's the tl;dr ? I'm not in the channel
[08:20:44] the mariadb process segfaulted and restarted yesterday
[08:20:52] ow :( ok thank you
[08:20:59] so I was waiting for them to look before restarting it
[08:21:06] in the meantime I'll depool it to move traffic to the other one
[08:23:03] ack
[09:07:57] morning
[09:08:56] o/
[09:11:52] never seen a segfault in clouddbs before... the fact that I restarted it on Friday seems suspicious
[09:12:29] I also installed apt updates before restarting, but mariadb itself was not upgraded
[09:22:42] I was wrong, mariadb _was_ upgraded to the latest patch version 10.11.16
[09:23:13] I don't know how I missed it in the list of packages to be upgraded, but in any case I would have upgraded it anyway
[09:23:42] 10.11.16 is only in 1 production host at the moment, so that sounds like a possible explanation for the segfault
[09:24:50] need to run an errand shortly then will take lunch break, bbiab
[09:45:46] I created T420177
[09:45:46] T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177
[11:47:16] andrewbogott: I've been reminded we should upgrade the latest cloudgw to trixie
[13:40:49] that would be... cloudgw1003.eqiad.wmnet? Or are you thinking of something else?
[13:41:44] yes
[13:41:58] I can reimage that one right now if you'll be around for a bit.
[13:42:49] it's the standby currently
[13:44:23] sure
[13:44:59] great, here goes
[14:35:43] taavi: reimage finished, going to try a failover now
[14:36:18] all good
[14:37:33] I think that's all of cloud-vps on Trixie except for ceph nodes which are blocked forever by https://tracker.ceph.com/issues/73930
[14:52:50] actually, now I'm going to reboot cloudgw1004 for T419948 since it's now the standby
[15:47:23] dhinus: am I recalling correctly that there is no safe way to reboot the clouddumps hosts?
[15:47:49] correct. if they come up fast enough, the impact should be negligible
[15:47:59] but nfs connections will hang
[15:48:58] well we need to reboot the tools k8s workers anyway :D
[15:49:04] so might as well do clouddumps first
[15:49:09] I guess that's true :/
[15:49:22] makes sense :) I think last time k8s workers auto-recovered after a clouddumps reboot
[15:49:30] but if they don't, we have to reboot them anyway
[15:49:51] let's save all that for tomorrow when we can get an earlier start
[15:50:01] past experiences were recorded in T391369
[15:50:01] T391369: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge - https://phabricator.wikimedia.org/T391369
[15:51:14] OT, we have a few more questions about the azwikimedia/mailcow project that I'm not sure about: https://phabricator.wikimedia.org/T419582#11704348
[15:53:47] I think they're right about the split but they could also just do everything via the floating IP and deal with tls themselves...
[15:54:08] And I think they'll need to figure out redirection themselves regardless.
[15:56:59] what about the PTR record?
[15:58:05] That's something we can create in designate, they should open a subticket
[15:58:20] thanks I will reply in the task
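A minimal sketch of what creating that PTR record in designate could look like from the OpenStack CLI; the reverse zone name, record name, and target hostname below are hypothetical placeholders, not values from the task:

```bash
# Hypothetical example: add a PTR record in a designate-managed reverse zone.
# The zone (56.15.185.in-addr.arpa.), the record name (the floating IP's last
# octet) and the target FQDN are placeholders for illustration only.
openstack recordset create 56.15.185.in-addr.arpa. 10 \
    --type PTR \
    --record mail.azwikimedia.example.
# designate also exposes a floating-IP PTR API ("openstack ptr record set"),
# which may or may not be enabled in this deployment.
```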
[16:10:38] all the "web" clouddb* are now rebooted. I will wait a few days before rebooting the "analytics" ones because I'm slightly worried about a repeat of T420177
[16:10:39] T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177
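A rough sketch of the kind of pre-reboot check one might run on the "analytics" clouddb hosts, assuming standard MariaDB tooling and root socket auth; this is generic usage, not a documented procedure:

```bash
# Confirm which mariadb patch release is running (10.11.16 is the suspect
# version from T420177) and that replication is healthy and caught up
# before rebooting the next replica.
sudo mysql -e "SELECT @@version;"
# grep is case-insensitive because MariaDB still reports Slave_* column names
# even when queried via SHOW REPLICA STATUS.
sudo mysql -e "SHOW REPLICA STATUS\G" | grep -iE 'running|seconds_behind'
```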