[06:10:24] got to admit that ~10min to do a switchover from task creation to task resolution is quite cool <3
[06:28:07] I think I triggered something close to the previous problematic situation on codfw/s3
[06:30:05] not 100% sure though
[06:33:15] I'll keep the codfw/s3 topology at an intermediate state (cc Amir1 jynus), let's talk about it a bit when you're around. The db2190 move triggered the issue; I fixed it by disabling semi_sync on it (it was not moved under db2205) and doing the same thing on db2205, which is the candidate master
[06:36:55] jynus: this time I backed up the diff: T374421 :p
[06:36:55] T374421: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T374421
[06:37:12] it was done yesterday but was not different this morning
[08:16:31] rpl_semi_sync_master_ is off on the new one
[08:16:43] it shouldn't be off, that probably caused issues
[08:17:15] or, it should be off for clients too, and then enabled
[08:17:34] ack, will disable it for the replication clients then
[08:17:34] is db2205 stable? is it replicating?
[08:17:40] it is stable/replicating indeed
[08:17:57] and hasn't been killed, right?
[08:18:05] I cloned it fresh yesterday, it has been running properly since
[08:18:29] ok, so the issue clearly is the semisync stuff
[08:18:59] you should disable it on db2205 (as a client)
[08:18:59] it looks like it indeed, but I wasn't feeling confident enough to handle it solo :D
[08:19:17] rpl_semi_sync_slave_en <-- this should be off
[08:19:23] on the new one
[08:19:40] so the idea of checking is precisely to make it equal
[08:19:41] @@rpl_semi_sync_slave_enabled: 0
[08:19:53] on both then?
[08:20:15] in reality I think it should be off for dc replication
[08:20:30] so both, indeed
[08:20:42] intra-dc replication, but that is something to decide with manuel
[08:20:50] ah
[08:20:55] this I won't tinker with
[08:21:08] I'm only disabling it when it's causing trouble
[08:21:12] but I believe that, as a rule
[08:21:21] the workflow should be this:
[08:21:36] apply new puppet role -> restart
[08:21:41] and then run the procedure
[08:21:50] that way they should have a closer config
[08:22:00] or if not restart, apply it manually
[08:22:13] I believe this is what happens already
[08:22:45] I'll check how this case differs
[08:22:48] well, but something must be failing, either on puppet or on restart, because that option should be the same
[08:22:59] yeah clearly
[08:23:26] e.g. maybe for the latest version something is slightly different, etc.
[08:23:38] for now, disable it
[08:23:53] enable it as a master so clients can use it
[08:24:22] on the clients it already has below it, make sure semisync is working
[08:24:30] move pending clients
[08:24:40] and we should be ok, unless it explodes again
[08:24:53] I'll check/try, thanks for the hint
[08:25:14] I think it must be that
[08:25:27] as it is show replica status that fails, not show master status
[08:26:15] then we should open a task to test what the trigger is and document/automate it to avoid it
[08:27:04] will backlink the rest of this to it as well, indeed
[08:55:39] this is new: arnaudb@cumin1002:dbtools $ sudo dbctl --scope codfw section s3 set-master db2205
[08:55:39] Execution FAILED
[08:55:39] Reported errors:
[08:55:39] Section s3 has no master
[08:56:43] db2209 (former replication source) has the proper config in puppet, and so does db2205
[08:57:21] is it in dbctl? what does dbctl say about db2205? is it pooled?
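A rough sketch of the semi-sync fix discussed above, assuming the built-in MariaDB semi-sync variables named in the conversation and that db2205 is the candidate master; the exact order of operations and the need for an IO-thread restart are assumptions, not what was actually run:

    # On the candidate master: check the current semi-sync settings.
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_%_enabled'"

    # Disable the client (replica) side and enable the master side,
    # so its own replicas can use semi-sync against it.
    sudo mysql -e "SET GLOBAL rpl_semi_sync_slave_enabled = OFF"
    sudo mysql -e "SET GLOBAL rpl_semi_sync_master_enabled = ON"
    # (the slave-side change may only take effect after restarting the replica IO thread)

    # Confirm replication is healthy; per the discussion it is
    # SHOW REPLICA STATUS that was failing, not SHOW MASTER STATUS.
    sudo mysql -e "SHOW REPLICA STATUS\G" | grep -iE 'running|error'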
[08:58:29] ah
[08:58:32] it wasn't pooled
[08:59:20] also check heartbeat
[08:59:35] orchestrator is saying there are 4 minutes of lag on all of codfw
[08:59:43] maybe it is an orch glitch
[08:59:48] no no, it's normal
[08:59:53] https://phabricator.wikimedia.org/T374421
[08:59:54] ok
[09:00:03] I'll get to the pt-heartbeat part soon
[09:07:29] s8 backup took 30 hours
[09:07:41] very weird behaviour
[09:34:22] we have this new alert set that is being merged: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1053689
[09:35:48] remember to move db2139 when s3 backups finish
[09:39:48] * arnaudb creates a task to avoid forgetting
[09:39:50] thanks for the reminder
[10:51:36] I'm booting up
[11:18:12] db1171 has been lagging for a few days now
[11:21:00] s8 needs a larger buffer pool
[11:21:23] will update the config and restart it when the dump finishes
[11:31:07] Amir1: will merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072515 ok?
[11:31:29] one thing
[11:31:50] let me know
[11:32:35] yeah, it's slowly recovering, it was because of the schema change
[11:32:49] oh, was there a schema change?
[11:32:57] yeah
[11:33:00] then that would explain it
[11:33:02] https://phabricator.wikimedia.org/T367856
[11:33:07] a massive one
[11:33:11] still, I got a warning for "low buffer pool"
[11:33:14] the alter takes 48 hours
[11:33:18] and given we have the resources
[11:33:35] I want to add those, s8 needs way more than s7
[11:33:49] there was a lot of unallocated memory on that host
[11:33:56] oh sure thing
[11:34:12] can I copy your comment to the ticket?
[11:34:30] as that is the main explanation (I was worried there was something wrong with the host)
[11:34:46] I think I commented there already :D
[11:34:47] but I wanted to do the patch regardless of the explanation
[11:34:49] thanks!
[11:35:01] as it was a mitigation that would work no matter what
[11:35:28] right now s8 dbs hardly fit on multi-instance hosts
[11:37:04] So now I understand the need to remove those from mw core
[11:38:16] the restart will also help after the poisoning of the buffer pool by the schema change
[11:41:51] I'm working on getting s8 into a better shape https://www.wikidata.org/wiki/User:ASarabadani_(WMF)/Growth_of_databases_of_Wikidata
[11:42:02] nah, no worries
[11:42:10] we work with what we have
[11:44:25] I will restart db1171:s7 now, and s8 on the same host in ~1h, when the dumps there finish
[11:44:36] to apply the buffer pool change
[11:45:22] sounds good to me
[12:47:10] I called dibs on the old s3 codfw master
[12:53:59] ack, cancelling the --check, lmk when I can grab it :)
[13:09:18] sigh, nah, replication to its child is broken
[13:10:05] backup source. I'll fix it
[13:18:47] arnaudb: I suggest not doing the s3 old master yet, first db2139:3313 should move out
[13:19:13] (let me fix the mess I made because of it). It'll finish soon
[13:19:35] ack, I'll leave s3 aside until your green light
[13:45:28] db2139:3313 is normal now, but we shouldn't touch db2209 until db2139:3313 is moved out of db2209
[13:46:25] I'll move it behind db2205
[13:57:39] thanks
[14:08:18] done Amir1 !
[14:08:30] Thanks!
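A minimal sketch of the dbctl sequence behind the "Section s3 has no master" failure above, assuming the instance only needed to be pooled before the promotion; subcommand spellings are from memory of the dbctl CLI and may not match the installed version exactly:

    # See what dbctl currently knows about the instance.
    sudo dbctl instance db2205 get

    # Pool it, then retry the promotion that failed earlier.
    sudo dbctl instance db2205 pool
    sudo dbctl --scope codfw section s3 set-master db2205

    # Review and commit the resulting configuration change.
    sudo dbctl config diff
    sudo dbctl config commit -m "Promote db2205 to s3 master in codfw"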
[14:09:23] they are doing the alert migration atm
[14:09:48] so consider slowing down maintenance until they finish, unsure what the impact on observability could be
[14:10:57] ack, thanks for the reminder
[15:04:38] Last snapshot for s8 at eqiad (db1171) taken on 2024-09-12 13:13:24 is 1467 GiB, but the previous one was 1546 GiB, a change of -5.1 %
[15:42:47] jynus: hey, just catching up with your comments on the task
[15:43:04] the request is for us to do ms-backup2002 and backup2011 asap, is it?
[15:43:17] that should be fine, we can possibly even do them now before the official start
[15:43:18] not in a rush
[15:43:24] ok
[15:43:49] just don't keep me until late, so I can pool them back and finish my day :-D
[15:44:21] urandom: are you able to depool ms-fe2012, moss-fe2002 & thanos-fe2003?
[15:45:47] yup
[15:45:54] super, thanks!
[15:46:37] {{done}}
[16:00:19] thanks :)
[16:00:53] jynus: ms-backup2002 and backup2011 have been moved, so you can start the backups on them again
[16:00:57] thanks for the help!
[16:05:01] topranks: thanks to you for accommodating! So great!
[16:05:23] about to restart backups
[16:18:02] urandom, arnaudb: the rest of those are done if you want to repool things
[16:18:04] thanks :)
[16:18:11] 🎉
[16:18:14] thanks topranks
[16:19:06] topranks: awesome; thanks!
[16:20:01] arnaudb: actually I forgot db2209
[16:20:07] can we still do it?
[16:20:12] next round?
[16:20:19] I was on my way out :D
[16:20:28] sure, yeah, let's do it Tuesday
[16:20:38] topranks: he had repooled it
[16:20:45] depooling could take some time
[16:20:48] all good, it was my mistake
[16:20:51] yeah, there is no problem
[16:20:53] thanks guys
[22:11:25] FIRING: SystemdUnitFailed: systemd-timedated.service on backup2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:31:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on backup2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
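For the depool/repool requests handled during the afternoon, a minimal conftool sketch, assuming the frontends are LVS-pooled services managed via confctl and using ms-fe2012 as the example host; the selector syntax is an assumption, and many hosts also ship pool/depool wrapper scripts that do the equivalent:

    # Check the current state, then depool ahead of the maintenance window.
    sudo confctl select 'name=ms-fe2012.codfw.wmnet' get
    sudo confctl select 'name=ms-fe2012.codfw.wmnet' set/pooled=no

    # ...maintenance happens...

    # Repool once the work is done.
    sudo confctl select 'name=ms-fe2012.codfw.wmnet' set/pooled=yes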