[04:44:05] DBA, Operations, ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T227107 (Marostegui) Open→Resolved All good now - thanks! ` root@db2049:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DD260) Port Name: 1I Port N...
[04:51:15] DBA, Data-Services: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui) labsdb1011 is fully done: ` root@labsdb1011:~# df -hT /srv Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/tank-data xfs 12T 6.2T 5.5T 53% /srv `
[04:51:41] DBA, Data-Services: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui)
[05:02:45] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[07:10:50] marostegui: what do you think about merging, but not deploying the wmfmariadb patches?
[07:11:13] that way it is easier to fix issues we find over a better baseline
[07:11:36] jynus: Sure, I just asked if you were able to test it yet (even on your local env)
[07:11:49] I tested on production too, but not on any core host
[07:12:07] everything I tested worked, but I did not test everything
[07:12:12] \o/
[07:12:16] because there are too many variables
[07:12:18] yeah
[07:12:49] Not sure if you saw my comment, but we have lots of hosts in codfw with mysql stopped, just waiting for dcops to decommission, where we can set up a quick replication topology and test there, if you like
[07:13:01] that is why I want to merge all patches and say test properly 9df778c4ca3774f8d372ef9
[07:13:11] yeah, but it is complicated
[07:13:16] it needs replication, etc.
[07:13:41] yeah, that's what I am saying, set up a replication chain and just maybe even pt-heartbeat to generate inserts or something
[07:13:43] just an idea
[07:13:55] I am doing that with es* hosts
[07:14:00] ah cool!
[07:14:24] move and switchover can get confusing
[07:14:37] but note switchover does more things than move
[07:14:45] yeah
[07:14:48] switchover should be used for masters only
[07:14:53] and move for replicas only
[07:14:58] yeah
[07:15:03] we need both for m2 on tuesday
[07:15:07] as we need to move replicas
[07:15:07] then semi sync is complicated
[07:15:10] and then switchover
[07:15:17] so we can use it there if you feel confident about it
[07:15:22] because maybe modules may not be loaded
[07:15:29] or may fail to unload
[07:15:53] but I tried to break it and sometimes it failed, but never broke
[07:16:00] e.g. if I set timeout too low
[07:16:08] + lag
[07:17:51] broke as in?
[07:19:10] either it refuses to run, or it runs
[07:19:25] with some things being a bit shaky, like disabling semisync
[07:19:38] which the module complains it is "in use"
[07:20:42] Ah, interesting...
[07:20:52] Maybe you can just print a message about it like with the events
[07:21:02] oh, it warns it
[07:21:28] [WARNING] Number of expected semi sync clients different
[07:21:42] and "module failed to be unloaded, will unload on shutdown"
[07:22:28] that is why I want to merge- so you can test it yourself
[07:22:34] sure
[07:22:43] I can do it for the m2 failover on tuesday
[07:22:51] Or move some replicas around in codfw now
[07:23:00] well, for now I was just thinking on es200*
[07:23:05] sure
[07:57:34] when cumin2001 comes back, I will rebase only there
[07:57:42] cool
[07:57:58] and will set up a 3 host test
[07:58:05] lovely, thank you
[07:58:26] let me know once you are done testing, so I can test too
[07:58:27] I want you to also test it because you will try things I haven't thought about
[07:58:51] and also I want you to ask questions that I may not have thought of answering
[07:58:58] sure :)
[07:59:12] just let me know once you are done testing and the hostnames so I can test :)
[07:59:25] let me also send an additional patch to fix CI for the other files
[07:59:32] cool
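The ordering discussed above for the m2 failover (move the replicas first, then switch over the master) can be sketched as a small planning helper. This is only an illustration and does not call or reproduce the wmfmariadbpy tools: `Step`, `plan_failover`, and the replica names `replica-a` and `replica-b` are hypothetical, and it assumes the remaining replicas are moved under the candidate master before the switchover. Only db1065 and db1132 come from the log (T226952).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    action: str  # "move" (replicas only) or "switchover" (masters only)
    host: str    # host the action is applied to
    target: str  # host it ends up replicating from / handing over to


def plan_failover(old_master: str, new_master: str, replicas: List[str]) -> List[Step]:
    """Return the ordered steps: every "move" first, the "switchover" last."""
    if old_master in replicas:
        # "move" is meant for replicas only, so the master must not be listed.
        raise ValueError(f"{old_master} is the master; it cannot be moved")
    if new_master not in replicas:
        raise ValueError(f"{new_master} must currently be a replica of {old_master}")

    # Assumption: all other replicas are re-pointed under the candidate master.
    steps = [Step("move", r, new_master) for r in replicas if r != new_master]
    # The switchover is applied to the master pair once the replicas are in place.
    steps.append(Step("switchover", old_master, new_master))
    return steps


if __name__ == "__main__":
    # Replica names are placeholders, not real hosts.
    for step in plan_failover("db1065", "db1132", ["db1132", "replica-a", "replica-b"]):
        print(f"{step.action}: {step.host} -> {step.target}")
```

Keeping the plan as plain data makes it easy to review the order on an etherpad before anything is run against real hosts.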
[08:00:54] jynus: cumin2001 is up and usable again
[08:01:01] thanks!
[09:21:01] DBA, OTRS, Operations, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (Marostegui) The etherpad is ready with the procedure and ready for a review. The patch is also ready for review: https://gerrit.wikimedia.org/r/#/...
[10:40:10] DBA: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062 (Marostegui)
[10:40:12] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui)
[10:49:53] https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/520724
[11:30:33] dear DBAs, do you plan to use scap later today for any reason?
[11:30:54] I can do something if it is just for testing
[11:32:05] yeah, I have to rollout a new version of scap, so it would be great to have something to make sure it works fine after having upgraded it only on the canaries for example
[11:32:09] before the full rollout
[11:32:19] please note that our usage is limited
[11:32:30] no full scap of actual code deploy
[11:32:48] the change was to remove an unused conftool integration, so my expectation is that either works or not. and was tested in beta
[11:32:59] so fairly confident :)
[11:33:25] oh, I do not doubt you, I was just saying that
[11:33:40] we will do limited testing of sync-file
[11:34:35] sure, I know, it should be enough
[11:36:40] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/520732
[11:39:29] great, scap upgraded on mw-canaries, feel free to deploy it whenever you need it
[11:41:54] are you so cheap you are not going to share a +1??
[11:42:11] lol :) I can do that
[11:42:15] I didn't actually look at the patch
[11:42:29] be careful you may spend too many of those!
[11:43:02] :D
[11:43:59] "A +1 a day keeps the outage away", they say
[11:45:23] rotfl
[11:46:59] Permission denied (publickey,keyboard-interactive)
[11:47:11] 266 apaches had sync errors
[11:47:39] moritzm: might be related to the keyholder one? ^^^
[11:47:47] deploy1001 was rebooted too today
[11:47:58] could be
[11:48:13] or maybe installing scap reloads it or something?
[11:48:17] *upgrading
[11:48:20] one of the 2
[11:48:27] I didn't upgrade scap on deploy1001 yet
[11:48:29] I did on the canaries
[11:48:30] ah
[11:48:38] let's move to -sre anyway
[13:43:10] marostegui: if you are around, I have to tell you something about wmfmariadbpy
[13:43:27] shoot
[14:39:52] DBA, Data-Services: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui)