[06:18:24] 10DBA, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Marostegui) I have merged both changes after the review from @Krinkle (thanks!). Let's see how it goes
[08:17:07] marostegui: I'm not sure what https://phabricator.wikimedia.org/P8014 is that jynus poked me toward, but I can't see it :P
[08:17:27] let me add you
[08:17:38] thanks!
[08:17:44] done
[08:17:49] ty
[08:21:10] on https://phabricator.wikimedia.org/T214402 is there any way to actually see which servers are being waited for? without looking into it I thought something stupid was happening, like waiting for a gtid from one cluster on another cluster
[08:21:44] with that trace, no :(
[08:23:39] i could probably make it spit out some more details
[08:24:05] that might help, although probably there is nothing we can do
[08:24:13] but we could check if there is something specific going on with that server
[08:24:52] looking at waitForReplication I could also specify some options, domain and cluster etc, and see if that fixes the issue
[08:24:58] i'll have a little dig shortly
[08:25:15] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[08:25:16] :)
[08:53:32] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10Patch-For-Review, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui)
[08:55:24] so the compressed snapshot went to 405621235772/1106656852789 = 37% of the original size
[08:55:42] that is for s1, compressed but with a lot of replication
[08:55:54] not sure I understand what you mean
[08:56:16] we end up with a 378G tar.gz
[08:56:25] from a 1.1TB directory
[08:56:33] and what did you mean with replication?
[08:56:47] writes that create fragmentation
[08:56:51] aaaah
[08:56:52] right :)
[08:57:12] and make innodb compression less effective than just after an alter
[08:57:14] and that is innodb compressed, no?
[08:57:18] yeah
[08:57:43] although there could be some uncompressed tables
[08:57:56] yeah, I am finding those when transferring stuff to the new dbstores
[08:58:05] so probably the same is happening on the backup sources
[09:00:41] I think I will try some other instance, as this also informs improvements for the script
[09:01:31] if you want to check s4, I just finished a full compression and it is: 978G
[09:02:11] I need some host with triple the space, though
[09:02:39] dbstore1005?
[09:03:04] although that would have the same space as db1118, 4.4T
[09:06:58] I started s6 at 09:06 UTC, piping and compressing at the same time
[09:07:48] to pipe an archive with mariabackup, we will not be able to use tar, only xbstream
[09:08:26] preprocess somewhere else, then tar and compress again
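A minimal sketch of the piping described above, assuming MariaDB's mariabackup and mbstream are the tools in use; the socket path, section name and directories are hypothetical:

    # Stream the backup with xbstream (mariabackup cannot stream tar) and
    # compress on the fly; --slave-info records the replication coordinates.
    mariabackup --backup --slave-info \
        --socket=/run/mysqld/mysqld.s6.sock \
        --stream=xbstream 2>backup.log \
      | gzip -c > /srv/backups/s6.xbstream.gz

    # "Preprocess somewhere else": unpack the stream, prepare it, and only
    # then re-archive it as a plain tar.gz.
    mkdir -p /srv/sqldata.s6
    gzip -dc /srv/backups/s6.xbstream.gz | mbstream -x -C /srv/sqldata.s6
    mariabackup --prepare --target-dir=/srv/sqldata.s6
    tar -czf /srv/backups/s6.tar.gz -C /srv/sqldata.s6 .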
[09:09:27] did you set up s1 or s6 already?
[09:09:45] s1
[09:09:56] s6 not yet, but it will not be done this week
[09:10:06] well, I am proposing to do it now
[09:10:28] sure, give me a few minutes, I need to set up the server
[09:10:30] although knowing it may be corrupt/will not work/etc
[09:11:10] the backup will take one hour probably
[09:11:13] if you want to do s2 or s3, those can be done right now
[09:11:35] if you prefer s6, give me like 15 mins or so
[09:12:01] x1?
[09:12:15] haha, that is also on dbstore1005
[09:12:19] so I need to set it up first
[09:12:28] let me set it up
[09:12:51] we can do s2
[09:12:59] just not on dbstore1001
[09:13:00] that is dbstore1004 and that is fully ready
[09:13:09] so you can use it
[09:13:21] s4 is already up and running there
[09:13:28] but s2 and s3 are empty, so you can do it
[09:19:16] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10Patch-For-Review, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui)
[09:31:38] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[09:35:33] s6 466GiB 0:28:41
[09:36:01] 277MiB/s, which is doable on 1G
[09:36:21] yeah
[09:36:31] final compressed size 191G
[09:36:55] from the original 499G
[09:37:39] that's nice
[09:38:15] 38% of the original size
[09:39:41] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10Patch-For-Review, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui) s5 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1003 [] dbstore1002 [] db1124 [] db1113...
[09:39:53] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10Patch-For-Review, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui)
[09:50:57] the total compressed right now will be 3TB, do we need the same for misc?
[09:51:23] what do you mean for misc?
[09:51:28] 3TB for core?
[09:51:30] snapshots of misc
[09:51:39] yes, s1-s8, x1
[09:51:55] I would say we should also get snapshots for misc; whether with the same schedule, that is a different question
[09:52:05] But maybe a once-a-week snapshot (like the logical one) would be nice
[09:52:05] no?
[09:52:31] idk
[09:53:06] We do not really provision misc hosts, but it would be nice to have it, to fully recover a host faster if needed
[09:53:14] at least once a week
[10:27:06] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) @Cmjohnson you've got any rough ETA for these? Thanks!
[10:40:12] So I think I am done calculating space requirements, do you want me to provision one of those dbstores?
[10:40:37] sure, you can try s2 or s3 on dbstore1004
[10:40:47] ok
[10:41:34] everything is in place, you just need to transfer the data :)
[11:32:04] preparing 4GB of log is not fast
[12:12:20] Ping regarding this patch: https://gerrit.wikimedia.org/r/c/operations/software/tendril/+/479741
[12:20:39] would it be possible to deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/466833/ in the next weeks? those systems are up for decom, but we should drop the grant first to prevent later reuse of the IP
[12:22:58] moritzm: yeah, these days are calmer
[12:23:27] no specific hurry, sometime in Feb would also work fine
[12:24:07] I will +1 it (let me double check that is not an encarta logo)
[12:24:25] and on another note, how are your javascript/frontend skills?
[12:25:01] moritzm: it doesn't depend on us as much as on "is something on fire?"
[12:27:28] sure, fully understood. I mostly pinged this when going through my list of Gerrit patches
[12:28:33] and please continue doing it
[12:29:56] it is the usual thing of "doing it takes time or risk (or both), while not doing it isn't high priority", so it always ends up at the bottom of the pile
[12:30:16] ack, sure, makes sense
[12:30:18] if it were low priority and low risk it would have been done sooner
[12:30:37] also I think it took a bit for us to migrate away
[12:31:08] as there are (or were) long ongoing tasks on the old hosts
[12:31:33] I wonder if we should take the chance to do a password update
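As a rough illustration of the grant cleanup requested above (dropping host-scoped grants before decommissioning, so a later reuse of the IP inherits nothing), assuming a MariaDB shell on the affected master; the account and address below are invented:

    # List accounts tied to the decommissioned host's address.
    mysql -e "SELECT user, host FROM mysql.user WHERE host = '10.64.0.123';"
    # DROP USER also revokes its privileges, so nothing survives the decom.
    mysql -e "DROP USER 'old_service'@'10.64.0.123';"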
[12:34:37] Amir1: do you want me to deploy it? (it requires manual deployment)
[12:35:16] I think I will just do it
[12:57:03] jynus: As you wish, there is no rush
[12:58:02] it is up on https://tendril.wikimedia.org/
[12:58:11] (if you force a cache reload)
[12:58:39] jynus: Nice, thank you!
[13:21:25] 80m29.917s to prepare the s1 backup :-/
[13:21:46] although to be fair, with no optimization, and dbstore1001 with HDs
[13:21:55] more memory == faster recovery
[13:24:21] and of course the backup is now useless because I forgot "--slave-info"
[13:24:52] and there are no binary logs on dbstore1001 instances
[13:25:09] haha
[13:25:11] well
[13:25:23] 80min isn't that super bad
[13:25:27] we fail in testing so we do not fail in production
[13:25:29] keeping in mind dbstore1001
[13:25:57] I can do better
[13:26:18] remember I created the best documentation and training for xtrabackup
[13:26:27] hahaha
[13:26:29] indeed!
[13:26:37] to the point people are still using it for delivering talks
[13:44:14] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[14:06:58] jynus: Can I populate dbstore1004:s3?
[14:07:02] I won't touch s2
[14:07:42] mmm
[14:07:44] don't like
[14:07:51] you could
[14:07:57] I am fine waiting :)
[14:08:11] but unless it is blocking you, I would prefer to wait
[14:08:18] no problem :)
[14:08:43] every minute we spend on doing it right will make things simpler later
[14:08:56] don't worry, I will focus on dbstore1002 and dbstore1005
[14:09:18] and I don't want both of us accidentally writing to the same port or whatever
[14:09:30] don't worry, all cool
[14:22:04] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[14:45:11] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) Not until after the all hands. I will move it up on the list.
[14:46:02] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) Thank you!
[14:59:41] 780GiB 0:54:03 [ 246MiB/s] for s2
[14:59:48] nice
[15:00:53] don't declare success yet :-)
[15:22:15] prepare with 10GB only took 2 minutes on s2
[15:22:21] on dbstore1004
[15:22:40] nice!
[15:28:54] there is something strange, though: the binlog positions it gave me are db1090-bin.003223 181888686
[15:29:09] while it copied from db1095 and its master is db1066
[15:29:30] :|
[15:29:40] maybe gtid_domain_id messing things up?
[15:29:44] db1090 is definitely on db1090
[15:29:46] I have xtrabackup_slave_info, but it is gtid
[15:30:24] there are some leftover binlogs on db1095 from when it was copied from db1090
[15:31:08] I can make it work with gtid, but it is not as straightforward
[15:32:11] so I am guessing we should disable gtid when taking a backup?
[15:33:01] yeah, that can be a good idea
[15:33:56] or we could clean up leftover gtids to make gtid work properly
[15:34:43] I would go for disabling gtid first and see if that works
[15:39:59] gtid works too, just make sure to double check: gtid == transaction committed, binlog == gap between transactions
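For context on the coordinates being puzzled over, a sketch of the metadata files a mariabackup run with --slave-info leaves behind on a GTID-enabled MariaDB replica; the paths and GTID values shown are invented:

    # xtrabackup_binlog_info holds the backed-up server's *own* binlog
    # coordinates, which is why leftover db1090-bin files on db1095 show up:
    cat /srv/sqldata.s2/xtrabackup_binlog_info
    # db1090-bin.003223    181888686    0-171970637-123456

    # xtrabackup_slave_info holds how to repoint the restored copy at the
    # master; with GTID enabled it is written in GTID terms:
    cat /srv/sqldata.s2/xtrabackup_slave_info
    # SET GLOBAL gtid_slave_pos = '0-171970637-123456';
    # CHANGE MASTER TO master_use_gtid = slave_pos;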
[15:54:03] I was checking dbstores and tendril, should I add s1?
[15:54:37] no, I was waiting for the compression to finish
[15:54:42] cool
[15:54:44] so it doesn't show as "NO" all the time
[15:54:46] :)
[15:56:06] change_tag_def and ipblocks_restrictions seem to be new tables that require compression
[15:56:28] although they are probably so small it is not a huge deal
[15:56:37] only those?
[15:56:45] on s2 at least
[15:56:54] select * FROM information_schema.tables WHERE table_schema like '%wik%' and ENGINE='InnoDB' AND row_format <> 'Compressed';
[15:56:57] on s1 there were more
[15:59:22] jynus: can you also add dbstore1004:3312 to zarcillo?
[15:59:53] see log
[16:00:02] I asked because I was doing that
[16:00:12] yeah, I am asking as I only saw tendril
[16:00:24] just making sure
[16:00:26] ok, I didn't say it on log because technically
[16:00:28] as I sometimes forget :)
[16:00:34] it is not an official project
[16:00:38] i know
[16:01:24] we should make the puppet yamls depend on zarcillo already
[16:02:07] I am getting offline - see you tomorrow!
[16:02:10] have a good evening
[16:02:16] bye!
[16:02:40] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[19:56:19] 10DBA, 10Phabricator, 10Documentation, 10Release-Engineering-Team (Kanban), 10User-MModell: Prepare a disaster recovery plan for failing over from phab1001 to phab2001 (or phab2001 to 1001) - https://phabricator.wikimedia.org/T190572 (10mmodell)
[20:41:19] 10DBA, 10MediaWiki-Database, 10Core Platform Team Backlog (Watching / External), 10Performance-Team (Radar), 10Wikimedia-Incident: Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real mast... - https://phabricator.wikimedia.org/T172497
[20:59:16] 10DBA, 10Jade, 10Operations, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Milimetric) This wikitext-in-JSON thing seems really complicated. I read through both comments above and walked away with a mu...
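To close the loop on the uncompressed tables surfaced by the information_schema query at [15:56:54], a hedged sketch of the usual fix, rebuilding a reported table with InnoDB compression; the database name and key block size are only illustrative:

    # change_tag_def was one of the new tables reported above; rebuild it
    # compressed like the rest of the schema.
    mysql -e "ALTER TABLE enwiki.change_tag_def ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"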