[05:13:03] 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Marostegui) Thanks @Papaul for troubleshooting this
[05:34:06] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui) I have checked that all the hosts have been installed correctly
[05:36:40] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui)
[05:37:48] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui) 05Open→03Resolved I have changed the status to Active on netbox. Will close this task and will create new one for productioni...
[05:40:15] 10DBA, 10Goal: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[05:40:26] 10DBA, 10Goal: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui) p:05Triage→03Normal
[05:42:05] 10DBA, 10Goal: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[05:59:20] 10DBA, 10Goal: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[07:17:05] 10DBA, 10Goal, 10Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui)
[07:31:37] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[07:40:39] compression process is almost finished, only codfw s3 is ongoing
[07:41:01] great, can I take s1 codfw source then?
[07:41:16] yes
[07:41:21] \o/
[07:41:30] did the others work?
[07:41:35] x1 yep!
[07:41:45] any issue? too long?
[07:42:19] I didn't measure, it was obviously longer than a normal transfer
[07:42:22] I will measure s1
[07:43:06] it shouldn't be, at least for things that are not s1, s8 and s4
[07:43:20] (those have large tables that hold the queue for longer)
[07:44:04] all logical backups seem done in less than 5 hours
[07:47:45] so I am taking db2102 then
[07:47:52] thanks
[07:48:11] I would like to try something (the good method) for s6 and m5
[07:48:26] cool
[07:48:46] feel free to take s6 on codfw
[07:48:55] https://phabricator.wikimedia.org/T222772
[07:49:10] ok, I need to do some merges first
[07:49:14] sure, no worries
[07:49:17] will ping when ready
[07:49:21] thanks
[08:26:11] so backing up s1: 1h28m, backing up m2: 4h43m
[08:26:17] https://phabricator.wikimedia.org/P8490
[08:26:38] cool
[08:26:46] I will report back :)
[08:30:30] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (10Marostegui) 05Open→03Resolved This host has been fully repooled Thanks @Cmjohnson for replacing the BBU
[08:45:58] 10DBA, 10Goal, 10Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui)
[08:52:10] this is what the snapshots take, including compression https://phabricator.wikimedia.org/P8490#50812
[08:53:50] should I remove the dump user from the x1 hosts then?
[08:53:58] I don't think it makes sense to have them there, right?
[08:54:01] the production ones
[08:54:04] the production one, you can
[08:54:08] cool
[08:54:17] there should be 2 accounts per host
[08:54:22] yep, that's it
[08:54:39] also double check the rest of the accounts
[08:55:05] yeah, the rest look good so far
[09:00:24] so the metadata file said the dump finished, but I caught some errors due to ongoing compressions
[09:01:11] btw I will enable gtid on db2102 (source backup) once done, or is it disabled for some specific reason?
[09:01:28] I may have missed it
[09:01:41] no problem, just asking in case it was OFF for a reason
[09:01:44] I will enable it once done
[09:08:48] the transfer took 1h, so that is good :)
[09:08:50] now preparing
[09:09:15] so I was thinking of not creating snapshots the day dumps are created
[09:09:55] to avoid them at the same time?
[09:10:00] but I am not sure I want to skip them on the night from Tuesday to Wednesday
[09:10:50] yes
[09:10:57] to your question
[09:11:07] 190508 09:10:50 completed OK!
[09:11:07] real 2m19.069s
[09:11:16] the prepare ^
[09:11:27] yeah, most of the time for prepare is if they have writes at the same time
[09:11:46] because they may have 2 hours of changes to replay/UNDO
[09:11:56] plus compression, which takes some time on the HDs
[09:12:40] the whole thing should take 30 minutes if they are pre-prepared and pre-compressed
[09:12:50] (including transfer time)
[09:13:09] vs 50 hours for logical backups
[09:13:36] for example snapshot.s6.2019-05-07--20-00-02.tar.gz is 210 GB
[09:14:04] so the compression is worth it because it is ~4 times less than unzipped
[09:15:05] and no cpu is spent compressing on the source
[09:35:43] 10DBA, 10Goal, 10Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui)
[09:36:36] db2102 is stopped, do you know why?
[09:36:51] yeah, I am doing xtrabackup from it
[09:36:54] with replication stopped
[09:36:55] ah
[09:37:06] You said it was ok to take it :)
[09:37:11] I thought you were going to use the source one
[09:37:15] it is ok
[09:37:24] oh
[09:37:28] isn't that one the source?
[09:37:29] I just thought you were going to use db2097
[09:37:37] that is the test one
[09:37:40] aaaah
[09:37:41] my bad
[09:37:44] but both are ok
[09:38:08] sorry, my bad :)
[09:38:10] btw: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508789/
[09:38:59] thanks :*
[09:39:00] just so you know that
[09:39:09] no accounts are set up on the test host
[09:39:15] so you will have to add those manually
[09:39:17] yep
[09:39:22] I noticed
[09:39:25] the source ones
[09:39:29] have all + dumps
[09:39:55] this one may have only the common ones + one for root for import
[09:40:08] but none for mw, etc
[09:40:44] it wouldn't hurt to do a double check on every host before pooling it for the first time
[09:40:54] plus pooling it with low load initially
[09:41:04] yeah, I am doing that exactly
[09:41:09] checking + pooling with 1
[09:41:11] plus the events
[09:41:18] which will not be on the test one
[09:41:22] also on the checking list
[09:42:07] I am going to switch to the source (db2097:3311) for the next host
[09:42:32] you can use the same one, it was just FYI
[09:42:44] it is easier (and safer to use the source) :)
[09:43:00] it won't be faster
[09:43:51] yeah, not a problem
[09:44:38] did you see: https://support.microsoft.com/en-us/help/4499612/intel-ssd-drives-unresponsive-after-1700-idle-hours ?
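The pre-pooling check mentioned above ("double check on every host before pooling it for the first time", "plus the events") boils down to a couple of queries. A minimal sketch follows; the queries are generic, since the expected account list and events are not spelled out in the log, so the output should be compared against an already-pooled replica of the same section:

    # Run locally on the new replica before pooling it for the first time.
    # Accounts: compare against an already-pooled host of the same section.
    mysql -e "SELECT user, host FROM mysql.user ORDER BY user, host"
    # Events: these will be missing if the host was cloned from the test source.
    mysql -e "SELECT event_schema, event_name, status FROM information_schema.events"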
[09:48:21] eeeee
[09:48:22] scary
[09:48:28] that's 70 days
[10:40:14] I am going to deploy the latest version of transfer.py
[10:40:46] https://gerrit.wikimedia.org/r/#/c/508801/ this?
[10:41:29] yes
[10:41:52] cool! I can try the new option too once deployed
[10:42:15] one thing to notice is that
[10:42:34] backups are created with ./sqldata
[10:42:51] but transfer cannot overwrite existing files for security
[10:43:06] so I extract the parent subdir
[10:43:14] so the location should be /srv/sqldata
[10:43:30] as if there was no subdir
[10:44:03] it will make sense when done, and either if there is a mistake it won't be a problem, or it will abort early
[10:44:04] so what does that imply? do we have to change the destination dir?
[10:44:10] ok :)
[10:44:11] no, it is handled
[10:44:15] ah ok
[10:44:18] just that if you were to do it manually
[10:44:26] it would be a bit different
[10:45:18] in other words, --decompress assumes it is a backup
[10:45:31] it may or may not work for a generic tar.gz
[10:46:10] it is made to work for backups for now
[10:46:33] because otherwise it would be dangerous
[10:46:58] e.g. decompresses sqldata.s7 and overwrites it without intending it
[10:47:08] when you actually wanted to write sqldata
[10:48:31] yeah, indeed
[10:48:40] it will all make sense in the end
[10:48:47] but can I use --type=xtrabackup --stop-replication for this provisioning?
[10:49:10] it is --stop-slave
[10:49:17] but yes, when I deploy
[10:49:51] the whole stop slave also requires discussion
[10:49:57] because what if it is stopped already?
[10:50:01] it restarts it
[10:50:13] what if there are 2 backups going on at the same time?
[10:50:19] how do we get the binlog coordinates?
[10:50:32] yeah, I didn't check the exact option, I assumed you'd know what I mean ;)
[10:50:34] again, I can only do one thing at a time :-D
[10:54:32] for now I am more interested in testing --type decompress
[10:54:41] which should be much faster for the existing snapshots
[11:18:42] I am running decompress on db2117 (not sure it will work without the role)
[12:06:21] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10jcrespo) ` root@cumin2001:~$ time transfer.py --no-checksum --no-encrypt --type=decompress dbprov2001.codfw.wmnet:/srv/backups/snapshots/latest/snapshot.s6.20...
[12:30:19] make sure to update the cheatsheet once you have seen the command works as expected :)
[12:43:51] it finished but I cannot do more until the host is configured
[12:47:05] :(
[12:47:23] /opt/wmf-mariadb101/bin/mariabackup: Error writing file 'UNKNOWN' (Errcode: 32 "Broken pipe")
[12:47:28] ?
[12:47:51] on the transfer I was doing
[12:48:03] in the middle?
[12:48:09] that is weird
[12:48:12] at the end
[12:48:35] https://phabricator.wikimedia.org/P8492
[12:48:36] the ongoing executable is not updated
[12:49:02] I will start it again
[12:49:05] Broken pipe means a connection was killed
[12:49:13] maybe a network blip?
[12:49:30] but "Error writing file 'UNKNOWN'" looks like a mariabackup bug?
[12:50:11] Failed to copy file ./ops/db32_query_review_history.MYD
[12:50:17] could be a myisam issue
[12:50:19] that file is 1.1G
[12:50:27] I would suggest to delete it
[12:50:28] I am going to drop those
[12:50:29] yeah
[12:50:31] yeah
[12:50:46] the whole table
[12:50:59] yeah, going to drop all those old tables
[12:51:48] db2102 was just loaded on the new version
[12:52:02] those others may have obscure problems
[12:52:31] haha
[12:52:58] that is why pre-preparing and just decompressing will solve most of our problems
[12:53:17] it is also much faster, it writes at 240MB/s
[12:53:33] nice
[12:53:57] took only 30 minutes to transfer s6
[12:54:05] oooooh
[12:54:08] that is a big win
[12:54:49] I don't know what the plan is with db2117
[12:55:00] you can say
[12:55:08] I will go away for lunch for now
[12:55:32] you can fully provision db2117 into s6 if you want
[12:55:37] including the role and all that
[12:55:42] that'd be great :)
[12:55:48] if you can
[13:01:58] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10Marostegui) >>! In T206203#5166921, @jcrespo wrote: > ` > root@cumin2001:~$ time transfer.py --no-checksum --no-encrypt --type=decompress dbprov2001.codfw.wmn...
[13:28:52] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10jcrespo) > To set up replication on the destination, questions: does the metadata file contain only GTID coordinates so we have to do the "translation" lookin...
[13:32:39] 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) Hey Paul Your dispatch number is 707804261. Once I get a waybill number I will send that over to you. Thank you
[13:50:48] 10DBA, 10Operations, 10ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Reedy)
[15:12:53] 10DBA, 10Operations, 10ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10jcrespo) Thanks.
[16:25:04] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10jcrespo) Executed as I mentioned above: ` SET GLOBAL gtid_slave_pos = '0-180359184-3070029963,171970705-171970705-239075862,171974883-171974883-1239749870,17...
[16:25:27] ^ marostegui backups work (a different thing is having 100% automation)
[16:26:10] yaaaay
[16:26:12] good job
[16:26:16] :)
[16:26:21] I know gtid is scary
[16:26:28] but in this case it should work
[16:26:40] because the problem we have is the "contamination"
[16:26:53] but it should work as good or as bad as the original host
[16:27:15] the source host may benefit from some cleanup
[17:37:46] marostegui: moritzm, I am not going to upload mariadb 10.1.40 because see the release notes - it is a packaging bug that doesn't affect us https://mariadb.com/kb/en/library/mariadb-10140-release-notes/
[17:41:53] already .40? 39 just went out!
[17:41:56] no?
[17:47:04] ack!
[17:48:04] marostegui: it is an rpm bug only, they have increased the version just for that https://github.com/MariaDB/server/commit/101144f27956ad0ba547e8b73a24545abc69a15b#diff-30e05864e6af52293e9ee90a053bb658
[17:48:56] we should take it into account when we migrate to Red Hat 8!
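Putting together the --type=decompress transfer and the db2117 provisioning discussed above, the end-to-end sequence looks roughly like the sketch below. The flags and source path are the ones quoted in the T206203 comment, the snapshot file name is the s6 example mentioned earlier in the log, and the ownership/service-start commands on the destination are assumptions, not steps shown in the log:

    # Run from the cumin host: stream and decompress a pre-compressed snapshot
    # straight into the datadir of the new replica.
    time transfer.py --no-checksum --no-encrypt --type=decompress \
        dbprov2001.codfw.wmnet:/srv/backups/snapshots/latest/snapshot.s6.2019-05-07--20-00-02.tar.gz \
        db2117.codfw.wmnet:/srv/sqldata
    # On the destination host (assumed follow-up; unit name may differ per package):
    chown -R mysql: /srv/sqldata
    systemctl start mariadb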
[18:54:01] 10DBA, 10Operations, 10ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) Technician will be on site tomorrow between 9:30 and 10am CT
[19:11:24] 10DBA, 10Operations, 10ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Marostegui) Great news! Thanks!
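For reference, the replication setup that the T206203 comment above starts to quote ("SET GLOBAL gtid_slave_pos = ...") looks roughly like the following sketch on a MariaDB replica restored from a snapshot. The GTID position has to be taken from the backup's metadata, and the master host and replication user below are placeholders rather than production values:

    # Sketch only: run on the freshly provisioned replica once the data is in place.
    # <gtid_position_from_backup_metadata>, db2XXX and 'repl' are placeholders.
    mysql -e "SET GLOBAL gtid_slave_pos = '<gtid_position_from_backup_metadata>';
              CHANGE MASTER TO
                MASTER_HOST = 'db2XXX.codfw.wmnet',
                MASTER_USER = 'repl',
                MASTER_PASSWORD = '********',
                MASTER_USE_GTID = slave_pos;
              START SLAVE;"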