[05:07:42] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[05:31:26] <_joe_> https://phabricator.wikimedia.org/T222224 should be interesting to you
[05:31:46] yep
[05:31:48] we are aware of it
[05:31:51] thanks :)
[05:39:05] <_joe_> we're discussing it a bit at techcom
[05:49:35] 10DBA, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (10Marostegui) @aaron @Krinkle @jcrespo I would appreciate a review of ^ - it should be a quick one...
[06:19:40] 10DBA, 10Analytics, 10Analytics-EventLogging, 10Operations, 10ops-eqiad: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10elukey) @Cmjohnson I'd need a heads up of ~15 mins before the maintenance to shut down the host properly, but we can do it anytime!
[06:56:25] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10jcrespo)
[06:59:05] will you add db2116 to the mwconfig files or do you want me to do that?
[07:05:51] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10Marostegui) Nice! I guess I have had too many nightmares with GTID, I need to start trusting it again, this will definitely help! :) Probably this needs to go...
[07:18:45] the s1 snapshot on codfw worked, the one on eqiad didn't
[07:20:26] the ones on codfw took 2 hours for s1
[07:24:07] I am about to finish compressing db2098
[07:24:52] are you using db1139? Can I downtime it?
[07:25:39] I am not using it
[07:25:40] (it could be the snapshotting script)
[07:26:34] I am using db2098:s2 but it will be released in a bit
[07:26:54] I think the copy failed there and it left the replica stopped
[07:27:05] db1139 you mean?
[07:27:14] yes
[07:28:09] do you want me to add db2117 to the mwconfig files?
[07:28:18] I need to add a few others
[07:28:19] I was going to ask you
[07:28:21] so I can include it
[07:28:28] you are leading this
[07:28:40] I am not going to do anything without your ok
[07:28:58] I will add it, I am preparing a patch for other hosts, so I will include db2117
[07:29:29] I have not done a check of data/grants/etc
[07:31:15] I can do that
[07:35:04] I was wrong, all snapshots failed today
[07:35:11] on transfer
[07:35:17] so probably a software bug
[07:37:16] db2098:3312 released, feel free to start replication once you are done with compression
[07:48:42] I am only doing compression on s3
[07:49:07] I am going to do a test backup of db1139:s1 stopping replication
[07:49:13] heh, I just saw a funny thing
[07:49:26] You remember the parsercache key names, the IPs
[07:49:39] ?
[07:49:45] '10.64.32.72' => '10.64.16.20', # pc1008, B8 4.4TB 256GB # pc2
[07:49:47] and
[07:49:52] '10.64.48.128' => '10.64.32.29', # pc1009, C3 4.4TB 256GB # pc3
[07:50:00] 72.32.64.10.in-addr.arpa domain name pointer db1133.eqiad.wmnet.
[07:50:09] 128.48.64.10.in-addr.arpa domain name pointer restbase1025-c.eqiad.wmnet.
[07:50:13] now those are real IPs
[07:51:45] https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?orgId=1&from=1556783494104&to=1557388234105&var-server=db2098&var-datasource=codfw%20prometheus%2Fops
[07:52:13] better: https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?orgId=1&from=1556783494104&to=1557388234105&var-server=db2098&var-datasource=codfw%20prometheus%2Fops&panelId=12&fullscreen
[07:52:28] oh wow
[07:54:47] A sanity check on https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/508992/ would be appreciated :)
[07:57:17] It had the same problem as you: /opt/wmf-mariadb101/bin/mariabackup: Error writing file 'UNKNOWN'
[07:57:30] there may be something else happening there
[07:57:33] interesting
[07:57:39] also myisam?
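The parsercache exchange above shows why IP-based keys go stale (the point of T210725): the key side of each mapping is an old IP that reverse DNS now resolves to an unrelated host. A minimal sketch of how those `in-addr.arpa` pointer names from the log are derived from an IPv4 address; the function name is illustrative, not from the actual codebase:

```python
def reverse_ptr(ipv4: str) -> str:
    """Build the in-addr.arpa name used for a PTR lookup of an IPv4 address."""
    octets = ipv4.split(".")
    return ".".join(reversed(octets)) + ".in-addr.arpa"

# The stale parsercache keys from the log; per the `host` output quoted above,
# these PTR names now point at db1133.eqiad.wmnet and restbase1025-c.eqiad.wmnet.
print(reverse_ptr("10.64.32.72"))
print(reverse_ptr("10.64.48.128"))
```

A live lookup would go through `socket.gethostbyaddr()`, which performs this reversal internally; the point here is only that the octets are reversed and suffixed, which is why `10.64.32.72` shows up as `72.32.64.10.in-addr.arpa`.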
[07:58:36] UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 34: invalid start byte
[07:58:51] I didn't get that
[08:03:07] thanks :)
[08:09:58] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[08:15:38] so I have broken the transfer, fixing it
[08:16:02] :)
[08:18:38] a 2-character fix, now it works
[08:18:49] haha
[08:19:05] was that the cause of the utf-8 error?
[08:19:08] https://jynus.com/gif/facepalm.gifv
[08:19:14] I will tell you
[08:19:17] A new one!!!!!!!!!!
[08:19:20] want to show you first
[08:19:44] we've been too long without a new release of gifv!
[08:20:41] 10DBA, 10Goal, 10Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui)
[08:24:43] marostegui: spot the difference: https://gerrit.wikimedia.org/r/#/c/operations/software/wmfmariadbpy/+/509001/1/wmfmariadbpy/transfer.py
[08:25:14] hahahaha
[08:25:23] was it easy to spot while debugging?
[08:25:53] yes
[08:26:06] the problem here is the lack of unit testing to prevent regressions
[08:28:44] I can create a tarball with a backup of something you may need
[08:28:59] any section I can help with?
[08:29:02] yes!
[08:29:03] s3 :)
[08:29:24] (in codfw)
[08:29:24] on both dcs?
[08:29:43] only codfw
[08:29:51] it will go to db2105 and db2109
[08:30:23] ok, let me create one
[08:40:11] Running XtraBackup at db2098.codfw.wmnet:3313 and sending it to dbprov2002.codfw.wmnet
[08:40:50] \o/
[08:42:17] when it is done, on dbprov2002:/srv/backups/snapshot/latest
[08:42:35] you will be able to use transfer.py --type=decompress
[08:42:47] great
[08:43:00] and it should be much faster as it will be precompressed and pre-prepared
[08:43:51] transfer.py --no-checksum --no-encrypt --type=decompress dbprov2002.codfw.wmnet:/srv/backups/snapshots/latest/xxxx.tar.gz db2105.codfw.wmnet:/srv/sqldata ?
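The `UnicodeDecodeError` at the top of this exchange is the classic symptom of decoding arbitrary bytes (for example, captured subprocess output) as strict UTF-8: the first invalid byte kills the whole decode. A generic sketch of the failure mode and a tolerant alternative; this is illustrative, not the actual wmfmariadbpy code or its fix:

```python
# Bytes with a stray 0x9f, the same invalid start byte reported in the log.
raw = b"backup output with a stray byte \x9f in it"

def decode_output(data: bytes) -> str:
    """Decode tool output tolerantly, replacing undecodable bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")

try:
    raw.decode("utf-8")        # strict decoding raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print(exc)

safe_text = decode_output(raw)  # never raises; the bad byte becomes '\ufffd'
```

Whether replacement, `errors="backslashreplace"`, or working on bytes end-to-end is right depends on what the output is used for; replacement is only the least surprising default for logging.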
[08:43:53] transfer.py should work back again
[08:44:21] Going to add that to the cheatsheet
[08:44:37] something along those lines, not sure about the destination hosts
[08:44:40] if you are going to do 2
[08:44:45] you can do the 2 at the same time
[08:45:19] .tar.gz db2105.codfw.wmnet:/srv/sqldata db21XX.codfw.wmnet:/srv/sqldata
[08:46:32] spoiler-- it won't do the 2 at the same time, for the dbprov it will be faster to do both in parallel to test the 10G
[08:46:49] what?
[08:46:53] will it or will it not?
[08:46:59] Added that line to the cheatsheet
[08:47:03] it intends to be in parallel
[08:47:04] Feel free to add/modify more stuff
[08:47:10] but it is implemented in series
[08:47:42] ah I see
[08:47:48] because it is supposed to do multicast or torrent
[08:48:03] in this case it is better to run it twice at the same time
[08:48:19] and see if we can use the 10G better
[08:48:31] good, let me know when I can run it :)
[08:48:52] it is on cywikibooks
[08:48:56] some more time to go
[08:49:10] no worries :)
[08:57:29] 10DBA, 10Goal: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (10jcrespo)
[08:57:34] 10DBA, 10Patch-For-Review: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 (10jcrespo) 05Open→03Resolved Compression has finished for these hosts.
[09:17:38] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1133.eqiad.wmnet'] ` The log can be found in `/v...
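Since transfer.py accepts multiple destinations but (as noted above) sends to them in series, the workaround agreed on is to launch one transfer.py invocation per destination and let them run concurrently, which better exercises the source's 10G link. A hypothetical sketch of that wrapper; the source/destination paths and flags are taken from the log, but the stand-in worker only echoes the command instead of running the real subprocess:

```python
from concurrent.futures import ThreadPoolExecutor

SOURCE = "dbprov2002.codfw.wmnet:/srv/backups/snapshots/latest/xxxx.tar.gz"
DESTINATIONS = [
    "db2105.codfw.wmnet:/srv/sqldata",
    "db2109.codfw.wmnet:/srv/sqldata",
]

def run_transfer(dest: str) -> str:
    # In real use this would be roughly:
    #   subprocess.run(["transfer.py", "--no-checksum", "--no-encrypt",
    #                   "--type=decompress", SOURCE, dest], check=True)
    # Here we just return the work item so the sketch is self-contained.
    return f"decompress {SOURCE} -> {dest}"

# One thread per destination: each thread would block on its own
# transfer.py process, so the two copies overlap in time.
with ThreadPoolExecutor(max_workers=len(DESTINATIONS)) as pool:
    results = list(pool.map(run_transfer, DESTINATIONS))
```

`ThreadPoolExecutor.map` preserves input order, so results line up with `DESTINATIONS` even if the transfers finish out of order.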
[09:19:43] I am going to stop the mysqls on the old dbstores
[09:19:51] cool
[09:20:13] this doesn't have to happen now, but you should be thinking about whether there is something you need to keep from there
[09:20:27] yeah, I just pinged chase and john
[09:20:31] to see if they can come back to me
[09:21:02] I won't send it to dcops until a few weeks pass
[09:21:13] at the very least
[10:22:50] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1133.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1133.eqiad.wmnet'] `
[10:35:04] I'm stupid
[10:36:07] your words!
[10:39:18] https://gerrit.wikimedia.org/r/509034
[10:40:10] haha
[10:42:29] that is some weird diff output: https://gerrit.wikimedia.org/r/#/c/operations/software/wmfmariadbpy/+/509034/2/wmfmariadbpy/transfer.py
[10:51:43] prepare is running now
[10:51:50] \o/
[13:16:42] how's the prepare going?
[13:28:19] Backup s3 generated correctly.
[13:28:38] that's dbprov2002
[13:28:47] nice
[13:28:54] So I can trigger the decompression?
[13:29:04] yes
[13:29:08] if
[13:29:09] \o/
[13:29:16] code is updated to the latest version
[13:29:24] ok, will update my git
[13:29:57] I mean puppet on the hosts
[13:30:02] because before it failed
[13:30:09] ah
[13:30:12] I will run it then
[13:32:35] here we go!
[13:34:44] is it working?
[13:34:50] yeah!
[13:35:12] I put the two hosts to do it in "parallel" as you suggested
[13:35:16] so far db2105 is being copied
[13:35:35] we could change the parallelism to be real
[13:35:59] I didn't think of the 10G -> 1G use case at the time
[13:47:17] are you using db2098?
[13:47:24] nope
[13:47:34] I want to restart replication there
[13:47:42] because of the previous bug
[13:47:45] go for it!
[13:49:38] before you leave, let's please sync on what things are ongoing/pending
[13:50:05] Sure, I will probably work a couple more hours
[13:50:16] I will let you know before I leave
[13:50:31] yeah, I don't mind if it is 5 minutes or 4 hours, I created a bit of a problem with that bug
[13:50:48] and want to make sure all alerts are accounted for
[13:51:14] to see if I can leave https://gerrit.wikimedia.org/r/509012 deployed
[13:51:21] so from my side I am touching: db1138 (not in tendril yet), db1081, db2105 and db2109 (neither of those codfw hosts are in tendril)
[13:52:10] ok, that matches the downtimes
[13:52:58] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509012/2/modules/profile/templates/mariadb/daily-snapshots-eqiad.cnf.erb -> any reason to have stop_slave true on some but not on others? i.e. s8 has stop_slave: True but that is not the case with s2, s3 or s5
[13:53:13] at the moment it is a guess
[13:53:25] I believe it will be much faster on hosts with many writes
[13:53:39] ah ok, just checking if it was intentional :)
[13:53:41] but it is a bit of a test
[13:54:10] the good news is that I had left it like that but with the bug and it still worked
[13:54:17] so that should only make it faster
[13:54:22] haha
[13:54:31] but we will see if it takes 4 hours or 24 hours
[13:54:36] fine by me, I am not touching any of those :)
[13:54:41] so I just +1ed it
[13:54:53] it is a bit of a "let's gather data, then review" approach
[13:55:02] I am going to grab a coffee, I need a break from writing perf reviews :)
[13:55:08] please do
[13:55:18] * marostegui goes for a chai!
[14:56:18] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[14:57:44] db2105 finished the decompression and GTID worked, replication is flowing
[15:00:20] I will update the cheatsheet
[15:02:23] same with db2109 :)
[15:08:48] Worked like a charm, great job!
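The stop_slave question above amounts to a per-section override with a safe default: only sections explicitly marked (s8 here) stop replication during the snapshot, while s2/s3/s5 keep replicating. A hypothetical sketch of that lookup logic; the dictionary layout and function name only mirror the shape of the daily-snapshots config being reviewed, they are not the actual puppet or wmfmariadbpy code:

```python
# Per-section snapshot options; stop_slave defaults to False when absent.
SECTIONS = {
    "s2": {},
    "s3": {},
    "s5": {},
    "s8": {"stop_slave": True},
}

def should_stop_slave(section: str) -> bool:
    """Return True only when the section explicitly opts into stopping replication."""
    return bool(SECTIONS.get(section, {}).get("stop_slave", False))
```

The default-False design matches the reasoning in the log: stopping the replica is an optimization guess for write-heavy sections, so it should be opt-in per section rather than the global behavior.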
[15:23:04] 10DBA, 10Operations, 10ops-codfw: db2112 doesn't show service tag in idrac - https://phabricator.wikimedia.org/T222845 (10RobH)
[15:24:34] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[15:24:42] 10DBA, 10Operations, 10ops-codfw: db2112 doesn't show service tag in idrac - https://phabricator.wikimedia.org/T222845 (10Marostegui)
[15:39:04] recap of hosts
[15:39:08] db1081, already pooled in production
[15:39:16] db1138 ready with data but won't be pooled today
[15:39:26] db2105 and db2109 already on s3 but won't be pooled today
[15:39:33] db2112 HW maintenance
[15:39:37] db2114 HW maintenance
[15:40:04] thanks
[15:59:31] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH)
[15:59:44] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH)
[16:06:54] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui)
[16:08:19] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui)
[16:08:45] 10DBA, 10Operations, 10ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) 05Open→03Resolved @Marostegui @jcrespo main board replacement complete on db2114. The problem has been resolved. You can take over now. System Board Fan1A 14% 3480 RPM 840 RPM N/A 480 RPM...
[16:10:06] 10DBA, 10Operations, 10ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Marostegui) Thanks Papaul!