[05:14:49] 10DBA, 10Patch-For-Review: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (10Marostegui) s6 finished compression on labsdb1012, so this host is ready to get all the other tables compressed, like labsdb1009 is having at the moment.
[05:19:15] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[07:57:15] I am transferring /srv/backups/snapshots/latest/snapshot.s5.2019-05-13--23-09-00.tar.gz from dbprov2001.codfw.wmnet
[09:31:28] so
[09:32:09] o/
[09:33:11] So I saw https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/510453/
[09:35:44] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[09:38:44] what do you think of https://gerrit.wikimedia.org/r/c/operations/puppet/+/510453 ?
[09:39:10] also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/510700
[09:39:39] I am checking the last one yeah
[09:39:42] The first one looks good
[09:39:51] I think that is kinda what we agreed on
[09:39:54] (the schedule)
[09:40:32] we may even make it pass one week, it depends
[09:40:45] I am now creating some monitoring
[09:40:59] I won't have time to fix more stuff, but I think monitoring is nice
[09:41:07] sure
[09:41:24] I will test it, but you can revert if something is wrong
[09:41:25] can you also review the cheatsheet and add whatever you think is necessary? i.e. re-launch snapshots if they fail
[09:41:36] where?
[09:41:55] on the wikitech page
[09:42:02] the small section I created
[09:42:12] https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Backups_quick_cheatsheet
[09:42:25] ok, I can add that
[09:42:49] thanks, that'll be helpful
[09:43:09] so the most common issue I saw
[09:43:15] is the transfer failing
[09:43:44] which at the moment requires a manual rm of the data on ongoing
[09:43:52] what I do is go to zarcillo
[09:43:54] and run
[09:44:22] mysql.py -h db1115 zarcillo
[09:44:32] select * FROM backups where start_date > now() - INTERVAL 1 day and type = 'snapshot' order by section;
[09:44:50] rm the data from the appropriate dbprov host, no?
[09:44:56] yes
[09:44:57] the failed snapshot I mean
[09:45:01] it will be on the metadata
[09:45:06] maybe kill some process
[09:45:11] it will depend on how it failed
[09:45:25] once I saw the prepare stuck
[09:45:37] but mostly, no process ongoing, just the transfer failed
[09:45:56] I didn't normally retry when we had daily backups
[09:46:17] so you rm the data and then re-launch the snapshotting process?
[09:46:18] for example if you run now "select * FROM backups where start_date > now() - INTERVAL 3 day and type = 'snapshot' order by section;"
[09:46:39] you can see one s4 backup on codfw failed
[09:46:54] (it will be more clear when monitoring is in place)
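Putting the check above together, a minimal sketch, assuming the zarcillo backups metadata behaves as in the queries quoted here and that mysql.py is run from a cumin host as shown earlier:

    # open the backup metadata database on db1115 (zarcillo)
    mysql.py -h db1115 zarcillo
    # then, at the mysql prompt, list recent snapshot attempts per section
    select * FROM backups where start_date > now() - INTERVAL 3 day and type = 'snapshot' order by section;
    # a section missing from the output points at a failed or never-started snapshot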
[09:47:11] let me check
[09:47:54] ah yep, it is not there
[09:48:22] both monitoring, retries and prepare based on absolute path will make this unnecessary
[09:48:32] this is just a FYI for now
[09:48:52] ok
[09:49:01] but I don't want to redo the backup system just before I go
[09:49:23] ofc!
[09:49:37] so, given that s4 codfw failed, how would I proceed?
[09:49:38] I thought I had more time, but yesterday I got distracted
[09:50:00] normally, I wouldn't care (except the deletion/kill process)
[09:50:29] since, if there is an ongoing file/dir, the backup stops
[09:50:37] it will be run again
[09:50:47] otherwise, e.g. if many fail, or you want to do it
[09:50:55] ah, so there is a retry already?
[09:51:00] no
[09:51:06] just, it used to happen every day
[09:51:10] ah hehe :)
[09:51:13] and latest is never deleted
[09:51:18] right
[09:51:21] unless it is successful
[09:51:36] e.g. if you go to
[09:52:24] dbprov2001:/srv/backups/snapshots/latest
[09:52:49] you will see snapshot.s4.2019-05-12--22-40-35.tar.gz which is a day later than the others
[09:52:58] but as long as it is there, no big deal
[09:53:19] ah I see
[09:53:28] only archive gets purged
[09:53:37] so now that there will be no snapshots generated every day, what's your recommendation?
[09:53:40] so mostly cleanup (due to a bug)
[09:53:43] generate it manually?
[09:53:47] yeah, that is a more interesting way
[09:53:54] I mean, first try to understand why
[09:53:58] sure
[09:54:06] but let's say you know (e.g. the host had to be put offline)
[09:54:23] running what the cron does is enough, that is
[09:55:11] this: /usr/bin/python3 /usr/local/bin/remote_backup_mariadb.py
[09:55:28] that will do a backup of everything on /etc/mysql/backup.cnf
[09:55:29] will that run for all the snapshots?
[09:55:32] ah right
[09:55:37] so same behaviour as we had with dumps
[09:55:38] if you just want one, you can edit it
[09:55:42] so I should generate a fake config
[09:55:43] right
[09:55:43] the config file
[09:55:46] good
[09:55:48] there is no --config option
[09:55:55] ok
[09:56:09] run it, then let puppet revert the file
[09:56:33] if you do it like that, logging will go to stdout
[09:56:41] (please add all this process to the cheatsheet, as it is easier to read than IRC logs when you are busy in the middle of the week alone! - doesn't have to be super formatted, just copy paste all these commands)
[09:56:50] normally logging on cumin goes to journalctl
[09:57:02] and the local mariadb_backup of the dbprovs
[09:57:07] ah cool
[09:57:13] to /var/log/mariadb-backups
[09:57:28] so you can see if the transfer failed
[09:57:30] or the prepare
[09:57:33] or the compression
[09:57:35] etc.
[09:58:01] ah good
[09:58:03] that's key!
[09:58:05] I don't have any understanding of something that fails constantly
[09:58:25] but with 2 failures I see transfer can sometimes be fragile?
[09:58:32] mariabackup?
[09:58:55] need more failures to find a pattern
[09:59:01] yeah
[09:59:04] hard to say with not many
[09:59:08] it could be pretty much anything
[09:59:32] I think the only ones that failed were the ones that did not stop replication, so could that be it?
[09:59:54] could be
[10:00:04] the xtrabackup log sadly has to be suppressed, because otherwise it overloads the pipe
[10:00:06] you added the stop slave to all of them? just to compare?
[10:00:16] no, but feel free to experiment
[10:00:44] that is backups-cumin1001.cnf.erb et al
[10:00:50] that is easy to manipulate
[10:00:56] you can also add misc
[10:01:08] they should work, but I haven't added them yet
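A condensed sketch of the manual re-run described above, using only the paths quoted in this log; the exact location of the failed snapshot's leftover data and how the script is invoked by cron are not quoted here, so those details are assumptions:

    # on the dbprov host: first remove the leftovers of the failed snapshot
    # (only /srv/backups/snapshots/latest is quoted above; check the sibling
    #  directories for the data left by the failed run)
    ls /srv/backups/snapshots/
    # on the cumin host: temporarily trim /etc/mysql/backup.cnf to the section
    # you want (there is no --config option; puppet will restore the file later)
    $EDITOR /etc/mysql/backup.cnf
    # run what the cron runs; launched by hand, logging goes to stdout
    /usr/bin/python3 /usr/local/bin/remote_backup_mariadb.py
    # afterwards, check the logs on the dbprov host to see whether the
    # transfer, prepare or compression step failed
    ls /var/log/mariadb-backups/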
[13:57:07] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui)
[13:57:47] issues with s7 response time
[13:58:07] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=11&fullscreen&orgId=1&from=1558011480920&to=1558015080920&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
[13:58:14] is there a train ongoing?
[13:58:28] oh wow
[13:59:25] T222772
[13:59:25] T222772: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772
[13:59:32] https://logstash.wikimedia.org/goto/7af37f2092d90c485dc957cafca14ac7
[13:59:33] that
[14:01:09] it matches the spike of 500s on -operations
[14:01:11] on all the DCs
[14:03:03] query and connection latency from localhost looks fine
[14:03:46] I am checking individual hosts on s1 and they look ok so far
[14:05:00] it is now recovered
[14:06:07] I am trying to see if it was only one host, and which ones
[14:07:32] at the very least db1079 is involved
[14:07:51] s1 ones were ok
[14:07:53] it hit 3000 connections
[14:08:09] that's s7, no?
[14:08:14] yes
[14:08:33] on s1 I didn't see any spike on any host
[14:08:39] maybe s7 overloaded the rest (centralauth?)
[14:08:58] threads running was 0 before on db1086
[14:26:29] for anyone affected by the pdu swap in b5 I am ready to remove power on 1 side
[14:43:04] 10DBA, 10Goal, 10Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui)
[15:36:38] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1115
[15:37:04] sweet
[15:37:58] probably in the future we can create a custom dashboard on zarcillo and just leave a single alert there
[15:42:45] marostegui: the disk on db1133 has been replaced
[15:43:09] 10DBA, 10Goal, 10Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui)
[15:43:20] cmjohnson1: lovely! can you destroy+create the raid there?
[15:43:36] cmjohnson1: while I have you here, could you check power on dbproxy1006?
[15:43:55] it is up, but it seems one of the power supplies may be disconnected or toasted
[15:44:33] I can put it down if needed
[15:45:07] (since 25 minutes ago)
[15:45:24] jynus: fixed... sorry, the new pdus require a little more force
[15:45:30] it is ok
[15:45:37] that is why we have alerting :-D
[15:47:32] thanks, I see it up
[15:50:37] jynus: ping me once you've started mysql on db1131 please
[15:50:43] no rush, just let me know when done
[15:51:34] we may wait until tomorrow
[15:51:43] see dcops
[15:54:47] I am not there, but that is ok with me
[16:04:27] cmjohnson1: do you want me to check if the raid has rebuilt fine and ping you if we need you to destroy+create?
[16:04:42] i am rebuilding it now
[16:04:58] nice!
[16:06:04] marostegui: ...there is a bigger issue now... now all but 1 disk is reporting bad, including the newly replaced disk. This will require more work and I am in the middle of other things. I don't have spare cycles at the moment
[16:06:43] cmjohnson1: don't worry, we can keep working with the other hosts, but it sounds like we need a new controller from Dell?
[22:23:35] 10DBA, 10Operations, 10ops-codfw: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Papaul)
[22:24:33] 10DBA, 10Operations, 10ops-codfw: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Papaul) p:05Triage→03Normal
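Related to the s7 spike earlier in this log, a quick per-host check, a sketch only; db1079 is simply the replica named above, and the mysql.py invocation follows the pattern quoted earlier in this log:

    # from a cumin host, connect to a suspect replica
    mysql.py -h db1079
    # then, at the mysql prompt, look at connection and running-thread counts
    SHOW GLOBAL STATUS LIKE 'Threads%';
    # cross-check against the aggregated view in the mysql-aggregated Grafana dashboard linked above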