[05:55:49] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install dbproxy200[1-4] - https://phabricator.wikimedia.org/T223492 (10Marostegui) That rack proposal looks good to me. [06:36:06] 10DBA, 10Goal, 10Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (10Marostegui) [08:50:01] I'm starting up the B5 servers [08:50:12] great! [08:51:41] I was expecting s1 and s6 snapshots to fail, but apparently monitoring is complaining about s4 [08:51:55] that is the one that failed yesterday, no? [08:51:59] that we talked about [08:53:15] it failed last time, but I expected it to work yesterday [08:53:34] I will check what happened after I put these hosts up [08:53:40] ok [09:07:26] do you have some time to work hand in hand so you can see the debugging? [09:08:37] db1098, db1131 and db1139 are up but will take some time to catch up [09:08:59] yeah let's do that [09:09:11] Re: [17:50] jynus: ping me once you've started mysql on db1131 please [09:09:36] yep, taking care of that [09:09:36] thanks [09:10:11] mysql.py -h db1115 zarcillo [09:10:28] yep [09:10:46] select * FROM backups where start_date > now() - INTERVAL 3 day and type = 'snapshot' order by section; [09:11:04] (the interval will depend on the hour, but you get the idea) [09:11:12] we expect 18 backups [09:11:29] right [09:11:30] we get 15 [09:11:50] we can do a better query, but it is obious to me what are some of the isssues [09:11:52] yeah, missing s1, s4, s6 [09:11:59] s1 and s6 [09:12:04] the source host was down [09:12:10] ok [09:12:31] not alerting yet because it has a 4 day timeout [09:12:39] I think it would alert tomorrow [09:13:12] snapshot.s3.2019-05-17--02-25-00 also is ongoing [09:13:19] so not a success [09:13:21] so better to also re-run s1 and s6? [09:13:39] yeah, I would do that, but those are expected failures, I would focus on s3 and s4 [09:13:47] for s3 [09:13:51] s3 likely to fail? [09:14:06] dbprov1002:~$ ps aux | grep s3 [09:14:13] I see processes are ongoing [09:14:37] and s3 is one of the latests backups to run [09:14:42] so nothing strange so far [09:14:42] indeed [09:14:44] i see it [09:15:05] maybe we need to give the prepare more memory? [09:15:12] or it could be stuck for some reason [09:15:22] I normally give it 24 hours, then kill the process [09:15:37] let's do that then [09:15:44] let's wait [09:15:44] (that will be done automatically soon (TM)) [09:15:50] but not yet [09:15:52] ok [09:16:17] once you document all this on the cheatsheet, would you mind adding the cleanup needed once you've manually killed a backup? [09:16:28] (removing the data and all that) [09:16:33] but just to have it documented with a quick command [09:16:45] yeah, we will probably be doing this for s1, s6 and maybe s4 [09:16:54] ok [09:17:06] dbprov1002:/srv/backups/snapshots/ongoing$ ls -la [09:17:14] snapshot.s1.2019-05-16--19-00-02 [09:17:19] that is empty [09:17:32] transfer didn't even start [09:17:38] yeah [09:17:39] so no metadata [09:17:45] the cleanup is easy [09:17:47] so can I safely remove it [09:18:00] rmdir snapshot.s1.2019-05-16--19-00-02 [09:18:11] done [09:18:16] if it has metadata, and doesn't update automatically [09:18:24] I set it from ongoing to failed on the db [09:18:31] ah ok [09:18:47] again, all these will hopefully be automated [09:18:53] sure [09:19:00] and in the future pileups will be prevented [09:19:40] so s1 and s6 need cleanup and maybe retry [09:19:45] now checking s4 [09:20:29] it has failed twice already (that is why it is alerting), so maybe there is a larger issue [09:20:58] should I clean up dbprov1001 ongoing for s6? it is empty too [09:21:13] yes [09:21:16] ok [09:21:19] we know why this failed [09:21:24] done [09:21:28] although let's check the logs [09:23:20] for s4? [09:24:54] May 17 02:24:58 cumin1001 mariadb-snapshots[42873]: [02:24:58]: INFO - Running XtraBackup at db1139.eqiad.wmnet:3316 and sending it to dbprov1001.eqiad.wmnet [09:25:02] May 17 02:25:00 cumin1001 mariadb-snapshots[42873]: [02:25:00]: ERROR - Transfer failed! [09:25:13] so it fails in 2 seconds [09:25:21] where is that? [09:25:26] journalctl in cumin? [09:25:32] yes [09:25:39] mariadb-snapshots subject [09:26:37] I see it yeah [09:27:05] May 16 19:00:02 cumin1001 mariadb-snapshots[42873]: [19:00:02]: INFO - Running XtraBackup at db1139.eqiad.wmnet:3311 and sending it to dbprov1002.eqiad.wmnet [09:27:07] May 16 19:00:05 cumin1001 mariadb-snapshots[42873]: [19:00:05]: ERROR - Transfer failed! [09:27:09] same for s1 [09:27:26] I can do better debugging info, probably [09:27:33] but in this case we know what happened [09:27:39] (host was down) [09:27:44] so transfer will fail [09:28:18] still needs cleanup because for now I do not delete anything in case it would be recoverable (e.g. --only-postprocess) [09:29:08] I think 10G may be too low [09:29:20] and I could do 20GB on prepare [09:29:27] at the very least [09:29:34] anyway, let's see about s4 on codfw [09:29:47] yeah [09:31:05] May 16 21:34:36 cumin2001 mariadb-snapshots[25343]: [21:34:36]: ERROR - Transfer failed! [09:31:20] source according to config is db2099:3314 [09:31:41] and that host looks to be up [09:32:17] takes a few more seconds to fail, however [09:32:28] so maybe it run for a few seconds [09:32:38] maybe nc isn't able run? [09:33:00] it is one of the options [09:33:17] or an xtrabackup application error [09:33:33] sadly I had to remove logs for xtrabackup because they were too verbose [09:33:51] I may need to check an option to only log the errors so I can retrive them back [09:33:59] what I would do is retry the backup [09:34:31] as it may fail again, as it failed twice already in a row [09:34:47] maybe the port is in use [09:34:55] maybe the netcat time to open socket is too low [09:35:13] I was checking the opened ports [09:35:22] at least we know transfer.pt is failing 9 seconds after start [09:35:55] I am guessing there is an empty dir at dbprov2001.codfw.wmnet ? [09:36:09] that would make the transfer fail? [09:36:24] root@dbprov2001:/srv/backups/snapshots/ongoing# ls [09:36:25] snapshot.s4.2019-05-16--21-34-25 [09:36:29] empty, yeah [09:36:44] and host can be connected as root on the replica? [09:36:52] what? [09:37:38] like the replica is running, can be connected, etc. [09:37:42] interesting [09:37:49] netcat is still listenting [09:38:35] I don't see it [09:38:37] on db2099 [09:38:54] listening on dbprov2001 [09:39:00] ah, sorry [09:39:18] I will leave it until you see it, then kill and cleanup [09:39:21] oh, there it is [09:39:26] tcp 0 0 0.0.0.0:4446 0.0.0.0:* LISTEN 2950/nc [09:39:32] pid 2949 [09:39:44] I was looknig into db2099 :) [09:39:48] 50 is listening [09:39:54] I guess 49 is the parent [09:40:00] (the pipe) [09:40:12] yep, I see it [09:40:38] it has been there since 16th may [09:40:44] so I will kill 2949 and 2950 [09:40:52] yes, but last night [09:41:01] it is the last one, not a previous one stuck [09:41:15] you can see it based on the dir it will extract [09:41:25] indeed [09:41:30] apparently transfer doesn't kill the process if sender fails [09:41:33] that is a bug [09:41:41] but it gives us more clues [09:41:47] yeah, it is useful [09:42:03] the listening started well - it could bee too late, but it started [09:42:22] so either it needs more wait or most probably xtrabackup issues [09:42:52] ok, killing, cleaning up everthing and retrying [09:42:56] ok [09:43:09] the monitoring worked well at least [09:43:25] yeaj [09:43:55] interestingly iptables seams clean [09:44:01] that could be another reason [09:44:14] that it failed to insert the rule? [09:44:20] that, or just maybe the cleanup worked [09:46:09] the clean up would remove the iptables rule but not kill the netcat? [09:51:12] I would do all cleanup I can [09:51:25] except on the still hopeful s3 [09:51:37] ps, directories, cumin, etc. [09:51:38] No, I mean the process itself [09:51:51] yeah, I killed everthing [09:52:06] No, sorry, I mean the clean up process the script does [09:52:17] So it is able to remove the iptable rule but won't kill the netcat? [09:52:25] or maybe the iptables rule was never inserted [09:52:28] I an unsure, I will have to check it [09:52:36] yeah, both options are on the table [09:52:53] that is why the easiest I think is to retry see what happens [09:53:06] indeed [09:55:21] so I am going to start with the easy, retry s1 and s4 which should work well [09:55:33] s6? [09:55:43] the easy ones were s1 and s6, no? [09:55:52] oh yes [09:55:54] sorry [09:55:56] that is what I meant [09:56:00] cool! [09:56:09] thanks [09:57:07] ongoing now, if you rush, you can see the backup.cnf [09:57:10] on cumin1001 [09:57:21] on my way! [09:57:31] I seeeee [09:57:34] for the command, because I never remember [09:57:42] I just crontab -lu root [09:57:59] 0 19 * * 0,2,4 /usr/bin/systemd-cat -t mariadb-snapshots /usr/bin/python3 /usr/local/bin/remote_backup_mariadb.py 2>&1 [09:58:00] nice [09:58:04] and do not loging to journalctl, just the the screen session [09:58:10] yeah [09:58:28] now I will do cumin2001 [09:58:43] for s4 [09:58:44] no? [10:00:24] that is the intention [10:00:41] I will set some ps s on dbprov [10:00:48] so the procedure would be the same, no? modify the config file and re-run it again [10:00:54] yeah [10:05:57] all that monitoring, and now it seems to work [10:06:11] \o/ [10:07:28] it took 2 seconds between netcat running or so, so I wonder if it is a race condition? [10:07:41] between netcat and the transfer start? [10:07:45] after all, it worked previous times [10:07:58] yeah , I may hack so I wait for the socket to be open? [10:08:03] yeah [10:08:07] or wait 2 seconds [10:08:08] it would hurt to wait maybe 10 seconds [10:08:11] maybe when it is loaded [10:08:18] it takes more time [10:08:24] probably [10:08:32] it doesn't hurt to add a few more seconds [10:09:23] btw, what's the status of labsdb1009? [10:09:25] still compressing? [10:09:35] (I see replication stopped) [10:09:38] yeah, I wanted to talk about ongoing stuff [10:09:40] So I guess it is stillon-going? [10:09:41] like that [10:09:48] just to finish backups [10:09:50] we can do that after lunch [10:09:52] no rush [10:09:59] s3 doesn't show a lot of activity on disk [10:10:07] so my bet is it is stuck [10:10:08] so likely to fail? :( [10:10:30] as it will be 100% transfered, you can try a only-prepare [10:10:45] or retry everything [10:10:48] as you wish [10:11:03] how do you know it is 100% transferred? [10:11:17] it only starts preparing if it is 100% transfered [10:11:20] it does checks [10:11:23] ah I see [10:11:27] to see the last file is there [10:11:41] (metadata is there and says completed) [10:12:14] so let's do a only-prepare [10:12:15] there can be corruption, but ovious corrution/failure/uncomletion is checked [10:12:34] that is also set on the config file? [10:12:40] (to do just the only prepare) [10:12:44] and re-run the process [10:12:56] you will need a new config [10:13:06] as there is no local config [10:13:16] for snapshots, it uses the one on cumin remotelly [10:13:24] you can see the command on the log [10:13:27] I think [10:13:59] [03:58:39]: DEBUG - ['/opt/wmf-mariadb101/bin/mariabackup', '--prepare', '--target-dir', '/srv/backups/snapshots/ongoing/snapshot.x1.2019-05-17--03-47-47', '--use-memory', '10G'] [10:14:19] that is for prepare [10:14:33] you want the whole thing, but I think it gets truncated on the cumin log [10:14:46] that's why we need it on the cheatsheet :p [10:15:01] well, the idea is nothing of this should be normally done [10:15:27] the problem with documenting while WIP [10:15:39] is that I have changed the parameters already 10 times [10:15:42] haha [10:15:44] so I want to do that at the end [10:15:49] nah, it is just a copy&paste for the command [10:15:52] and what it does [10:16:00] "if you need to retry -> do this" [10:16:05] no more than that is really needed for now I think [10:16:20] just to debug issues next week if needed [10:16:35] and do this kind of "usual" things as retrying things or post-processing things [10:17:14] actually [10:17:20] using the one on backup.cnf [10:17:28] and just adding only_prepare [10:17:31] should work [10:17:45] that skips the transfer and does the right thing [10:18:03] no need to run backup_mariadb when you can run remote_backup_mariadb [10:18:12] like https://phabricator.wikimedia.org/P8536 but for codfw? [10:19:09] let me double check the paramter [10:19:48] only_postprocess: True [10:20:01] is the parameter (it fails on wrong parameters) [10:20:22] sorry about that [10:20:38] but only_postprocess does mariabackup --prepare [10:20:43] hence the confussion [10:21:29] so https://phabricator.wikimedia.org/P8536#51127 [10:21:31] on cumin1001 [10:21:33] right? [10:22:13] yes, I cannot remember if it failed on eqiad or codfw, but I trust you did :-D [10:22:16] or stuck [10:22:20] eqiad [10:22:34] but we do have to do clean up first or something? [10:22:35] give it a few hours if you want, maybe it resurects [10:22:42] no, prepare expects to have the files there [10:22:46] ah [10:22:58] if you clean up is the same [10:23:08] but remove the only_postprocess to get a new transfder [10:23:32] so it would be a matter of modifying the config and then running /usr/bin/python3 /usr/local/bin/remote_backup_mariadb (as per the crontab) [10:23:47] so the idea is remote_backup_mariadb is a wrapper around the local backup_mariadb [10:23:57] .py [10:24:00] you misssed [10:24:06] ah yes [10:24:10] copy paste failure :) [10:24:11] thanks [10:24:12] you can check the current ongoing process on cumin1 and 2 [10:24:17] and the screens of course [10:24:20] let me detach [10:24:31] no need to [10:24:36] I attached with you there [10:24:37] if for some reason you have issues [10:24:38] yeah, I see it [10:24:52] remember cleaning up and the cron will take care anyway :-D [10:25:24] haha [10:25:28] full automation will come soon [10:25:30] you mean deleting the ongoing dir [10:25:31] no? [10:25:32] I promise [10:25:38] hahaha I trust you! [10:25:43] yep, see if there is any hanged process [10:25:52] yeah [10:25:53] set the db backup as failed [10:26:02] *db row [10:26:06] yeah [10:26:12] that last part is normally intelligent [10:26:29] e.g. it ignores ongoing older than 1 day [10:26:46] but better set it manually until automation is there [10:27:09] want pause or continue 5 minutes with labsdb? [10:27:19] I will go for lunch at 1pm [10:27:22] so we have time if you want [10:27:25] ok [10:27:28] so labsdb1009 [10:28:25] it is running from cumin1001 ( jaime_compress_labsdb1009 ) [10:28:43] jaime so it was different from your process on the other labsdb [10:28:54] it was going on commons last time I checked [10:28:59] had already done enwiki [10:29:10] it may take the whole weekend [10:29:15] ah yes [10:29:16] I see it [10:29:28] so, once it is finished it will start replication, cool [10:29:36] I did stop all slaves; --- exactly [10:29:59] the script is mostly trivial, I chose to stop + algorithm copy [10:30:13] beacause I am having constatn buffer issues for large tables if online [10:30:18] and I think it will be faster [10:30:21] buffer issues? [10:30:27] the alter failing [10:30:32] if all concurrent writes [10:30:36] ah right [10:30:49] you can do it anyway you want [10:30:58] I am just justifying my decisions 0:-D [10:31:09] no no, I just didn't understand what you meant with buffer issues [10:31:20] I know, just giving context [10:31:30] the imporatant thing here is [10:31:31] ok, I will check on monday [10:31:46] I will assign the ticket to you [10:31:52] sounds good [10:31:53] so you don't forget (I would) [10:32:04] that is not the important thing [10:32:22] the one is brooke has to do maintenance [10:32:29] and I asked to coordinate [10:32:42] so probably we will wait for alter to finish [10:32:53] then repool, etc. [10:32:58] and depool 1010 [10:33:13] and let her do the views first [10:33:27] sounds good [10:33:29] I can coordinate with her [10:33:34] I think anomie and her is blocked there [10:33:42] just talk to them to coordinate [10:33:59] I didn't want to have only 1 host pooled [10:34:09] play with the roles, now I have [10:34:14] 9 depooled 100% [10:34:35] 10 as analytics 100% [10:34:44] 11 as web 70%, analytics 30% [10:35:02] cool, the way we normally do it [10:35:05] be careful with the query killer [10:35:19] I didn't change it, but may be an issue on the other depools [10:35:27] I know all these you know [10:35:36] just going through everything [10:35:42] I will revert this once done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/510126/ [10:35:52] yep, then organize as you wish [10:35:56] or as you can [10:36:15] ah, one last thing [10:36:29] I like to catch up replication from time to time [10:36:41] so when it is altering a small wiki [10:36:53] I hit contrl^c hard :-D [10:37:21] then restart on the list, that is why the tail -n +7300 [10:37:35] to skip the first 7K lines [10:37:44] or whatever [10:37:48] I hope that was clear [10:37:48] haha ok! [10:38:14] I kill it, then start replication, then restart it from the wiki it finished [10:38:19] yep [10:38:27] or you can regenereate the list, it takes 30 minutes [10:38:29] as you wish [10:38:43] (hooray for mariadb data dictionary) [10:39:03] I have not done the drops we discussed [10:39:13] I guess not a huge issue at the moment [10:39:15] yeah, that requires more time [10:39:21] also s6 and s7 can be fully compressed [10:39:25] I am compressing s7 now on 1012 [10:40:05] https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=12&fullscreen&orgId=1&var-server=labsdb1009&var-datasource=eqiad%20prometheus%2Fops&from=1557484797995&to=1558089537995 [10:40:22] I might compress only slots tables for now [10:40:30] you will be the boss [10:40:39] Anyways, I will put the stuff on the ticket [10:41:08] other than that, I expect to repool the 3 hosts today [10:41:27] great [10:41:27] and as you know, I was doing mostly backups [10:41:42] which is why things are still manual or not working well [10:43:17] don't worry! [10:44:12] I will ack 1115, db1139 and db2099 alerts with expiration on monday morning [10:44:28] (stop slave due to backups and backup alert) [10:45:52] sounds good [10:46:48] one thing that I do not know if it is 100% clear about backup configuration [10:47:15] is that options without a section apply to all sections, and those in a section apply only to that section [10:47:23]