[09:24:57] hi everybody!
[09:25:15] I'd like to ask you for a suggestion on https://phabricator.wikimedia.org/T108850
[09:25:48] so the purging script has been working fine on db1108 and we triple checked data on those tables after the sanitization, nothing weird came up
[09:25:57] and a cron is currently purging daily data
[09:26:52] now we want to do the same on db1107, but since it is the master the paranoid me would feel much better if there was a backup of all the tables before proceeding
[09:28:35] /srv/sqldata on db1107 is ~761G, that should become way less with mysqldump+gzip.. the idea would be to store the dump on one host temporarily and then upload it to hdfs
[09:32:01] we don't do that
[09:32:14] we stop, we clone and compress
[09:32:22] otherwise, recovering would be a pain
[09:32:35] unless you expect only some tables will be broken
[09:32:49] in which case, we use mydumper
[09:34:05] 761G of binary data normally compresses about 5x, but with tokudb it may not be worth even compressing
[09:35:45] I would stop the server, make a clone on /srv/backups (remember to eventually delete it, as it would be a violation of the privacy policy)
[09:36:06] and then dump each table individually with mydumper
[09:36:14] with the same restrictions
[09:36:28] note that they are not really master-slave
[09:36:36] so purging is independent
[09:36:44] or it should be
[09:42:23] when you say "make a clone" do you mean copying the /srv/sqldata dir or something else?
[09:43:18] (I am trying to understand the procedure)
[09:43:34] stop, cp, start
[09:44:06] a cp should not take more than a very few minutes
[09:44:26] and why do I also need to mydump each table?
[09:44:27] probably less on the ssds
[09:44:47] you do not need it "unless you expect only some tables will be broken"
[09:45:28] we can recover single tables from a cloning, it is just a longer process
[09:45:51] so if you anticipate problems, we can do more work in advance
[09:46:08] the thing is that recovering from a binary copy is instantaneous
[09:46:28] from a logical file, it can take weeks
[09:47:55] so it is just 2 methods, each with its advantages and disadvantages
[09:48:39] so I don't expect problems, my idea is to apply the same purging strategy that we have on db1108 to db1107. My main fear is that, say, we discover a weird bug in January in the purging script and some data has been wrongly "sanitized" or deleted
[09:49:05] so, if you do not expect problems
[09:49:16] I would go for the binary copy
[09:49:22] much faster
[09:49:31] there is one caveat on the whole process
[09:49:55] m4-master points to db1107
[09:50:08] if you put it down, it will automatically fail over to db1108
[09:50:29] check if you want to do that, and how
[09:51:00] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844020 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1112.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
[09:51:02] or just run mydumper locally
[09:51:07] my idea is to stop the eventlogging process that does the mysql inserts on eventlog1001, so nothing should push data
[09:51:22] you have some examples on the backup hosts - that way you do not need to stop the server
[09:51:35] sadly, xtrabackup will not work on tokudb tables
[09:52:21] do you want me to do that for you?
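A minimal sketch of the two approaches discussed above (binary clone vs. per-table logical dump); the service name, database name and mydumper options are illustrative assumptions, not the exact commands run on db1107:

```bash
# Option 1: binary clone -- fast to take and, above all, fast to restore,
# but MariaDB has to be stopped while the copy runs.
systemctl stop mariadb
cp -a /srv/sqldata /srv/backups/sqldata-$(date +%Y%m%d)   # delete it once no longer needed (privacy policy)
systemctl start mariadb

# Option 2: logical per-table dump with mydumper -- a full restore is much slower,
# but single tables can be recovered easily and the server can stay up.
# (xtrabackup is not an option here because it does not handle TokuDB tables.)
mydumper --host=localhost --database=log \
         --outputdir=/srv/backups/export-$(date +%Y%m%d-%H%M%S) \
         --compress --threads=4
```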
[09:53:15] that'd be great but I didn't mean to distract you from other tasks
[09:55:57] create a task
[09:56:09] and say when exactly you want it done
[09:56:48] I warn you a logical copy may take some time to complete
[09:57:02] 3 hours, maybe more
[10:04:51] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844044 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1112.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1112.eqiad.wmnet'] ```
[10:09:52] If someone can have a look at https://gerrit.wikimedia.org/r/398450 - I cannot run puppet compiler on it
[10:13:01] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844059 (10elukey)
[10:13:20] tried to summarize it all in --^
[10:19:58] labsdb1010 7519099 s51999 dbproxy1010 wikidatawiki_p 1d
[10:20:03] labsdb1010 9023832 s51053 dbproxy1010 enwiki_p 1d
[10:20:07] labsdb1010 9254589 s51434 dbproxy1010 wikidatawiki_p 22h
[10:20:13] labsdb1010 9255250 s51434 dbproxy1010 wikidatawiki_p 22h
[10:21:12] great...
[10:21:21] is it time for the query killer maybe?
[10:32:57] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844111 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1112.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
[11:07:39] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844212 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1112.eqiad.wmnet'] ``` and were **ALL** successful.
[11:13:08] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 10TCB-Team, and 12 others: Allow setting the watchlist table to read-only on a per-wiki basis - https://phabricator.wikimedia.org/T160062#3844224 (10Lea_WMDE)
[12:40:53] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844424 (10Marostegui)
[13:25:46] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844519 (10Jgreen)
[13:26:25] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844528 (10Jgreen) This is a tracking task for improvements for the fundraising (civicrm) database cluster.
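Regarding the long-running labsdb1010 queries pasted above (10:19-10:21) and the "query killer" question, a hedged sketch of how such queries could be listed and killed; the host name, threshold and use of pt-kill are illustrative assumptions, not the tooling actually deployed:

```bash
# List queries that have been running for more than an hour
mysql -h labsdb1010.eqiad.wmnet -e "
  SELECT id, user, db, time, LEFT(info, 80) AS query
  FROM information_schema.processlist
  WHERE command = 'Query' AND time > 3600
  ORDER BY time DESC;"

# One possible "query killer": pt-kill from Percona Toolkit.
# Dry-run with --print first, then switch to --kill once the matches look right.
pt-kill --host labsdb1010.eqiad.wmnet --busy-time 3600 --print
# pt-kill --host labsdb1010.eqiad.wmnet --busy-time 3600 --interval 60 --kill
```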
[13:27:15] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844529 (10Jgreen)
[13:29:04] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844519 (10Jgreen)
[13:33:42] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844543 (10Jgreen)
[13:40:45] 10DBA, 10Analytics, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3844569 (10jcrespo)
[13:40:47] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844568 (10jcrespo)
[13:40:56] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844059 (10jcrespo) a:03jcrespo
[13:43:30] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: test and switch fundraising cluster replication from 'mixed' to 'row' - https://phabricator.wikimedia.org/T183140#3844578 (10Jgreen)
[13:47:54] 10DBA, 10Operations, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3844605 (10jcrespo)
[13:48:11] 10DBA, 10Operations, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#1542524 (10jcrespo)
[14:02:20] 10DBA, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844654 (10jcrespo) Backup is ongoing on db1107:/srv/backups/export-20171218-135659 kill the myd...
[14:02:34] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844655 (10jcrespo)
[14:34:58] elukey: I have no requirement to stop eventlog1001
[14:35:27] I am not complaining either, just saying
[14:36:00] jynus: Marcel was checking the mysql insertion logs from the eventlogging side, and he seemed a bit worried that they were stopped (or maybe lagged a lot), so I decided to stop the mysql process
[14:36:23] yeah, indeed that can have an impact, and it would make the backup faster
[14:42:57] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10User-Daniel: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3844803 (10Lydia_Pintscher)
[15:33:23] elukey: backup is about to finish - maybe it would be nice to go and upgrade the server now?
[15:34:14] jynus: what kind of upgrade?
[15:34:39] mariadb + kernel
[15:35:12] ah sure, I wasn't expecting it but it might be a good moment to do it
[15:35:27] after all, right after a backup is the safest moment
[15:35:37] plus if edits are stopped
[15:35:40] yep yep
[15:35:51] should take < 1 minute
[15:36:01] ack, +1 from me
[15:36:26] ok, it is about to finish, I will log when I stop and restart the server
[15:41:27] Started dump at: 2017-12-18 13:57:00
[15:41:33] Finished dump at: 2017-12-18 15:39:52
[15:49:30] nice
[15:49:37] thanks a lot!
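The "Started dump at" / "Finished dump at" lines quoted above are what mydumper writes to the metadata file in its output directory, alongside the binlog position at dump time; a quick way to check it, using the path from the task comment above:

```bash
# Dump start/finish times and replication coordinates for the logical backup
cat /srv/backups/export-20171218-135659/metadata
```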
[15:53:54] elukey: reboot happened
[15:54:06] I was going to restart sync on db1108
[15:54:12] but I think you disabled it
[15:54:13] wow 4.9.65-3
[15:54:22] so I will let you handle it
[15:54:24] yep I did it
[15:54:34] don't worry, yours is not the first host to be upgraded
[15:54:45] we have done that to many other critical hosts already
[15:55:02] I trust you guys completely, it is only the first time that I see it, that's it :)
[15:55:15] we wouldn't be so happy upgrading if not
[15:55:38] enable everything you need and ping us if you see anything strange
[15:57:03] super
[16:17:07] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3845287 (10jcrespo) Ready to close when @elukey is ready
[16:22:39] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3845300 (10elukey) 05Open>03Resolved Everything looks good, thanks a lot!
[16:22:45] 10DBA, 10Analytics, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3845302 (10elukey)
[19:21:37] jynus: compare.py doesn't support a port different from 3306, no?
[19:21:51] I am going through the code and I cannot see it, but just in case I am missing it :)
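On the non-default-port question, a reminder of how a port other than 3306 is passed on the mysql command line (the port here is just an example); whether compare.py exposes an equivalent option is exactly what is being asked above:

```bash
# -P selects the TCP port; note that with -h localhost the client uses the
# unix socket and silently ignores -P, so use the host name or 127.0.0.1.
mysql -h db1107.eqiad.wmnet -P 3307 -e "SELECT 1;"
```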