[09:24:57] hi everybody!
[09:25:15] I'd like to ask you for a suggestion on https://phabricator.wikimedia.org/T108850
[09:25:48] so the purging script has been working fine on db1108 and we triple checked data on those tables after the sanitization, nothing weird came up
[09:25:57] and a cron is currently purging daily data
[09:26:52] now we want to do the same on db1107, but since it is the master the paranoid me would feel much better if there was a backup of all the tables before proceeding
[09:28:35] /srv/sqldata on db1107 is ~761G, that should become way less with mysqldump+gzip.. the idea would be to store the dump on one host temporarily and then upload it to hdfs
[09:32:01] we don't do that
[09:32:14] we stop, we clone and compress
[09:32:22] otherwise, recovering would be a pain
[09:32:35] unless you expect only some tables will be broken
[09:32:49] in which case, we use mydumper
[09:34:05] 761G of binary data normally compresses about 5x, but with tokudb it may not be worth even compressing
[09:35:45] I would stop the server, make a clone on /srv/backups (remember to eventually delete it, as it would be a violation of the privacy policy)
[09:36:06] and then dump each table individually with mydumper
[09:36:14] with the same restrictions
[09:36:28] note that they are not really master-slave
[09:36:36] so purging is independent
[09:36:44] or it should be
[09:42:23] when you say "make a clone" do you mean copying the /srv/sqldata dir or something else?
[09:43:18] (I am trying to understand the procedure)
[09:43:34] stop, cp, start
[09:44:06] a cp should not take more than a very few minutes
[09:44:26] and why do I also need to mydump each table?
[09:44:27] probably less on the ssds
[09:44:47] you do not need it "unless you expect only some tables will be broken"
[09:45:28] we can recover single tables from a cloning, it is just a longer process
[09:45:51] so if you anticipate problems, we can do more work in advance
[09:46:08] the thing is that recovering from a binary copy is instantaneous
[09:46:28] from a logical file, it can take weeks
[09:47:55] so it is just 2 methods, each with its advantages and disadvantages
[09:48:39] so I don't expect problems, my idea is to apply the same purging strategy that we have on db1108 to db1107. My main fear is that, say, we discover a weird bug in January in the purging script and some data has been wrongly "sanitized" or deleted
[09:49:05] so, if you do not expect problems
[09:49:16] I would go for the binary copy
[09:49:22] much faster
[09:49:31] there is one caveat on the whole process
[09:49:55] m4-master points to db1107
[09:50:08] if you put it down, it will automatically fail over to db1108
[09:50:29] check if you want to do that, and how
[09:51:00] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844020 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1112.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
[09:51:02] or just run mydumper locally
[09:51:07] my idea is to stop the eventlogging process that does the mysql inserts on eventlog1001, so nothing should push data
[09:51:22] you have some examples on the backup hosts - that way you do not need to stop the server
[09:51:35] sadly, xtrabackup will not work on tokudb tables
[09:52:21] do you want me to do that for you?
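A minimal sketch of the two approaches discussed above (binary clone vs. per-table logical dump); the service name, database name and mydumper options are illustrative assumptions, not the exact commands run on db1107:

```bash
# Option 1: binary clone -- fast to take and, above all, fast to restore,
# but MariaDB has to be stopped while the copy runs.
systemctl stop mariadb
cp -a /srv/sqldata /srv/backups/sqldata-$(date +%Y%m%d)   # delete it once no longer needed (privacy policy)
systemctl start mariadb

# Option 2: logical per-table dump with mydumper -- a full restore is much slower,
# but single tables can be recovered easily and the server can stay up.
# (xtrabackup is not an option here because it does not handle TokuDB tables.)
mydumper --host=localhost --database=log \
         --outputdir=/srv/backups/export-$(date +%Y%m%d-%H%M%S) \
         --compress --threads=4
```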
[09:53:15] that'd be great but I didn't mean to distract you from other tasks
[09:55:57] create a task
[09:56:09] and say when exactly you want it done
[09:56:48] I warn you a logical copy may take some time to complete
[09:57:02] 3 hours, maybe more
[10:04:51] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844044 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1112.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1112.eqiad.wmnet'] ```
[10:09:52] If someone can have a look at https://gerrit.wikimedia.org/r/398450 - I cannot run puppet compiler on it
[10:13:01] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844059 (10elukey)
[10:13:20] tried to summarize it all in --^
[10:19:58] labsdb1010 7519099 s51999 dbproxy1010 wikidatawiki_p 1d
[10:20:03] labsdb1010 9023832 s51053 dbproxy1010 enwiki_p 1d
[10:20:07] labsdb1010 9254589 s51434 dbproxy1010 wikidatawiki_p 22h
[10:20:13] labsdb1010 9255250 s51434 dbproxy1010 wikidatawiki_p 22h
[10:21:12] great...
[10:21:21] is it time for the query killer maybe?
[10:32:57] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844111 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1112.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
[11:07:39] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844212 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1112.eqiad.wmnet'] ``` and were **ALL** successful.
[11:13:08] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 10TCB-Team, and 12 others: Allow setting the watchlist table to read-only on a per-wiki basis - https://phabricator.wikimedia.org/T160062#3844224 (10Lea_WMDE)
[12:40:53] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3844424 (10Marostegui)
[13:25:46] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844519 (10Jgreen)
[13:26:25] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844528 (10Jgreen) This is a tracking task for improvements for the fundraising (civicrm) database cluster.
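Regarding the long-running labsdb1010 queries pasted above (10:19-10:21) and the "query killer" question, a hedged sketch of how such queries could be listed and killed; the host name, threshold and use of pt-kill are illustrative assumptions, not the tooling actually deployed:

```bash
# List queries that have been running for more than an hour
mysql -h labsdb1010.eqiad.wmnet -e "
  SELECT id, user, db, time, LEFT(info, 80) AS query
  FROM information_schema.processlist
  WHERE command = 'Query' AND time > 3600
  ORDER BY time DESC;"

# One possible "query killer": pt-kill from Percona Toolkit.
# Dry-run with --print first, then switch to --kill once the matches look right.
pt-kill --host labsdb1010.eqiad.wmnet --busy-time 3600 --print
# pt-kill --host labsdb1010.eqiad.wmnet --busy-time 3600 --interval 60 --kill
```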
[13:27:15] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844529 (10Jgreen)
[13:29:04] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844519 (10Jgreen)
[13:33:42] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: fundraising database improvements for 2018 - https://phabricator.wikimedia.org/T183138#3844543 (10Jgreen)
[13:40:45] 10DBA, 10Analytics, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3844569 (10jcrespo)
[13:40:47] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844568 (10jcrespo)
[13:40:56] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844059 (10jcrespo) a:03jcrespo
[13:43:30] 10DBA, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Epic: test and switch fundraising cluster replication from 'mixed' to 'row' - https://phabricator.wikimedia.org/T183140#3844578 (10Jgreen)
[13:47:54] 10DBA, 10Operations, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3844605 (10jcrespo)
[13:48:11] 10DBA, 10Operations, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#1542524 (10jcrespo)
[14:02:20] 10DBA, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844654 (10jcrespo) Backup is ongoing on db1107:/srv/backups/export-20171218-135659 kill the myd...
[14:02:34] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844655 (10jcrespo)
[14:34:58] elukey: I have no requirement to stop eventlog1001
[14:35:27] I am not complaining either, just saying
[14:36:00] jynus: Marcel was checking the mysql insertion logs from the eventlogging side, and he seemed a bit worried that they were stopped (or maybe lagged a lot), so I decided to stop the mysql process
[14:36:23] yeah, indeed that can have an impact, and it would make the backup faster
[14:42:57] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10User-Daniel: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3844803 (10Lydia_Pintscher)
[15:33:23] elukey: backup is about to finish - maybe it would be nice to go and upgrade the server now?
[15:34:14] jynus: what kind of upgrade?
[15:34:39] mariadb + kernel
[15:35:12] ah sure, I wasn't expecting it but it might be a good moment to do it
[15:35:27] after all, right after a backup is the safest moment
[15:35:37] plus if edits are stopped
[15:35:40] yep yep
[15:35:51] should take < 1 minute
[15:36:01] ack, +1 from me
[15:36:26] ok, it is about to finish, I will log when I stop and restart the server
[15:41:27] Started dump at: 2017-12-18 13:57:00
[15:41:33] Finished dump at: 2017-12-18 15:39:52
[15:49:30] nice
[15:49:37] thanks a lot!
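The "Started dump at" / "Finished dump at" lines quoted above are what mydumper writes to the metadata file in its output directory, alongside the binlog position at dump time; a quick way to check it, using the path from the task comment above:

```bash
# Dump start/finish times and replication coordinates for the logical backup
cat /srv/backups/export-20171218-135659/metadata
```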
[15:53:54] elukey: reboot happened
[15:54:06] I was going to restart sync on db1108
[15:54:12] but I think you disabled it
[15:54:13] wow 4.9.65-3
[15:54:22] so I will let you handle it
[15:54:24] yep I did it
[15:54:34] don't worry, yours is not the first host to be upgraded
[15:54:45] we have done that to many other critical hosts already
[15:55:02] I trust you guys completely, it is only the first time that I see it, that's it :)
[15:55:15] we wouldn't be so happy upgrading if not
[15:55:38] enable everything you need and ping us if you see anything strange
[15:57:03] super
[16:17:07] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3845287 (10jcrespo) Ready to close when @elukey is ready
[16:22:39] 10DBA, 10Analytics-Kanban, 10User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3845300 (10elukey) 05Open>03Resolved Everything looks good, thanks a lot!
[16:22:45] 10DBA, 10Analytics, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3845302 (10elukey)
[19:21:37] jynus: compare.py doesn't support a port different from 3306, no?
[19:21:51] I am going through the code and I cannot see it, but just in case I am missing it :)
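On the non-default-port question, a reminder of how a port other than 3306 is passed on the mysql command line (the port here is just an example); whether compare.py exposes an equivalent option is exactly what is being asked above:

```bash
# -P selects the TCP port; note that with -h localhost the client uses the
# unix socket and silently ignores -P, so use the host name or 127.0.0.1.
mysql -h db1107.eqiad.wmnet -P 3307 -e "SELECT 1;"
```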