[05:34:49] DBA, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui) >>! In T239170#5694038, @jcrespo wrote: > > The process wasn't well documented, so I added it explicitly at https://wikitech.wikimedia.org/wiki/MariaDB/Backups#...
[05:43:49] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[05:48:27] DBA, Operations, ops-codfw: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) Open→Resolved All the wikis have had their main tables checked and there is no apparent data drift, so I am going to repool this host and consider this fixed for now. If it happens again, we...
[06:21:10] DBA: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (Marostegui) m5 codfw's master is now db2135.
[06:21:17] DBA: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (Marostegui)
[06:22:18] DBA: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (Marostegui) Open→Resolved All the hosts have been productionized and are now masters of mX on codfw. Pending is decommissioning old hosts - that is being tracked at: T228258
[06:22:20] DBA, Operations, ops-codfw: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Marostegui)
[06:29:25] DBA, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui) @Andrew I was checking the grants for the `nova` database and there are no grants defined on `production-m5.sql.erb` for that given database. However, the databa...
[06:53:33] DBA, Patch-For-Review: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Marostegui)
[06:58:14] Blocked-on-schema-change, DBA: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 (Marostegui)
[07:09:28] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[07:44:49] Blocked-on-schema-change, DBA: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 (Marostegui)
[07:45:01] Blocked-on-schema-change, DBA, Core Platform Team: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 (Marostegui)
[09:32:27] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (ArielGlenn) >>! In T143870#5689733, @Marostegui wrote: ... > It was never a vslow host. Ok, that's new and very undesirable behavior. In the past it was always the case that for xml/...
[09:34:57] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (Marostegui) >>! In T143870#5696254, @ArielGlenn wrote: >>>! In T143870#5689733, @Marostegui wrote: > ... >> It was never a vslow host. > > Ok, that's new and very undesirable behavio...
[09:49:57] marostegui: re: Puppet CA, I don't expect you need to do anything *right now*, but you'll need to have all DBs restarted before the old CA expires (next July) (cc jbond42)
[09:50:26] volans: I am just worried about replication
[09:50:49] DBA, Operations, Patch-For-Review, Puppet, User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (jbond) >>! In T236277#5695092, @Marostegui wrote: > @jbond maybe it is a good idea to disable puppet on all databases before merging the change and then trying...
[09:50:50] why would it break?
the change will update the CA file on disk, but mysql already has the CA in memory
[09:51:06] volans: Yeah, we know the theory, but who knows
[09:51:12] in case one gets restarted it gets the new CA cert, which would validate old certs as well as new ones
[09:51:21] so all should be good AFAICT
[09:51:52] volans: yep, but as I said in the comment, I would prefer to test it before starting to enable puppet back on all the hosts
[09:51:57] It should be a quick test with codfw
[09:52:02] sure sure
[09:52:13] like, stopping and starting replication, restarting mysql etc
[09:52:16] I guess we'll disable puppet fleet wide anyway
[09:52:27] volans: that's what the patch commit says I think
[09:53:52] yes I'll disable puppet and re-enable it slowly. As you both say it should be safe, but no harm in being cautious and checking, cheers
[09:55:04] thanks
[09:55:48] I guess we should create a task to track the db restarts before July
[09:56:34] and a cumin one-liner to get uptime ;)
[09:56:52] spam!
[09:57:05] * volans joking, I know you have that already in tendril/zarcillo
[10:25:42] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (ArielGlenn) >>! In T143870#5696255, @Marostegui wrote: >>>! In T143870#5696254, @ArielGlenn wrote: ... >> Ok, that's new and very undesirable behavior. In the past it was always the ca...
[10:26:15] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (Marostegui) Will do - thanks for looking into it.
[10:40:40] https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=idp2001.wikimedia.org-Monthly-1st-Thu-production-idp&from=1574848528622&to=1574851165544
[10:41:24] only 2 files backed up?
[10:41:35] yeah
[10:42:14] Interesting
[10:42:27] Are those physical files?
[10:42:33] like: ls
[10:42:35] ?
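Regarding the Puppet CA discussion above: a running mysqld keeps the CA it loaded at startup, so only servers restarted after the new CA file was deployed will pick it up. The Python sketch below illustrates the "track the db restarts before July" idea. It assumes per-host uptimes have already been collected (for example, from each server's `Uptime` status variable through a cumin one-liner, which is not shown). The function name, hostnames and dates are hypothetical and not part of the existing tooling.

```python
from datetime import datetime, timedelta, timezone


def hosts_needing_restart(uptimes, now, new_ca_deployed_at):
    """Return hosts whose mysqld started before the new CA file was deployed.

    Such servers still hold the old CA in memory and need a restart before
    the old CA expires.

    uptimes: mapping of hostname -> mysqld uptime in seconds (hypothetical
    input, e.g. gathered from `SHOW GLOBAL STATUS LIKE 'Uptime'`).
    """
    stale = []
    for host, uptime_s in uptimes.items():
        started_at = now - timedelta(seconds=uptime_s)
        # Started strictly before the deployment -> still using the old CA.
        if started_at < new_ca_deployed_at:
            stale.append(host)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical hosts and dates, for illustration only.
    now = datetime(2019, 12, 1, tzinfo=timezone.utc)
    ca_deployed = datetime(2019, 11, 28, tzinfo=timezone.utc)
    uptimes = {
        "db-example-1": 10 * 86400,  # started 2019-11-21
        "db-example-2": 3600,        # started 2019-11-30 23:00
        "db-example-3": 2 * 86400,   # started 2019-11-29
    }
    print(hosts_needing_restart(uptimes, now, ca_deployed))  # ['db-example-1']
```

The comparison uses the server start time rather than the raw uptime. That way, the same deployment timestamp can be checked against data collected at different moments.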
[10:42:40] https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=1574848528622&to=1574851165544&var-dc=eqiad%20prometheus%2Fops&var-job=dbprov2001.codfw.wmnet-Monthly-1st-Thu-production-mysql-srv-backups-dumps-latest
[10:42:44] ^ hehe almost 3000 there
[10:42:49] I guess s3 is there? :)
[10:43:14] that is after consolidation
[10:43:28] originally it is around 200,000
[10:43:45] consolidation meaning the packaging, no?
[10:46:15] https://phabricator.wikimedia.org/P9761
[10:46:35] yeah, that's what I meant
[10:46:38] cool
[13:53:01] DBA: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (Ladsgroup) It would be great if the job of rebuildTermItems in mwmaint1002 gets disabled (for example killed) right before the failover. I'm worried the script skips lots of items because of the...
[13:54:44] DBA: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (Marostegui) >>! In T239238#5697067, @Ladsgroup wrote: > It would be great if the job of rebuildTermItems in mwmaint1002 gets disabled (for example killed) right before the failover. I'm worried...
[13:59:51] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (ArielGlenn) Do we know what queries these clients were running? A first pass through the relevant MediaWiki code doesn't show any good suspects.
[14:04:16] DBA: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (Ladsgroup) Yup
[14:09:10] DBA, Dumps-Generation: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 (Marostegui) No, I wasn't able to see any query running, but just the threads connected as shown at T143870#5688483. Tendril also doesn't report any slow query during that timeframe.
[14:10:53] DBA: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (Marostegui)
[14:51:35] DBA, Patch-For-Review, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui) Database created: ` root@db1133.eqiad.wmnet[(none)]> create database if not exists nova_cell0; Query OK, 1 row affected (0.01 sec) root@db...
[14:55:44] DBA, Patch-For-Review, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui) I have created the grants for cloudcontrol hosts for `nova` user, so we can remove `'nova'@'%'` too: ` root@db1133.eqiad.wmnet[(none)]> sho...
[14:58:22] marostegui: lmk when you clean up those older grants and I'll keep an eye on some logfiles
[14:58:26] DBA, Patch-For-Review, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui) Grants added for `nova_cell0`: ` root@db1133.eqiad.wmnet[(none)]> show grants for 'nova'@'208.80.154.23'; show grants for 'nova'@'208.80.15...
[14:58:48] andrewbogott: can you test the new grants for the new database and then we can try to clean up the old ones?
[15:00:47] tip: check the collation to verify it is the one intended
[15:00:56] so we don't have issues in the future again
[15:02:33] that's a good one
[15:02:37] I created it with the defaults
[15:02:58] so binary
[15:02:58] marostegui: the grants work, but I realized that our other databases specify the region name so this one probably should be nova_cell0_eqiad1 instead of just nova_cell0 :(
[15:03:10] sorry :( can it be renamed?
[15:03:24] andrewbogott: I can drop and create it again, but I will have to change the grants too
[15:03:28] is that the only change?
[15:03:34] andrewbogott: did you see the collation issue?
[15:03:50] jynus, is collation == encoding?
[15:05:03] let me rephrase, you want to make sure both the encoding and collation are the ones the application prefers
[15:05:13] encoding = ASCII, utf-8
[15:05:28] collation = case sensitive, case insensitive, accent insensitive, etc.
[15:05:36] ah, ok
[15:05:49] so, it should be utf-8 and case sensitive
[15:05:51] remember the issue with utf8 we had with one of those dbs
[15:05:57] which I think is how the existing nova_eqiad1 db is
[15:06:00] some time ago
[15:06:10] let me check nova_eqiad1
[15:06:18] I think it was on upgrade
[15:07:41] so nova_eqiad1 is showing: | nova_eqiad1 | CREATE DATABASE `nova_eqiad1` /*!40100 DEFAULT CHARACTER SET binary */ |
[15:07:44] let me check the tables
[15:08:10] I don't remember what the issue was honestly, I just remember there was a problem
[15:08:20] yeah, there is a ticket actually, but no idea what the story is
[15:08:23] the tables are utf8
[15:10:11] this is just for andrew to check that
[15:10:17] yep
[15:11:21] to see how important VMs created with emoji names are :-D
[15:11:27] hahaha
[15:11:39] I have never ever gotten the revoke syntax right the first time
[15:11:51] The docs don't really specify encoding on db creation but I see lots of support requests caused by folks not having utf-8
[15:12:03] andrewbogott: that is a safe bet
[15:12:33] utf8 for charset and utf8_general_ci for collation
[15:13:18] our databases are a bit special in that we normally use binary as default because of wmf mediawiki
[15:13:31] so, create database nova_cell0_eqiad1 character set utf8 collate utf8_general_ci ?
[15:13:35] andrewbogott: ^
[15:13:36] yes please :)
[15:13:57] just note no emoji support!
[15:14:20] so no utf8mb4...?
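To illustrate the encoding versus collation distinction jynus describes: the character set decides which characters can be stored at all, and the collation decides which stored strings compare as equal. MySQL/MariaDB `utf8` is the 3-byte variant (utf8mb3), so characters outside the Basic Multilingual Plane, such as most emoji, need `utf8mb4`. The Python sketch below only approximates these rules. The helper names and the three comparison modes are hypothetical simplifications, not MariaDB's actual collation algorithms.

```python
import unicodedata


def fits_utf8mb3(text):
    """True if every character fits in 3 UTF-8 bytes (MySQL/MariaDB `utf8`).

    Code points above U+FFFF (most emoji) need 4 bytes, i.e. `utf8mb4`.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


def _compare_key(text, mode):
    # Rough stand-ins for three kinds of collation; not MariaDB's real rules.
    if mode == "binary":  # byte-for-byte: case and accent sensitive
        return text.encode("utf-8")
    if mode == "case_insensitive":
        return text.casefold()
    if mode == "accent_insensitive":  # also case insensitive
        decomposed = unicodedata.normalize("NFD", text.casefold())
        # Drop combining marks such as the acute accent in "é".
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    raise ValueError(f"unknown mode: {mode}")


def collation_equal(a, b, mode):
    """Compare two strings roughly the way a collation of the given kind would."""
    return _compare_key(a, mode) == _compare_key(b, mode)


if __name__ == "__main__":
    print(fits_utf8mb3("nova-vm-01"))                               # True
    print(fits_utf8mb3("nova-vm-\U0001F642"))                       # False
    print(collation_equal("Nova", "nova", "binary"))                # False
    print(collation_equal("Nova", "nova", "case_insensitive"))      # True
    print(collation_equal("caf\u00e9", "cafe", "case_insensitive")) # False
    print(collation_equal("Caf\u00e9", "cafe", "accent_insensitive"))  # True
```

This mirrors the trade-off in the conversation. `utf8` with a `_ci` collation is a safe default for OpenStack, but a VM name containing an emoji would not fit, and `Nova` and `nova` would compare as equal.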
[15:14:30] I won't feel bad telling users they can't use emojis :)
[15:17:08] DBA, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui) @Andrew requested a different name for the database, so this is the new one with the desired charset/collation: ` root@db1133.eqiad.wmnet[nova_eqiad1]> create da...
[15:17:33] DBA, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui)
[15:18:12] grants look good, I can connect to nova_cell0_eqiad1 from both controllers
[15:18:49] andrewbogott: let's try to get rid of 'nova'@'%' now?
[15:19:05] yep, let's try it
[15:19:32] ok, let's see
[15:20:27] andrewbogott: can you try to connect?
[15:20:35] andrewbogott: 'nova'@'%' is gone
[15:20:39] I can revert if needed
[15:20:45] so far so good...
[15:21:02] I have issued a flush privileges now, just in case
[15:21:07] still good?
[15:21:37] so far so good, going to try a complete end-to-end test
[15:21:53] good
[15:22:01] let me know if I need to revert
[15:26:01] marostegui: all good
[15:26:09] great!
[15:26:22] so, last step, I am going to add the grants to the dump user for the backups
[15:26:36] cool
[15:26:40] I won't be putting that new database into service for a few days but I'll follow up on the ticket if anything interesting happens
[15:27:57] DBA, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Andrew)
[15:28:26] marostegui: I'm going to run out and get a bagel unless there are other checks you want me for
[15:28:43] andrewbogott: go for it, I will close the task once I am done if that's fine
[15:28:57] yep. Thanks again! (thanks jynus also)
[15:29:16] thanks guys!
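Dropping `'nova'@'%'` was safe for the cloudcontrol hosts because MySQL/MariaDB authenticates a client against the matching account with the most specific Host value. Literal IP addresses sort ahead of wildcard patterns, and `'%'` sorts last. The sketch below is a simplified Python model of that rule. It ignores anonymous users, netmasks and hostname resolution. The account list uses documentation IP ranges as hypothetical stand-ins for the real controller addresses.

```python
import re


def _host_regex(host):
    # Host values use SQL LIKE-style wildcards: '%' = any run, '_' = one char.
    parts = []
    for ch in host:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z")


def _specificity(host):
    # Lower is more specific: literal host, then other patterns, then '%'.
    if host == "%":
        return 2
    if "%" in host or "_" in host:
        return 1
    return 0


def matching_account(accounts, user, client_host):
    """Return the (user, host) row a client would authenticate as, or None.

    Simplified model: among rows whose user matches exactly and whose host
    pattern matches the client, the most specific host wins.
    """
    candidates = [
        (u, h) for (u, h) in accounts
        if u == user and _host_regex(h).match(client_host)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda row: _specificity(row[1]))


if __name__ == "__main__":
    # Hypothetical accounts; 192.0.2.0/24 and 198.51.100.0/24 are documentation ranges.
    accounts = [("nova", "%"), ("nova", "192.0.2.10"), ("nova", "192.0.2.11")]
    print(matching_account(accounts, "nova", "192.0.2.10"))    # ('nova', '192.0.2.10')
    print(matching_account(accounts, "nova", "198.51.100.5"))  # ('nova', '%')

    without_wildcard = [row for row in accounts if row[1] != "%"]
    print(matching_account(without_wildcard, "nova", "192.0.2.11"))    # ('nova', '192.0.2.11')
    print(matching_account(without_wildcard, "nova", "198.51.100.5"))  # None
```

Removing the `'%'` row leaves each controller's host-specific row in place. Only clients without their own host-specific account lose access, and that is what the end-to-end test in the conversation was checking.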
[15:37:33] DBA, cloud-services-team (Kanban): Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 (Marostegui) Open→Resolved Grants for `nova_cell0_eqiad1` added to the `dump` users. Closing this task, please re-open if needed. Thanks everyone!
[15:44:59] I am going to try to recover Archive backups into dbprov1001
[15:45:57] is it like it used to be? as in: you execute the recover and then keep your fingers crossed that bacula doesn't wait too long to pick up that job?
[15:46:13] it is the manual process
[15:46:30] it is the same as before, but in theory I have now "fixed it" after the data migration
[15:46:33] we'll see
[15:46:39] ah ok :)
[15:46:40] let me know
[15:46:51] FYI, in case I "break" dbprov
[15:47:04] sure :)
[16:17:51] Building directory tree for JobId(s) 23606 ...
[16:18:21] JobFiles 5,408,668
[16:18:53] we'll see how long it takes
[21:09:25] DBA, MediaWiki-General, Operations: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (daniel)