[04:16:50] order of db backups is a bit meh
[04:16:58] morning! :)
[04:19:14] I found a difference on page between db1068 and db1081
[04:19:36] :/
[04:19:40] big one?
[04:19:45] like many rows?
[04:19:50] but then I checked several times and it was a supuriess false positive
[04:19:56] pheeeew
[04:19:58] *spurious
[04:20:14] which we know is possible, but first time it happened to me
[04:20:25] I have had many false positives in the past
[04:20:29] with very concurrent pages
[04:20:31] *tables
[04:20:39] yeah
[04:21:33] I finished the logic for the replica move
[04:21:41] oooh sweet!
[04:21:56] but it is like 3 pages of code I have not tested even once
[04:22:21] I will do first with local dbs, so not deploying at all
[04:22:28] sounds sane yeah haha
[04:22:56] just remember to rebase on your local one, or use my local repo
[04:23:44] yeah, I am going to rebase mine
[04:23:58] if you do it on your local one
[04:24:17] remember you need to stash and pop a change I did
[04:24:21] for the namespace
[04:24:45] or CuminExecution won't work
[04:24:47] or I can not rebase and use the old one?
[04:25:30] up to you, the one I tested was the one on my home
[04:25:56] ok, I will check later, going to prepare all the pre-stuff
[05:14:50] DBA, Operations: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) This happened successfully. Read only times (UTC): Start: 05:01:02 Stop: 05:03:20 Total read only time: 2:18 minutes
[05:15:07] DBA, Operations: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui)
[05:23:20] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui)
[05:23:22] DBA, Operations: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) Open→Resolved So far everything looks good, so closing this.
[05:24:16] So I wonder if there has been an impact on how deployments work with the new php7 workflow?
[05:24:33] we can ask _joe_
[05:24:43] and that means less stale code -> less errors
[05:24:47] this is the first failover we do since php7 has been deployed, no?
[05:25:10] well, php7 in theory is not a majority yet
[05:25:18] yeah I know
[05:25:36] but maybe the deployment procedure changed a bit
[05:26:10] did you depool the master in advance?
[05:26:16] maybe that also helped
[05:26:20] yeah
[05:26:38] 04:19 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Depool db1081 T224852 (duration: 00m 57s)
[05:26:38] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:26:41] so 40 minutes in advance
[05:27:07] so that or the lower load that commons indeed has at the moment
[05:28:27] it is doing 23K OPS
[05:28:42] while at peak it does double that
[05:29:22] great work manuel, I think it cannot get better than that
[05:29:33] once we have dbctl we will!
[05:29:47] I am thinking about rebuilding the candidate master for s4 with db1068's (old master) data
[05:29:49] what do you think?
[05:29:53] I said better, not faster due to deployment procedure!
[05:30:08] which is the new candidate?
[05:30:17] (sounds ok to me)
[05:30:30] the new candidate was rebuilt out from the candidate master which is the master now
[05:30:45] so db1138 was rebuilt out of db1081 (current master)
[05:31:01] 138, so I guess new/recent?
[05:31:10] db1138 is new yep
[05:31:16] cool
[05:31:43] was db1068 still showing memory issues, or only now and then?
[05:31:46] yeah
[05:32:39] I am worried mostly now about replication lag, especially after day ramp up
[05:33:08] I will decomm db1068 on monday I think, so it will be a good test to see how it can replicate and if it is able to do so
[05:33:22] yeah, no rush
[05:33:59] lag especially for codfw and labs
[05:34:12] codfw is now running only big hosts
[05:34:23] yeah, but still
[05:34:37] we no longer have a "break"
[05:34:43] hehe yeah
[05:35:03] we can play around with semisync and db1068
[05:36:01] yeah, we will see at peak
[05:37:02] I am going to upgrade it to 10.1.39
[05:37:08] ok
[05:37:22] I uploaded to https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/517794/1/wmfmariadbpy/WMFReplication.py what I did yesterday
[05:37:34] oh! I will take a look :)
[05:37:48] once tested, it should be able to 100% replace repl
[05:37:52] repl.pl
[05:38:00] our lovely repl.pl!
[05:38:55] I am not going to reboot db1068 for now, I am afraid of it not coming back
[05:40:36] he he
[05:41:08] root@db1068:~# uptime
[05:41:09] 05:41:04 up 274 days
[05:41:15] that plus the memory errors....I won't tempt fate :)
[05:41:44] the other thing that etcd will give us is a depool+upgrade+restart mostly automatic procedure
[05:41:54] yeah can't wait for that
[05:56:49] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui) The failover was done, so we can probably keep compressing tables. @jcrespo let me know if you would like to handle this yourself or you want me to take over so you can focus on backups :)
[06:28:00] <_joe_> (I wasn't around and still ain't in full)
[06:38:53] DBA, Patch-For-Review: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1112.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906190638_marostegui_2...
[06:57:24] DBA, Patch-For-Review: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1112.eqiad.wmnet'] ` and were **ALL** successful.
[07:54:42] DBA, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (Marostegui) So this is the change I will push the 25th of June to change the last key: https://g...
[09:06:17] The MariaDB Foundation is pleased to announce the availability
[09:06:17] of MariaDB 10.4.6, the first stable release in the MariaDB
[09:06:17] 10.4 series
[09:10:17] "stable"
[09:10:54] haha yeah
[09:11:46] "Crash safe Aria-based system tables"
[09:20:09] DBA: Replace db1077 with db1112 - https://phabricator.wikimedia.org/T225981 (Marostegui) db1112 is now cloned from db1077. I am going to let it replicate for 24h before changing sanitarium to replicate from it and to pool it in s3.
[10:22:26] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui)
[10:23:13] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Marostegui) Stalled→Open
[10:23:58] DBA, Operations, Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (Marostegui) This host is no longer a master and will be decommissioned in a few days
[10:25:54] jynus: I am going to upgrade db2098 today. I have checked and both snapshots have already happened (s2 and s3) - I need it to run 10.1.39 so I can failover s2 codfw master in the next days
[10:26:18] ok
[10:27:36] apparently not a single failure on the last run
[10:28:19] any significant change that can lead to that? apart from the memory change?
[10:29:06] also try to purge db1112 before the Thu-Fri run or it may run into space issues
[10:29:19] (dbprov1001)
[10:29:32] I will do it tomorrow
[10:30:07] as I am leaving db1112 to replicate for 24h before moving sanitarium under it
[10:30:46] yes, I expected that
[10:31:07] there is some optimization to be done on backup order to prevent that
[12:47:58] DBA, MediaWiki-Database, MediaWiki-extensions-OATHAuth: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 (Marostegui) As per my chat with @Reedy the code is merged and he's done some testing and it looks good, so I will try to get this schema change done during this week.
[13:30:40] DBA, MediaWiki-Database, MediaWiki-extensions-OATHAuth: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 (Marostegui)
[13:31:21] DBA, MediaWiki-Database, MediaWiki-extensions-OATHAuth: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 (Marostegui) All the private wikis have been altered: ` advisorswiki `module` varbinary(255) NOT NULL, `data` blob, arbcom_cswiki `module` varbinary(255) NOT NULL,...
[13:32:30] DBA, MediaWiki-Database, MediaWiki-extensions-OATHAuth: Schema change to oathauth_users - https://phabricator.wikimedia.org/T225643 (jcrespo)
[14:26:56] DBA: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (jbond)
[16:54:02] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems. - https://phabricator.wikimedia.org/T225704 (Cmjohnson) a:Cmjohnson→RobH I updated the switch config to private1-d.....both servers are currently off and ready for installs. assigning to @robh to install
[16:54:19] DBA, Operations, ops-eqiad: eqiad: rack/setup/install (4) dbproxy systems.
- https://phabricator.wikimedia.org/T225704 (Cmjohnson)
[16:54:22] DBA, MediaWiki-Database, Core Platform Team (Multi-DC (TEC1)), Core Platform Team Backlog (Watching / External), and 5 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (aaron)