[05:05:57] 10DBA, 10Operations, 10ops-codfw: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui)
[05:06:11] 10DBA, 10Operations, 10ops-codfw: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) p:05Triage→03Normal
[05:14:42] 10DBA, 10OTRS, 10Operations, 10Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) >>! In T226952#5316025, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.o...
[05:17:57] 10DBA, 10Operations, 10ops-codfw: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) As per my chat with @Papaul I rebooted the host a second time and the previous error didn't show up.
[05:21:33] 10DBA, 10OTRS, 10Operations, 10Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10jcrespo) ` $ ./replication_tree.py db1065 db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991 +...
[05:23:30] 10DBA, 10Operations, 10ops-codfw: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) @Papaul and myself chatted about this and the plan is to: - Clear logs (I just did) - Upgrade firmware, BIOS, etc. - Leave this task open for a week to see if it happens again and if not c...
[05:25:13] jynus: can you review? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519975/
[05:27:17] did you compare grants between hosts?
[05:27:22] yep
[05:27:29] the topology changes are failing
[05:27:36] I ran this:
[05:27:38] failing?
[05:27:42] ./switchover.py --only-slave-move db1065.eqiad.wmnet db1132.eqiad.wmnet
[05:28:04] https://phabricator.wikimedia.org/P8728
[05:30:00] by looking at the logs, it looks like the move was actually attempted
[05:31:05] [Warning] InnoDB: Difficult to find free blocks in the buffer pool (21 search iterations)!
[05:31:11] [Note] InnoDB: Consider increasing the buffer pool size.
[05:31:37] but that's earlier, right?
[05:31:48] but worrying
[05:31:57] replication stopped in log 'db1065-bin.000251' at position 436827766
[05:31:59] yeah
[05:32:12] Slave SQL thread exiting, replication stopped in log 'db1065-bin.000251' at position 436828506
[05:32:34] gtid was disabled correctly
[05:33:51] so it complained because they didn't stop at the same time, the other one is at 436827145
[05:34:15] interesting
[05:34:24] maybe I need to decrease the timeout then?
[05:34:43] what was the timeout?
[05:34:47] default
[05:34:59] [Warning] Timeout waiting for reply of binlog (file: db1132-bin.000022, pos: 415377884)
[05:35:09] Slave SQL thread exiting, replication stopped in log 'db1065-bin.000251' at position 436827145
[05:35:22] let's try with a lower timeout then
[05:35:34] they need to start again
[05:35:37] and be in sync
[05:35:39] I just did
[05:36:30] so nothing was lost, they just stopped, and when it detected they were not on the same coordinates
[05:36:43] it exited
[05:36:43] which is good
[05:36:54] trying again
[05:37:09] success now
[05:37:28] topology looking good
[05:37:33] maybe lag is more likely on m2?
[05:37:42] 5 seconds is too small a default?
[05:37:54] too much of a default, you mean?
[05:37:58] too much timeout, I mean
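The check that tripped here, stopping sibling replicas at identical binlog coordinates before moving one under the other, could be sketched roughly as below. This is an illustration only, not the actual switchover.py code: it assumes pymysql connections with sufficient replication grants on both replicas, and uses START SLAVE UNTIL to bring the lagging one up to the leader's position.

```python
# Illustrative sketch only; not the actual wmfmariadbpy implementation.
import time
import pymysql

def exec_coords(conn):
    """(Relay_Master_Log_File, Exec_Master_Log_Pos): last event applied by the SQL thread."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
    return row["Relay_Master_Log_File"], row["Exec_Master_Log_Pos"]

def stop_at_same_coords(conn_a, conn_b, timeout=5.0):
    """STOP SLAVE on both replicas; return True once both sit on the same coordinates."""
    for conn in (conn_a, conn_b):
        with conn.cursor() as cur:
            cur.execute("STOP SLAVE")
    a, b = exec_coords(conn_a), exec_coords(conn_b)
    if a == b:
        return True
    # Restart the lagging replica only up to the leader's position, then poll.
    target = max(a, b)  # (file, pos) tuples compare correctly within one master's binlogs
    lagger = conn_a if a < b else conn_b
    with lagger.cursor() as cur:
        cur.execute("START SLAVE UNTIL MASTER_LOG_FILE = %s, MASTER_LOG_POS = %s", target)
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Under load (e.g. a dump running on the host) this is where a
        # 5-second default can prove too short, as happened above.
        if exec_coords(lagger) == target:
            return True
        time.sleep(0.2)
    return False  # on failure the caller restarts replication and aborts, as switchover.py did
```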
[05:38:16] I wonder what would be the case with core slaves
[05:38:33] or maybe with larger transactions
[05:38:43] it is more complicated
[05:38:54] yeah, with OTRS I wouldn't be surprised
[05:40:53] I can definitely restart replication if it fails to stop properly
[05:41:03] and exit
[05:41:54] yeah
[05:41:59] that's good
[05:42:02] one thing I've seen
[05:42:16] is that there is more traffic on db1117 since 3am
[05:42:38] maybe a backup is running, contributing to more load, or something
[05:42:42] yeah
[05:42:44] it is dumps
[05:42:47] I can see it now
[05:42:54] it is still running
[05:44:30] so more like it aborted when it saw a strange status
[05:45:22] so maybe during the dump it has lower performance
[05:45:28] leading to the issue
[05:45:53] could be, the good thing is that it acted well, as in: "I cannot do this, I stop"
[05:46:43] but as far as I can see, it just stopped and disabled gtid
[05:46:55] yeah, that's what I saw too
[05:46:57] but it didn't execute change master or anything strange
[05:47:00] per the logs in db1117
[05:47:33] the change master did happen
[05:47:36] or that's what the logs say
[05:47:49] change master, but for gtid
[05:47:54] but only for gtid
[05:47:54] yeah
[05:48:02] not to another host
[05:48:07] yep
[05:48:08] that's good
[05:48:20] the change master for gtid is only for master_use_gtid
[05:48:34] yes
[05:48:51] it could be that the stop was disruptive
[05:49:02] and led to it lagging
[05:49:08] and it may need a sleep
[05:49:08] maybe the stop took longer than expected
[05:49:22] sometimes stop/start take some hit on lag, it is not immediate
[05:49:30] that is why the timeout exists
[05:49:39] maybe I can add a sleep after change master
[05:56:05] yeah, it disables gtid on the replicas just before stopping them again for the topology change
[05:56:19] that may affect lag
[05:56:29] ah maybe yeah, if it takes longer to disable
[05:56:34] then you've got some seconds of lag indeed
[05:56:41] it is literally:
[05:56:44] replication.set_gtid_mode('no')
[05:56:50] result = replication.move(new_master=slave_replication.connection, start_if_stopped=True)
[05:57:04] we may need some pause there
[05:57:10] yeah
[05:57:20] on very loaded hosts it can indeed generate lag
[05:57:34] which also happened to be under quite some load
[05:57:42] so that would explain it I think
[05:57:45] yep
[05:57:51] either add more timeout
[05:57:54] nice race condition we caught!
[05:57:56] or an extra sleep there
[06:08:59] 10DBA, 10OTRS, 10Operations, 10Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) This was done successfully. Read only start: 06:00:31 UTC 2019 Read only stop (and proxies reloaded): 06:00:40 UTC 2019 Total read o...
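The pause floated above could look roughly like this: instead of a fixed sleep between set_gtid_mode('no') and the move, poll until the lag caused by restarting the SQL thread has drained. Only the two lines quoted at 05:56 come from the real script; wait_for_lag is a hypothetical helper (not part of wmfmariadbpy), and it assumes the replication object exposes a pymysql connection.

```python
# Hypothetical sketch; wait_for_lag is NOT an existing wmfmariadbpy helper.
import time
import pymysql

def wait_for_lag(conn, max_lag=0, timeout=30.0):
    """Poll SHOW SLAVE STATUS until Seconds_Behind_Master <= max_lag, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            lag = cur.fetchone()["Seconds_Behind_Master"]
        if lag is not None and lag <= max_lag:
            return True
        time.sleep(1)
    return False

replication.set_gtid_mode('no')                # from the real script (quoted above)
if not wait_for_lag(replication.connection):   # the proposed pause, made adaptive
    raise RuntimeError('replica still lagging after disabling gtid, aborting')
result = replication.move(new_master=slave_replication.connection,
                          start_if_stopped=True)
```

An adaptive wait like this sidesteps the trade-off discussed above (a longer timeout versus an extra sleep): on an idle host it costs nothing, while on a host under dump load it waits exactly as long as the lag persists.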
[07:30:17] 10DBA, 10OTRS, 10Operations, 10Operations-Software-Development, 10Recommendation-API: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) 05Open→03Resolved a:03Marostegui
[07:30:18] 10DBA, 10Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (10Marostegui)
[07:30:22] 10DBA, 10Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui)
[07:54:52] I have added a misc switchover example using the new scripts: https://wikitech.wikimedia.org/w/index.php?title=MariaDB&type=revision&diff=1831845&oldid=1831559
[07:55:00] Will also update the production one once s8 is done
[07:55:33] there is also https://wikitech.wikimedia.org/wiki/MariaDB#Manipulating_the_Replication_Tree
[07:55:53] yep, that needs an update indeed
[07:58:42] with the +1 here https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/521226 are you implying we should merge but reevaluate later? or not merge?
[07:58:50] 10DBA, 10Data-Services: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (10Marostegui)
[07:59:02] to merge for now
[07:59:08] ok
[07:59:09] and then discuss at some point if we want to have a --force
[07:59:12] or --check-only
[07:59:13] or something
[07:59:23] to bypass that check (see the sketch at the end of this log)
[08:02:16] 10DBA, 10DC-Ops, 10decommission: decommission db1065 - https://phabricator.wikimedia.org/T227560 (10Marostegui)
[08:03:49] 10DBA, 10DC-Ops, 10decommission: decommission db1065 - https://phabricator.wikimedia.org/T227560 (10Marostegui) Let's wait a few days before actually starting to decommission it. I have disabled notifications, though.
[08:05:00] 10DBA, 10DC-Ops, 10decommission: decommission db1065 - https://phabricator.wikimedia.org/T227560 (10Marostegui)
[08:46:23] 10DBA, 10Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui)
[08:47:37] 10DBA: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (10Marostegui)
[08:48:35] 10DBA: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (10Marostegui)
[10:04:36] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[10:05:26] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[18:24:48] 10DBA, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata.org, 10Story: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (10Addshore) >>! In T68025#5314328, @ArielGlenn wrote: > Are there visible graphs for these? For which in particular? If you mean dai...
[18:40:55] 10DBA, 10MediaWiki-General-or-Unknown, 10Core Platform Team Workboards (Clinic Duty Team): Investigate query planning in MariaDB 10 - https://phabricator.wikimedia.org/T85000 (10Anomie) I found that the fifth possibility in T85000#936374 seems to do the right thing on 10.1.39. ` lang=sql wikiadmin@10.64.0.92...
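The --force / --check-only idea from 07:59 might be wired up along these lines. Everything here is hypothetical: neither flag existed in wmfmariadbpy at this point, and the flag names and preflight checks are placeholders for whatever the script actually validates.

```python
# Hypothetical sketch of the --check-only / --force idea; neither flag
# existed in wmfmariadbpy at the time, and the checks are placeholders.
import argparse
import sys

def preflight_checks(master, new_master):
    """Return a list of problems found; an empty list means it is safe to proceed."""
    problems = []
    # e.g. compare grants between hosts, verify replication is running,
    # verify the candidate master is read-only, ...
    return problems

def main():
    parser = argparse.ArgumentParser(description='switchover sketch')
    parser.add_argument('master')
    parser.add_argument('new_master')
    parser.add_argument('--check-only', action='store_true',
                        help='run the preflight checks and exit without switching')
    parser.add_argument('--force', action='store_true',
                        help='switch over even if a preflight check fails')
    args = parser.parse_args()

    problems = preflight_checks(args.master, args.new_master)
    for problem in problems:
        print('PREFLIGHT:', problem, file=sys.stderr)
    if args.check_only:
        sys.exit(1 if problems else 0)
    if problems and not args.force:
        sys.exit('aborting; rerun with --force to bypass the failed checks')
    # ... the actual switchover would happen here ...

if __name__ == '__main__':
    main()
```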