[05:21:58] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4209355 (10Marostegui) a:05jcrespo>03Papaul It is indeed on predictive failure: ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure) ``` @Papaul can we replace it agai...
[05:49:02] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209381 (10Marostegui) eqiad is now ready with all the data on multi-instance, so as soon as the final HW arrives we can just s...
[05:49:41] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4209385 (10Marostegui)
[06:10:25] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209388 (10Marostegui)
[06:11:53] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209391 (10Marostegui)
[06:16:05] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209394 (10Marostegui)
[06:17:21] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209396 (10Marostegui)
[06:20:12] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209398 (10Marostegui)
[06:34:33] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209415 (10Marostegui)
[07:26:09] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209436 (10Marostegui)
[08:01:02] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209481 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db207...
[08:01:10] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209482 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2075.codfw.wmnet'] ``` Of which those **FAILED**: ```...
[08:01:37] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209483 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db207...
[08:21:56] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4209492 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2075.codfw.wmnet'] ``` and were **ALL** successful.
[08:31:55] <_joe_> hey, what's the bug regarding the issues with the mediawiki db load balancer?
[08:32:10] <_joe_> specifically the one about a malfunctioning slave causing an outage
[08:32:20] I don't think much has been done
[08:32:23] <_joe_> I want to try to bug some mw devs in barcelona
[08:32:57] search SPOF on phabricator
[08:33:17] https://phabricator.wikimedia.org/T180918
[08:33:38] I no longer think it is worth doing something on the mediawiki side
[08:36:01] use a model similar to github's, with HAProxy https://githubengineering.com/context-aware-mysql-pools-via-haproxy/
[08:38:05] then set up our own health/monitoring system
[08:38:18] as every time something gets fixed, 2 more problems appear
[08:59:20] <_joe_> yeah, but while we wait for something like that to be implemented
[08:59:31] <_joe_> it would be good to at least partially fix this
[09:00:08] <_joe_> tangentially, I think this is the kind of architectural change that would benefit from an RfC discussion
[09:01:45] I disagree
[09:02:07] mediawiki devels, or the lack of them, brought us to the current state
[09:02:25] we should fix it at the infrastructure level
[09:06:20] https://phabricator.wikimedia.org/T156475 has been open for a year, meanwhile I am the one receiving the pages and attending outages
[09:06:43] the problem is not the code, it is the architecture
[09:25:33] about to merge https://gerrit.wikimedia.org/r/433015
[09:26:00] +1
[09:29:34] I will give it a few minutes, then kill all sleeping connections
[09:30:16] sounds good
[09:42:34] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4209615 (10Marostegui)
[09:42:55] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4209616 (10Marostegui)
[09:42:58] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4209617 (10Marostegui)
[10:25:44] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4209704 (10Marostegui)
[10:26:01] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4209707 (10Marostegui)
[10:26:07] 10DBA: Drop unused tables: msg_resource msg_resource_links - https://phabricator.wikimedia.org/T194663#4204952 (10Marostegui) 05Open>03Resolved Dropped everywhere
[10:27:11] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3500062 (10Marostegui)
[10:57:20] something very strange has happened on db1120 (temporary sanitarium for s2, s4, s6 and s7): after restarting the server to clone the data to codfw, all the threads have gotten duplicate entry errors
[10:58:34] The funny thing is that the codfw host it was transferred to didn't suffer that
[10:59:09] that host was built from a logical backup from db1102 (using mydumper)
[10:59:23] did you also dump the mysql database?
[10:59:46] yes, I dumped all databases
[10:59:51] there you have the issue
[11:00:29] position/log records on the gtid table?
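(Context for the "gtid table" being asked about: on MariaDB this is mysql.gtid_slave_pos, which mydumper includes if the mysql database is dumped. A minimal sketch of how that restored state can be inspected on the new host; illustrative only, not a record of what was run on db1120.)

```sql
-- Illustrative only (MariaDB). If the mysql database is part of a mydumper backup,
-- the replication positions in mysql.gtid_slave_pos travel with the dump and get
-- restored on the destination host along with the real data.
SELECT domain_id, server_id, seq_no FROM mysql.gtid_slave_pos;  -- positions carried over
SHOW GLOBAL VARIABLES LIKE 'gtid_slave_pos';                    -- what the server would resume from
```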
[11:00:35] yeah
[11:00:46] I truncated it though before doing anything
[11:00:53] But there could be some other leftovers, I guess
[11:01:10] also if you used gtid to auto-position
[11:01:17] that will mess it up
[11:01:21] I will rebuild it to be 100% sure it is trustworthy
[11:02:02] you have to reset slave all and start it without gtid
[11:02:12] I did that
[11:02:18] As well as truncating the gtid table
[11:02:22] That's why I am surprised
[11:02:53] but the relay files get loaded on start
[11:03:19] that is what makes it crash when labsdb gets copied
[11:03:29] well, the info/etc metadata
[11:03:43] I did the reset slave all too before doing anything else
[11:03:52] And it is multi-instance, so…
[11:04:04] ?
[11:04:06] I am going to rebuild it again, it doesn't take long
[11:04:22] reset slave all only works for the current connection, did you do it for each one?
[11:04:29] yes
[11:04:43] so I think it is the gtid weirdness
[11:04:51] yeah, I will leave aside the mysql database
[11:04:53] the same thing that made labsdb break
[11:05:01] so no chance of the gtid table messing things up this time
[11:05:04] I will report back :)
[11:05:35] let's see if a restart also crashes the codfw host that was copied from db1120
[11:06:18] it doesn't :|
[11:06:21] it was copied from db1120
[11:06:25] very weird
[11:07:47] gtid + binlog/relay/info name
[11:13:15] it may actually happen again if you copy it back and don't remove the leftover files
[11:13:27] yeah, I have removed everything
[11:13:33] I want to do it from 0, I don't want risks
[11:13:51] I am leaving aside the mysql database this time, just to be 100% sure
[11:19:37] "mariadb, you have failed me for the last time" https://youtu.be/aV2DLkDPwM8?t=49s
[11:20:00] hahahahahaha
[11:20:03] hahaha
[11:21:17] too many "ha"s, be careful not to choke on your aspirations https://youtu.be/JIeftsgca5U?t=3m10s
[11:22:45] after lunch, I may be reimaging the s1 and s6 masters from codfw
[11:23:46] note also those hosts may be registering server-side triggers with a different id
[11:23:55] which may create confusion
[11:26:40] sure, check this when you can (no rush): https://gerrit.wikimedia.org/r/#/c/433343/
[11:26:43] I am going for lunch too
[13:31:33] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#4210107 (10jcrespo)
[13:31:37] 10DBA, 10Packaging, 10Patch-For-Review: Bug on mariadb systemd unit on stretch for multi-instance hosts - https://phabricator.wikimedia.org/T194516#4210105 (10jcrespo) 05Open>03Resolved This is now fixed on the latest package, and all affected (multiinstance) hosts, restarted. Heads up for m-i hosts in c...
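(A hedged sketch of the cleanup discussed in the 11:00-11:13 exchange above, assuming MariaDB and the multi-instance layout mentioned for db1120, i.e. one mysqld per section. The statements and the per-instance repetition are drawn from the conversation, not an exact record of what was run.)

```sql
-- Hedged sketch (MariaDB). On a multi-instance host every mysqld has its own
-- replication connection, so this has to be run against each instance in turn
-- (s2, s4, s6, s7 in the db1120 case); skipping the mysql database on the
-- subsequent re-dump, as discussed above, avoids restoring stale GTID state.
STOP SLAVE;
RESET SLAVE ALL;                  -- drops master.info / relay-log.info for this instance
SET GLOBAL gtid_slave_pos = '';   -- clear anything restored into mysql.gtid_slave_pos
-- Relay/binlog/info files still present on disk are read again at startup,
-- which is why they also need to be removed before copying data between hosts.
```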