[06:17:39] 10DBA: Duplicate key on s5 codfw hosts - https://phabricator.wikimedia.org/T277632 (10Marostegui)
[06:17:49] 10DBA: Duplicate key on s5 codfw hosts - https://phabricator.wikimedia.org/T277632 (10Marostegui) p:05Triage→03High
[06:28:01] PROBLEM - MariaDB sustained replica lag on db2137 is CRITICAL: 1.674e+04 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[06:37:51] RECOVERY - MariaDB sustained replica lag on db2137 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[06:42:30] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[06:47:54] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui)
[06:48:03] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui) db2150 is now pooled in s7
[06:55:02] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) Ongoing transfer from db1082 to db1161
[08:16:28] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1161 is now replicating
[08:59:19] 10DBA, 10decommission-hardware: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (10Marostegui) Depooled
[09:27:17] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) So to sum up for these test databases: Section: m5 Name: `testmailman3` and `testmailman3web`
Approximate time frame before deleting them: 2-3 months I would need the users an...
[09:32:11] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10jcrespo) @Ladsgroup I assume no backups needed as this is a test?
[09:51:57] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Ladsgroup) >>! In T256538#6920684, @jcrespo wrote: > @Ladsgroup I assume no backups needed as this is a test? correct
[09:53:45] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Ladsgroup) >>! In T256538#6920675, @Marostegui wrote: > > I would need the users and the desired grants. user `testmailman3` having all rights on `testmailman3` database user `testmailman3...
[10:11:48] jynus: yo. both of the backup sources for s5/codfw need to be rebuilt (T277632). is there something that needs to be done on the backup side to allow for this?
[10:12:01] https://phabricator.wikimedia.org/T277632
[10:13:02] what happened?
[10:13:02] (i'm thinking of doing the backup source hosts first, as the other 2 nodes aren't used for anything right now)
[10:13:09] jynus: see the ticket?
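For orientation, the setup requested in the task above (databases `testmailman3` and `testmailman3web` on m5, with a `testmailman3` user holding all rights on the `testmailman3` database) would look roughly like the following. This is a hedged sketch based only on the names visible in the log; the host pattern, password, and the grants on the second database are placeholders, not the real production grants.

```sql
-- Sketch of the requested m5 test setup; host pattern and password are
-- illustrative placeholders, not the actual production values.
CREATE DATABASE testmailman3;
CREATE DATABASE testmailman3web;
CREATE USER 'testmailman3'@'10.64.0.0/255.255.0.0' IDENTIFIED BY '<redacted>';
GRANT ALL PRIVILEGES ON testmailman3.* TO 'testmailman3'@'10.64.0.0/255.255.0.0';
```

As the later messages note, the real grants were deferred until the VM's IP was known, since m5 does not yet go through the proxy.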
[10:13:23] yes, but the ticket didn't say why
[10:13:26] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) a:03Marostegui
[10:13:35] Amir1: ^ changed the task title to include "test", so it is clearer ^
[10:13:48] jynus: i can only assume from the contents of the ticket that there was corruption
[10:14:17] that happened before on the same table
[10:14:45] jynus: My guess is that at some point one of those hosts got corrupted and others were recloned from that original one
[10:16:55] but this is more problematic- it means all backups we have are useless
[10:18:23] we can just rebuild the backup source from one of the hosts that didn't show any issues, and once that is done, we can rebuild the rest
[10:19:16] ok
[10:19:41] ok, so it is only codfw
[10:19:45] after all this only broke 4 hosts out of all the hosts in eqiad and codfw, so I am assuming those are the bad ones
[10:19:49] the backups on eqiad should work
[10:20:23] but I think we had a similar error on that table not long ago
[10:20:43] but if we use eqiad backups then we need to cross-replicate from codfw to catch up and then move it under codfw
[10:20:49] I don't mind either way, up to kormat :)
[10:23:30] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) @Ladsgroup from which hosts would you be connecting? m5 doesn't use the proxy (yet), so I would need to grant certain IPs instead of the proxy ones.
[10:23:52] marostegui: i don't follow. why would we need to cross-replicate?
[10:24:22] kormat: if you use eqiad backups, the coordinates from those logical backups will point to the eqiad master
[10:24:33] oh. uff.
[10:24:43] i hate everything?
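The coordinate problem marostegui describes, and the subsequent topology move, could look roughly like this on each restored codfw replica. This is illustrative only: the master hostnames are hypothetical, the log says WMF has a dedicated tool for the final move, and whether to rely on GTID is exactly what is debated below.

```sql
-- Illustrative MariaDB sketch (hostnames are hypothetical). A logical backup
-- taken in eqiad carries replication coordinates for the eqiad master, so the
-- restored codfw host must first replicate cross-DC to catch up:
CHANGE MASTER TO
  MASTER_HOST = 'db1100.eqiad.wmnet',   -- hypothetical eqiad s5 master
  MASTER_USE_GTID = slave_pos;
START SLAVE;
-- Once caught up and stopped in sync with the codfw master, repoint it under
-- codfw (in practice done with the in-house topology tool mentioned below):
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'db2123.codfw.wmnet',   -- hypothetical codfw s5 master
  MASTER_USE_GTID = slave_pos;
START SLAVE;
```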
[10:25:02] so you'd need to: let them replicate cross dc (that's ok), once they've caught up, stop them in sync with codfw master and move them under the codfw master (we have a tool for that)
[10:25:21] i hate everything.
[10:25:54] not really- we can use gtid
[10:25:59] Happy to help you with all that if you want! Another option is to reclone those 4 hosts from a host in codfw that didn't get the issue
[10:26:01] but I wasn't proposing that
[10:26:18] I meant that we needed to review previous backups and delete them
[10:26:26] jynus: if you trust gtid, then sure ;)
[10:26:39] marostegui: what could possibly go wrong?
[10:27:09] also I was planning on moving s5 to ease the 10.4 upgrade
[10:27:20] so let's set up a plan on the ticket
[10:28:05] let's not complicate this with the upgrade stuff
[10:29:15] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) Databases created on db1128.eqiad.wmnet (m5 master): ` # host m5-master.eqiad.wmnet m5-master.eqiad.wmnet is an alias for db1128.eqiad.wmnet. db1128.eqiad.wmnet has address 1...
[10:36:59] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Ladsgroup) Thanks. It should be the IP of the VM but that's not created yet (T276686) we were waiting for the databases to be created first (basically a chicken and egg problem)
[10:37:35] 10DBA, 10SRE, 10Wikimedia-Mailing-lists: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) Databases are now created, once I get the IPs I will create the users :)
[10:49:58] I don't want to make it more complicated- but source hosts have more setup steps- compression, extra grants
[10:50:47] I can take care of them myself if that is too complicated
[10:52:18] and reorganize them at the same time
[10:53:05] jynus: is this documented somewhere?
[10:53:20] no
[10:53:43] ok.
then i'd be glad to let you handle the source hosts :)
[10:53:55] which host is "sane"?
[10:54:17] we can create a backup, and then use it to populate the others, each at our own pace
[10:55:04] needs to be 10.1
[10:55:06] e.g. you can do db2089:3315 and db2137:3315
[10:55:11] and I can do the sources
[10:55:28] and that saves me some time for the reorganization
[10:55:58] how about db2113? candidate master
[10:56:00] it's on 10.1
[10:56:29] so do I set up a backup there towards dbprov200X ?
[10:56:43] SGTM
[10:57:14] ok, then doing
[10:57:18] 👍
[10:57:20] will update ticket
[10:57:34] and ping you when done, then we can work in parallel for each pair of servers
[10:57:43] great, thanks
[11:04:01] 10DBA, 10Orchestrator: Investigate a way to make the anonymized version of Orchestrator open to replace dbtree - https://phabricator.wikimedia.org/T273863 (10LSobanski)
[11:07:27] 10DBA, 10Orchestrator: Investigate a way to make the anonymized version of Orchestrator open to replace dbtree - https://phabricator.wikimedia.org/T273863 (10LSobanski)
[11:08:09] [11:07:43]: INFO - Running XtraBackup at db2113.codfw.wmnet:3306 and sending it to dbprov2001.codfw.wmnet
[11:08:23] this will take a bit more time to start up but save a lot of time later
[11:08:43] let me check the expected time to finish
[11:08:45] 10DBA, 10Orchestrator: Investigate a way to make the anonymized version of Orchestrator open to replace dbtree - https://phabricator.wikimedia.org/T273863 (10LSobanski) Initial idea mentioned using the anonymized view but the conclusion was that's not what we need. Leaving this as a comment here so that it's n...
[11:12:03] it will take approximately 1h30m
[11:14:57] marostegui: thanks!
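The "Running XtraBackup at db2113 ... sending it to dbprov2001" line above comes from WMF's own backup wrapper. A bare-bones manual equivalent with stock Percona tools might look like the sketch below; the credentials, destination path, and use of ssh are assumptions for illustration, not the wrapper's actual invocation, and this obviously only runs against real infrastructure.

```shell
# Hypothetical manual equivalent of the wrapper's step: stream a physical
# backup from db2113 to dbprov2001 using xbstream. User, password, and the
# destination directory are placeholders.
xtrabackup --backup --slave-info --stream=xbstream \
  --host=db2113.codfw.wmnet --port=3306 \
  --user=backup --password='<redacted>' \
  | ssh dbprov2001.codfw.wmnet \
      'xbstream -x -C /srv/backups/snapshots/ongoing/'
```

Streaming directly to the provisioning host avoids a local staging copy, which matches the "save a lot of time later" comment: the backup lands where restores are served from.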
[11:45:00] this seems very relevant: T263842
[11:45:01] T263842: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842
[11:45:27] and T264701
[11:45:27] T264701: Re-evaluate the use of INSERT IGNORE on ipblocks - https://phabricator.wikimedia.org/T264701
[11:46:47] that table was checked months ago, so this will likely keep happening
[11:50:08] so maybe we need to raise the priority of T264701 (which was pretty much ignored)
[11:51:02] so do you think it points more to that or to a (compression?) related corruption?
[11:51:25] I am sure that the table, at that time, was the same
[11:51:36] I think it is something from code
[11:51:46] But it is hard to prove, as it would have affected all the hosts
[11:51:52] But having INSERT IGNORE isn't really nice anyways
[11:52:41] yeah, it fits codfw because it uses statement-based replication for a longer step, with more window for drifts
[11:53:03] what do you mean?
[11:53:40] so row, in theory, doesn't cause data drifts, so once it has been replicated once, most hosts use row
[11:54:10] but there are 2 steps for statement towards codfw: EqiadM->CodfwM->CodfwR
[11:54:30] so if there is a nondeterministic write, it is twice as likely to cause issues on that path
[11:54:47] aaah I see what you mean
[11:55:16] Yeah, could be
[11:55:21] I will reping on that task
[11:55:23] e.g. a row is locked and that causes INSERT to ignore failures
[11:55:51] but I asked about the other option (physical corruption) because you suggested a logical restore?
[11:56:30] ah, no, that was in another ticket
[11:56:33] in the old one
[11:56:39] I suggested a logical restore to rule out any physical corruption
[11:56:47] oh, no, it was in the latest one
[11:56:56] so you think we still should do it?
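The INSERT IGNORE hazard being discussed can be shown with a contrived example (table name and values invented here, this is not the real ipblocks schema). Under statement-based replication the statement itself is shipped, so if a replica already holds a conflicting row from some earlier drift, primary and replica silently do different things:

```sql
-- Contrived illustration of the INSERT IGNORE drift discussed above.
-- Assume the replica already contains a drifted row (1, 'old') that the
-- primary does not have. Statement-based replication ships the statement:
INSERT IGNORE INTO ipblocks_like (block_id, block_target) VALUES (1, 'new');
-- on the primary: no conflict, inserts (1, 'new')
-- on the replica: duplicate key on block_id=1, silently ignored,
--                 so it keeps (1, 'old') and no error is ever raised
-- With row-based replication the row image would be applied instead, and a
-- pre-existing conflicting row would surface as a replication error rather
-- than silent divergence.
```

This also matches the "twice as likely" remark: with two statement-based hops (EqiadM->CodfwM->CodfwR), any such nondeterminism gets two chances to bite on the codfw path.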
[11:57:55] it wouldn't hurt, but we can simply do a physical clone from any of the hosts that didn't break
[11:58:11] for one reason or another, those didn't break
[11:58:13] so, kormat to decide how to handle the production ones
[11:58:14] so it must be safe
[11:58:25] if it is a code issue, it wouldn't be prevented by a logical restore anyways
[11:58:27] I think I will quickly recover the 10.1 source backup
[11:58:32] with a binary copy
[11:58:39] and restore the 10.4 logically
[11:58:47] as that will take longer
[11:58:54] so we are not without backups for a long time
[11:59:16] sounds good?
[11:59:18] sure
[12:01:58] https://phabricator.wikimedia.org/T264701#6921118
[12:59:22] [12:56:14]: INFO - Backup finished correctly
[13:12:39] 10DBA, 10Analytics, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10Ottomata) > Generate a "daily events snapshot" Agree that this will be needed for sure! It...
[13:17:36] the copy is at /srv/backups/snapshots/ongoing/snapshot.s5.2021-03-17--11-07-42.tar.gz
[13:18:03] I am going to proceed to recover it
[13:18:31] because it is on a 10G network and hosts are on 1G, we can run several recovery processes at the same time, with no interference
[13:18:43] ^ kormat
[13:18:52] 👍
[14:10:47] marostegui: can you unsubscribe me from that protected replication task? I had managed to subscribe to it before it was made private and now I can't see it but Phabricator still sends me useless notifications about it
[14:10:59] "2 notifications about objects which no longer exist or which you can no longer see were discarded." thanks Phab
[14:11:03] Majavah: haha, sure!
[14:11:33] Majavah: done
[14:11:34] thanks!
[14:13:14] phabulous
[14:14:31] I'm not even sure why it didn't spam me with emails instead of just making those in the browser - "Other task activity not listed above occurs."
is set as "email" and I don't see anything more relevant that would cover it
[18:13:33] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH These are ready for you to finish the installs, I did verify that I was able to connect to mgmt on all of them. Use the temp password.
[18:14:21] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Cmjohnson)
[19:21:33] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) So I just checked and confirmed with Chris that when the add server interface script was run for each of these, the skip ipv6 checkbox was checked, but they all seem to ha...
[19:21:57] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) bios, raid, and idrac firmware updated on all hosts for this task.
[19:52:53] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` db1176.eqiad.wmnet ` The log can be found in `/var/l...
[20:06:20] 10Data-Persistence-Backup, 10SRE, 10Goal: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (10jcrespo) a:03jcrespo
[20:06:40] 10Data-Persistence-Backup, 10SRE, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) a:03jcrespo
[20:07:52] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Marostegui) Having IPv6 allocated is fine as long as they don't have a DNS attached to it :-)
[20:15:05] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1176.eqiad.wmnet'] ` and were **ALL** successful.
[20:17:41] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH)
[20:21:59] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['db1177.eqiad.wmnet', 'db1178.eqiad.wmnet', 'db1179.eqiad.wmnet', 'db1180...
[20:58:51] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1177.eqiad.wmnet', 'db1178.eqiad.wmnet', 'db1179.eqiad.wmnet', 'db1180.eqiad.wmnet', 'db1181.eqiad.wmnet', 'db1182.eqi...
[21:31:04] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH)