[00:26:56] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on db2090 is CRITICAL: 9.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[00:28:20] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on db2090 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[01:10:30] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10greg) Adding in the #DBA tag explicitly so it's seen...  >>! In T276968#6899053, @dduvall wrote: >  > I'm not totally sure how to proceed from her...
[05:11:02] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on db1149 is CRITICAL: 21.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[05:16:50] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on db1149 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1149&var-port=9104
[05:21:03] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Marostegui) Those errors aren't a good thing unfortunately, it looks like InnoDB is very corrupted.  Do you still have a logical dump (done via mys...
[05:21:52] <wikibugs>	 10DBA, 10SRE, 10ops-codfw: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Marostegui) Thank you Papaul, that made the server come back to life. Which is good news, as we are now fully aware that T216240 is an issue.
[05:22:05] <wikibugs>	 10DBA, 10SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui)
[05:23:35] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Marostegui) If the db06 slave is still up, why not take a mysqldump from that one?
[05:25:16] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) Table check still on-going and it will take a lot more (possibly more than 24h)
[05:26:46] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Majavah) The mysqldump result is located (at least) on deployment-db07:/srv/backup. db06 is still up, yes.
[05:30:44] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Marostegui) Another option would be to:  - Assume db06 would be the new master. Switch mysql on that host off, copy its datadir to another new hos...
[05:37:13] <wikibugs>	 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) db1175 and db2102 checked, now rebuilding some tables.
[07:16:54] <wikibugs>	 10DBA, 10DC-Ops, 10ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Marostegui)
[07:17:05] <wikibugs>	 10DBA, 10DC-Ops, 10ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Marostegui) p:05Triage→03High
[07:17:40] <wikibugs>	 10DBA, 10DC-Ops, 10ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Marostegui)
[07:17:42] <wikibugs>	 10DBA, 10SRE: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Marostegui)
[07:17:45] <wikibugs>	 10DBA, 10SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui)
[07:45:20] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui) s6 is done, only pending the master switchover.
[07:45:37] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[07:47:23] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[08:11:50] <wikibugs>	 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[08:18:15] <wikibugs>	 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[08:23:11] <wikibugs>	 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[08:24:55] <wikibugs>	 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[08:31:53] <wikibugs>	 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[08:39:44] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[08:39:48] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[08:46:10] <wikibugs>	 10DBA, 10Patch-For-Review: Failover m1 master: db1080 -> db1159 - https://phabricator.wikimedia.org/T276448 (10Marostegui)
[08:56:41] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) I have disabled notifications for this host, as it is likely to be down/under maintenance for some days
[09:07:55] <wikibugs>	 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate the need to generate and maintain zookeeper backups - https://phabricator.wikimedia.org/T274808 (10jcrespo) p:05Triage→03Low I will reuse this ticket as the implementation one, but with low priority for now.
[09:10:58] <wikibugs>	 10Data-Persistence-Backup, 10Analytics-Clusters: Implement production zookeeper backups - https://phabricator.wikimedia.org/T274808 (10jcrespo)
[09:13:48] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10aborrero) FYI: The root cause for the corruption could be a force-reboot force-migration that I had to perform on this host while operating the un...
[09:20:03] <wikibugs>	 10Data-Persistence-Backup, 10Analytics: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10elukey)
[09:21:34] <wikibugs>	 10Data-Persistence-Backup, 10Analytics: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10elukey)
[09:43:59] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui) s7 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1174 [] db1170 [] db1155 [] db1136 [] db1127 [] db1125 [] db1116 [] db1101 [] db1...
[09:44:02] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui) s7 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1174 [] db1170 [] db1155 [] db1136 [] db1127 [] db1125 [] db1116 [] db1101 [] db1098 [] db1086 [] db1079 [] cloud...
[10:13:01] <wikibugs>	 10DBA: Drop testreduce and testreduce_vd  from m5 master - https://phabricator.wikimedia.org/T276787 (10Marostegui)
[10:20:59] <wikibugs>	 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) s5 and s8 are now up and replicating
[10:21:55] <elukey>	 marostegui: \o/
[10:23:13] <marostegui>	 \o/
[10:23:41] <elukey>	 so we are done?
[10:24:16] <marostegui>	 elukey: Section-wise yes, I am still running a few things, which hopefully will be finished today. I will comment once done and with the next steps needed
[10:24:18] <elukey>	 I mean, ready for beta testing (sqoop etc..)
[10:24:25] <elukey>	 ack super
[10:24:27] <elukey>	 thanks a lot
[10:24:45] <elukey>	 I'll ask my team to run some tests when you give us the green light, to see how it performs
[10:24:53] <marostegui>	 sure!
[10:25:05] <marostegui>	 elukey: you'd still need the views though (I will add that to the comment)
[10:26:57] <elukey>	 ah right that part is a bit obscure to me
[10:27:04] <elukey>	 does it need the cloud db team?
[10:27:57] <marostegui>	 yep
[10:28:03] <elukey>	 ack
[10:28:07] <marostegui>	 Don't worry, I will sum everything up in the comment :)
[10:36:00] <wikibugs>	 10Data-Persistence-Backup, 10Analytics: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10LSobanski) @elukey thanks for reaching out, a few questions: - Is the expectation to do backups continuously or at fixed points in time? - Is the cluster in...
[11:47:39] <wikibugs>	 10Data-Persistence-Backup, 10Analytics: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10elukey) Adding my thoughts about it, then my team will be able to comment :)  >>! In T277015#6899852, @LSobanski wrote: > @elukey thanks for reaching out, a...
[11:49:43] <wikibugs>	 10DBA, 10Analytics-Clusters, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) All the sections have been started and are now in sync with their masters. I have run a check...
[11:49:46] <marostegui>	 elukey: ^ :-)
[11:51:29] <elukey>	 thanks a lot <3
[11:51:34] <wikibugs>	 10DBA, 10Analytics-Clusters, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10elukey) @razzi can you follow up with @Bstorm about the next steps? :)
[11:59:41] <wikibugs>	 10Data-Persistence-Backup, 10Analytics: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10jcrespo) > we don't have particular requirements for the location  The answer to that would mostly be motivated by: how much time could you wait for the reco...
[12:11:01] <wikibugs>	 10Data-Persistence-Backup, 10Analytics: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10elukey) @jcrespo thanks for the info, let me add more notes:  * A day was a random value that I picked and that turned out to be very wrong, I think that we can wait ev...
[12:12:07] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Tgr) >>! In T276968#6899299, @Marostegui wrote: > - Assume that maybe db06 might not have the exact same data as db05 if everything wasn't entirel...
[12:14:49] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Marostegui) >>! In T276968#6900193, @Tgr wrote: >>>! In T276968#6899299, @Marostegui wrote: >> - Assume that maybe db06 might not have the exact s...
[12:18:49] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Majavah) > would both hosts have the same version  How important is this? The old hosts db05 (now-deleted with disk corruption) and db06 have Mari...
[12:19:59] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Marostegui) >>! In T276968#6900223, @Majavah wrote: >> would both hosts have the same version >  > How important is this? The old hosts db05 (now-...
[12:28:08] <wikibugs>	 10Data-Persistence-Backup, 10Analytics: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10jcrespo) > Practically this might be a little problematic in a data recovery scenario  Yes, this is something that I expected, as we had a similar kind of de...
[12:42:55] <jynus>	 wow, that is a lot of clouddb* instances on tendril :-)
[12:43:17] <marostegui>	 and the good thing is that if one fails, it might not affect the others! :p
[12:43:27] <jynus>	 congrats, only 3 labsdb* hosts missing!
[12:43:38] <marostegui>	 well, one of them went on strike yesterday night
[12:43:44] <jynus>	 I heard
[12:44:24] <marostegui>	 It is very suspicious: when servers know their time is coming, they start crashing. It also happened with dbstore1002 a year ago
[12:44:40] <kormat>	 🕵️‍♀️
[12:44:55] <marostegui>	 That wasn't the case with labsdb1001 and labsdb1003 as they were always crashing \o/
[12:45:04] <jynus>	 you mean when dbstore1002 was literally 100% of our backup systemd?
[12:45:11] <jynus>	 *system
[12:45:12] <marostegui>	 No, it was dbstore1001!
[12:45:15] <jynus>	 ah, true
[12:45:20] <marostegui>	 Which also started crashing XD
[12:45:44] <jynus>	 it is difficult to notice because we think things go much slower than desired
[12:45:58] <jynus>	 but we are in a way better state than 5 years ago
[12:46:26] <marostegui>	 definitely!
[12:47:08] <jynus>	 I think I told you before that my first year was 90% of the time attending to labsdb100[1-3] crashes/replication breakages
[12:47:37] <marostegui>	 in the end it was set to idempotent XD
[12:48:06] <jynus>	 and then 1 ticket almost every day from users noting drifts
[12:48:29] <jynus>	 we have no more of those since ROW
[12:51:06] <jynus>	 I remember once rebuilding a labsdb host online for 2 days, completely manually only for drifts to happen the next day
[12:52:44] <jynus>	 speaking about the future, and this is more for cloud than for you-
[12:53:39] <jynus>	 given the number of clouddb instances, we could set up something like production db provisioning on the cloud network to speed up recoveries, if that would help
[12:54:01] <jynus>	 e.g. something along the lines of "clouddbprov"
[14:00:38] <jynus>	 Am I on the wrong url?
[14:00:43] <jynus>	 for the meeting?
[14:33:53] <wikibugs>	 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[14:53:56] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10zeljkofilipin)
[15:58:14] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) While checking all the tables, mysqld crashed again (it was on commons at the time). I am going to go for another option which is start section by section so at lea...
[15:58:59] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s1 made it crash. Going for s2 now
[16:02:31] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s2 and s3 are now replicating. Let's see what happens after a few hours.
[16:06:23] <wikibugs>	 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) db1150:3314 being checked
[16:07:05] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10dancy) >>! In T276968#6899605, @aborrero wrote: > FYI: The root cause for the corruption could be a force-reboot force-migration that I had to per...
[16:24:51] <wikibugs>	 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo)
[16:25:32] <wikibugs>	 10Data-Persistence-Backup, 10SRE, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo)
[16:25:35] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10dduvall) Thanks for the help, everyone. I would still like to get off of db06 if possible at the end of this process since we have to finish the b...
[16:26:11] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10dduvall) Forgot the `UNLOCK TABLES` on db07 :)
[16:37:45] <marxarelli>	 hey all, we're still dealing with fixing the beta dbs over in #wikimedia-releng and i was wondering if anyone has thoughts on https://phabricator.wikimedia.org/T276968#6901184
[16:38:39] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Marostegui) @dduvall if possible I would also set `read_only=ON` on the current master (I guess db06) to be fully sure no writes are happening. If...
[16:39:11] <marostegui>	 marxarelli: I just commented
[16:39:15] <marxarelli>	 ah, thank you!
[16:39:49] <Majavah>	 db06 has already set global read_only=on; fwiw
[16:40:23] <marostegui>	 Majavah: then nothing should be advancing on the slaves, so they all should be in the same binlog/pos and you could proceed with marxarelli's plan
[16:40:57] <Majavah>	 thanks, still finishing the transfers
[16:41:09] <marostegui>	 To be honest, I would do it in different steps, set db06, make sure all is fine and the slave replicates just fine. And once that is fully ok, then proceed with marxarelli's plan
[16:42:10] <marxarelli>	 your plan sounds good to me. let's do that Majavah 
[16:42:29] <Majavah>	 sure, also ping Urbanecm
[16:42:36] <marxarelli>	 Urbanecm: ^
[16:46:31] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10dduvall) From @Marostegui in IRC: "To be honest, I would do it in different steps, set db06, make sure all is fine and the slave replicates just f...
[16:50:39] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s2 and s3 synced with their master with no problem. Going to go for s4 now.
[16:53:16] <wikibugs>	 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10wiki_willy) a:05wiki_willy→03Cmjohnson
[16:53:40] <wikibugs>	 10DBA, 10Analytics-Clusters, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) Sounds good @elukey.  Thanks for your speedy data population @Marostegui! Responding to the firewa...
[16:54:43] <wikibugs>	 10DBA, 10Analytics-Clusters, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) @razzi yeah, it was all fixed by removing the DNS IPv6 record. Nothing else required.
[17:05:20] <Urbanecm>	 marostegui: hey, do we need to make slaves r/w on mariadb level? or will replication just ignore it?
[17:05:44] <marostegui>	 no, slaves can and should remain as read-only
[17:05:49] <marostegui>	 Replication isn't affected by that
[17:05:55] <Urbanecm>	 okay, thanks!
[17:10:00] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s4 seems good, still catching up with no issues. Just started s5...we'll see
[17:13:08] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Urbanecm) 05Open→03Resolved a:05dduvall→03None This was done. Thanks everyone, especially @majavah who de facto led this change!
[17:13:52] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Marostegui) Happy it worked fine!
[17:14:31] <Majavah>	 marostegui: we're still on the Stretch master, we're planning to make sure that this works for a day or so and then switch over
[17:14:37] <Majavah>	 but the new replicas look to be working
[17:15:04] <marostegui>	 Majavah: Excellent!
[17:25:46] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s5 sync'ed fine. s4 still on going s6 just started
[17:35:18] <wikibugs>	 10DBA, 10SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) 05Stalled→03Open
[18:18:02] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s6 sync'ed well with its master s4 still catching up nicely s7 just started and I noticed `InnoDB: load corrupted index index "rc_ns_actor" of table "arwiki"."recen...
[18:19:50] <jynus>	 interesting read: http://www.tusacentral.net/joomla/index.php/mysql-blogs/232-who-is-drop-in-replacement-of
[18:21:58] <marostegui>	 Yeah, which confirms everything we know already. We can only go back with a logical dump 
[18:22:30] <marostegui>	 :(
[18:23:15] <jynus>	 that part we "knew", the second part at the end was a surprise (the slide)
[18:23:29] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) Fixed: ` mysql:root@localhost [arwiki]> check table recentchanges; +----------------------+-------+----------+----------+ | Table                | Op    | Msg_type...
[18:27:01] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10zeljkofilipin)
[18:27:16] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10User-notice: deployment-db05 needs replacing following disk corruption - https://phabricator.wikimedia.org/T276968 (10Majavah)
[19:33:21] <wikibugs>	 10DBA, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Cannot access the database: Unknown error (10.64.0.164) - https://phabricator.wikimedia.org/T277088 (10brennen)
[19:40:03] <wikibugs>	 10DBA, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Cannot access the database: Unknown error (10.64.0.164) - https://phabricator.wikimedia.org/T277088 (10Krinkle)
[19:47:31] <wikibugs>	 10DBA, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Cannot access the database: Unknown error (10.64.0.164) - https://phabricator.wikimedia.org/T277088 (10Umherirrender) See also T268715 for the same error text
[20:02:34] <wikibugs>	 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s7 and s4 synced correctly. I have started s8, which looks corrupted. I will try to fix the inconsistencies tomorrow and see if it replicates fine.  ` Mar 10 20:01:...
[21:26:35] <wikibugs>	 10DBA, 10netbox: Grants not working with DB hosts with to ipv6 - https://phabricator.wikimedia.org/T270101 (10Krinkle) @Marostegui Should we wait with adopting x2 for mainstash-db until this is done?
[21:28:55] <wikibugs>	 10DBA, 10netbox: Grants not working with DB hosts with to ipv6 - https://phabricator.wikimedia.org/T270101 (10Marostegui) I don't think it would matter much. This might take some time. Whatever solution we come up with needs to be applied everywhere, so just 6 more hosts will not make much of a difference
[21:29:12] <wikibugs>	 10DBA, 10netbox: Grants not working with DB hosts with to ipv6 - https://phabricator.wikimedia.org/T270101 (10Krinkle)
[21:29:14] <wikibugs>	 10DBA, 10Patch-For-Review, 10Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (10Krinkle)
[21:30:17] <wikibugs>	 10DBA, 10netbox: Grants not working with DB hosts with to ipv6 - https://phabricator.wikimedia.org/T270101 (10Krinkle) Thanks, I've detached it from the tree blocking T212129 / T113916 / T270223.
[23:49:17] <wikibugs>	 10DBA: fa_deleted_timestamp is binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 (10Ladsgroup)
[23:52:33] <wikibugs>	 10DBA: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 (10Ladsgroup)