[04:45:45] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @Marostegui clouddb1021 has the analytics_multiinstance role applied, is configured to expect data on s1 and s3 only (https://gerrit.wikimedia.org/r/c/...
[05:14:50] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) Thanks @razzi. I have ack'ed the alerts on icinga and I will start working with these sections. If I successfully get them up and running today I...
[05:16:32] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) db2102 is getting some of its tables fixed (rebuilt)
[05:17:17] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) Checking tables on db1175 now
[05:55:05] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) s1 and s3 are now replicating on clouddb1021 (pending enabling GTID - will do it once replication is in sync). Host added to tendril and to zarcil...
[05:58:22] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) Changed clouddb1021 from planned to active on netbox.
[06:44:17] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) I have adjusted the buffer pool sizes a bit. They might need further changing but we'll only know once the sqoops run.
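As background for the buffer-pool adjustment above: on a multi-instance host, total memory has to be divided between the per-section mysqld instances. The sketch below is purely illustrative (the function name, the even split, and the reserve figure are all assumptions; the real clouddb1021 sizes were tuned by hand and likely weighted by section data size):

```python
def buffer_pool_sizes(host_mem_gb, sections, reserve_gb=16):
    """Split usable memory evenly across per-section MariaDB instances.

    Hypothetical helper: an even split after reserving memory for the
    OS and per-connection overhead. Not the actual WMF tuning.
    """
    usable = host_mem_gb - reserve_gb
    if usable <= 0:
        raise ValueError("not enough memory after reserve")
    per_section = usable // len(sections)
    return {section: per_section for section in sections}
```

For example, a 512 GB host carrying s1, s2, s3 and s7 with 32 GB reserved would give each instance a 120 GB pool under this toy scheme.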
[06:48:02] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[06:49:22] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[06:51:11] marostegui: hola hola
[06:51:18] thanks a lot for clouddb1021
[06:52:30] I am a bit confused about the firewall issue, in theory clouddb1021 should be in the private vlan
[06:54:06] so telnet -4 clouddb1021.eqiad.wmnet 3311 works
[06:54:30] I think that it tries with ipv6 first and hangs without it
[06:54:56] going to remove the IPv6 entries in netbox then
[06:55:25] I can do that if you like :)
[06:57:30] done, running netbox's cookbook :)
[06:58:42] fingers crossed
[07:03:23] marostegui: done
[07:03:40] should we leave the ipv6 interface on the host or do you also need that to be removed?
[07:04:09] The interface can stay, that's no problem
[07:04:15] Let's see what cumin does now!
[07:04:36] I guess it will take a few minutes for the change to fully spread
[07:05:01] yeah I was about to say, it still shows the AAAA record, the resolvers are still not up to date
[07:05:02] yep, still resolving the ipv6
[07:05:03] yeah
[07:05:06] # telnet clouddb1021.eqiad.wmnet 3311
[07:05:06] Trying 2620:0:861:101:10:64:0:118...
[07:05:51] When telnet falls back to ipv4 it does work: Trying 10.64.0.118...
[07:05:51] Connected to clouddb1021.eqiad.wmnet.
[07:05:51] Escape character is '^]'.
[07:05:55] so looking good for now
[07:06:01] let's wait then, thanks elukey
[07:06:15] while we wait, ignorant question - did you have to go through the usual boilerplate steps to bootstrap a mariadb instance (removing remote root, tables, etc..) or was that already "done" because of the copy?
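The hang discussed above is the classic "AAAA record resolves but IPv6 is unreachable" failure mode: the client tries the IPv6 address first and waits for a timeout before (or instead of) falling back to IPv4. A minimal sketch of address-order fallback, roughly what telnet does minus the long hang, assuming nothing about the actual WMF tooling:

```python
import socket

def connect_with_fallback(host, port, timeout=5.0):
    """Try each resolved address in getaddrinfo order (AAAA first when
    present), moving to the next address on failure instead of hanging
    on a silently dropped IPv6 route."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(addr)
            return sock
        except OSError as err:
            last_err = err
    raise last_err or OSError("no addresses resolved")
```

Removing the AAAA record from DNS, as done above, sidesteps the problem entirely because getaddrinfo then only returns the IPv4 address.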
[07:06:31] elukey: No, cause I am copying the data dir from another host
[07:06:40] But I do have to go thru configuring replication
[07:06:49] yes yes that makes sense
[07:11:50] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[08:01:33] marostegui: the stretch linux 4.9 update has now been released (albeit there's no DSA yet due to arm builds being ongoing), I'm running some tests and will install fixed kernels fleet-wide once done
[08:02:14] Oh nice!
[08:02:17] Excellent news
[08:16:56] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) s2 and s7 are up and running
[08:18:55] nice ---^
[09:17:27] m1 backups are alerting because of the bacula drop, I will ack them for a week
[09:17:53] thanks
[09:19:08] backup2002 is low on space, I will see if I can delete something
[09:23:00] I got the error "Could not read data from wikidatawiki.blobs_cluster26: Lost connection to MySQL server during query" again
[09:24:12] from both clients at the same time, so most likely backup host network
[09:27:37] something weird happened at 2:54 on the backup host
[09:33:25] marostegui: +1ed for s4 and s6
[09:33:33] thanks :*
[09:34:01] marostegui: works fine in my tests, I have deployed 4.9.258 fleet-wide, maybe reboot 1-2 other less critical mysqls before the master, though
[09:34:41] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui)
[09:34:55] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui) I have scheduled this for 23rd March at 06:00 AM UTC
[09:35:01] moritzm: will do, thank you!
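The "pending enabling GTID - will do it once replication is in sync" step mentioned earlier boils down to a short statement sequence per instance. A sketch of the MariaDB syntax involved, generated as strings here for illustration (the exact WMF procedure may differ, e.g. in how sync is verified):

```python
def enable_gtid_statements():
    """Statements to switch an already-in-sync replica from binlog
    coordinates to GTID replication (MariaDB syntax). Meant to be run
    per instance once Seconds_Behind_Master reaches 0; a sketch, not
    the verbatim WMF runbook."""
    return [
        "STOP SLAVE;",
        "CHANGE MASTER TO MASTER_USE_GTID = slave_pos;",
        "START SLAVE;",
    ]
```

`slave_pos` tells MariaDB to continue from the GTID recorded in `mysql.gtid_slave_pos`, which matches the position the replica had just reached under binlog-coordinate replication.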
[09:39:52] fyi, I am running es backups now, they have very limited concurrency so there should be no impact, but notifying here as normally they run over night
[09:40:08] also they are the codfw ones
[09:40:18] so no impact to production traffic either
[09:42:05] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui)
[09:42:51] PROBLEM - MariaDB sustained replica lag on es2025 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2025&var-port=9104
[09:43:17] oh, maybe there is some impact
[09:44:17] RECOVERY - MariaDB sustained replica lag on es2025 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2025&var-port=9104
[09:48:49] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui) After s6 I will go for s7, as we need to switch that master in two weeks; we might as well get this schema change fully done on the master too
[09:48:52] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui) After s6 I will go for s7, as we need to switch that master in two weeks; we might as well get this schema change fully done on the master too
[10:23:26] 10DBA, 10SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) When doing a reboot for T273280 I just ran into this issue with db2073. db2072 (which has a newer firmware) rebooted just fine.
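The alert strings above encode their thresholds inline: "(C)2 ge (W)1 ge 0" means critical when sustained lag is >= 2 seconds and warning when >= 1. A small mirror of that comparison logic, with the thresholds read off the alert text rather than taken from the actual check's configuration:

```python
def lag_state(lag_seconds, warn=1, crit=2):
    """Mirror of the '(C)2 ge (W)1' semantics in the icinga output:
    CRITICAL when sustained lag >= crit, WARNING when >= warn, else OK.
    Thresholds are assumptions inferred from the alert text."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"
```

So es2025's reported lag of 3 trips CRITICAL, and the later RECOVERY corresponds to lag falling back to 0.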
[10:25:02] 10DBA, 10ops-codfw: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Marostegui)
[10:25:21] 10DBA, 10ops-codfw: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Marostegui) p:05Triage→03Medium
[10:31:06] I'm running the drift checker from the abstract schema for the first time and it's a mess 🙃
[10:31:36] Lots of tickets will be created soon
[10:32:34] haha
[11:28:39] PROBLEM - MariaDB sustained replica lag on db1147 is CRITICAL: 17.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[11:29:49] RECOVERY - MariaDB sustained replica lag on db1147 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[12:34:59] moritzm: I have upgraded the stretch kernel on 6 codfw hosts, one of them didn't come back; as we spoke in private, most likely because of T216240. Let's see how the others do in the next few days
[12:35:00] T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240
[12:35:49] If they are ok, I will upgrade the candidate master for s7, as that is the switchover I have scheduled for 23rd March
[12:36:07] And I guess at some point we need the hosts listed in T216240 to get their firmware upgraded
[12:36:29] At least the candidate masters
[13:04:03] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) s6 eqiad progress: [] db1085 sanitarium master, api [] db1088 basic [] db1096:3316 [] db1098:3316 [] db1113:3316 dump,vslow [] db1125:3316 sanitarium [] db1131 master [] db1139 backup s...
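The drift checker mentioned above compares the abstract (expected) schema against what each host actually has. A toy version of that comparison at the table/column level, to show the shape of the output (the real checker also covers types, indexes, defaults, and more, and its interface is not shown here):

```python
def schema_drift(expected, live):
    """Toy drift check: expected and live map table name -> column
    list. Returns a report of missing tables and column differences.
    Illustrative only; not the actual WMF drift checker."""
    drift = {}
    for table, cols in expected.items():
        have = live.get(table)
        if have is None:
            drift[table] = "missing table"
        elif set(have) != set(cols):
            drift[table] = {
                "missing": sorted(set(cols) - set(have)),
                "extra": sorted(set(have) - set(cols)),
            }
    return drift
```

A host missing, say, a column that the abstract schema defines would show up as one entry per drifted table, which is why a first-ever run against a long-lived fleet produces "a mess" of tickets.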
[13:04:26] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat)
[13:06:46] kormat: ^ I am deploying a change on s6 too in eqiad, I am doing db1098 and db1096 in a bit, after that I will not do more hosts today
[13:09:41] marostegui: perfect. that'll make it double-efficient
[13:10:10] kormat: do you want me to ping you when I depool those two?
[13:10:52] marostegui: sounds good, thanks :)
[13:17:11] kormat: want to deploy your change on db1098:3316?
[13:17:47] I will deploy mine after yours
[13:17:54] db1098:3316 is depooled
[13:18:01] marostegui: your commit message said db1198. tricky.
[13:18:05] going now
[13:18:07] typooo
[13:18:36] marostegui: done
[13:18:52] ok, I will deploy mine now, and then start repooling, will take like 1h
[13:18:57] the whole process I mean
[13:20:56] 👍
[13:44:15] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) s4 and s6 are now replicating
[13:44:55] \o/ --^
[13:45:34] \o\ |o| /o/
[13:48:42] great work, everybody involved
[13:49:51] the cpu usage profile on backup2001 and backup2002 is very different. I need to research why (misconfiguration or traffic-related?).
[13:50:06] sorry, I meant backup2002 vs backup1002
[13:51:31] ah, it is misconfiguration, and that is why lag was created before
[13:52:08] and why I may have connection errors
[13:52:32] what thing was misconfigured?
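The exchange above is a rolling schema change: each replica is depooled, altered (here by two people stacking their changes on the same depool), then repooled before moving on. A sketch of that loop with the depool/alter/repool steps as injected callables; the callables are hypothetical stand-ins, not the real dbctl or ALTER tooling:

```python
def run_schema_change(hosts, depool, alter, repool):
    """One-host-at-a-time rolling schema change. depool/alter/repool
    are injected callables (hypothetical stand-ins); repool runs even
    if the alter fails, so a host is never left depooled silently."""
    done = []
    for host in hosts:
        depool(host)
        try:
            alter(host)
        finally:
            repool(host)
        done.append(host)
    return done
```

Batching both changes inside one depool window, as done with db1098:3316 above, halves the number of depool/repool cycles, which is the "double-efficient" part.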
[13:52:50] I am going to send a patch and add you as a reviewer in 1 sec :-D
[13:53:00] \o/
[13:58:54] marostegui, https://gerrit.wikimedia.org/r/c/operations/puppet/+/670174
[13:59:03] yeah, just commented :)
[14:44:30] PROBLEM - MariaDB sustained replica lag on db2090 is CRITICAL: 28.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[15:06:23] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[15:17:34] RECOVERY - MariaDB sustained replica lag on db2090 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[17:39:26] 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate the need to generate and maintain zookeeper backups - https://phabricator.wikimedia.org/T274808 (10Ottomata) Hello! Why not eh? I don't think this is high priority for us, but it is certainly a good idea. Feel free to reach out to any Data/Analytics E...
[18:25:35] 10DBA, 10SRE, 10ops-codfw: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Papaul) 05Open→03Resolved Complete. Before: BIOS 2.4.3, iDRAC 2.40. After: BIOS 2.12, iDRAC 2.75
[18:25:38] 10DBA, 10SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Papaul)
[19:28:29] We were paged for systemd on labsdb1009 because wmf-pt-kill was disconnected. I'm going to make a ticket for it, but that's a deprecated server, and the database looks ok on a quick check. I'm not sure it's a huge deal, but it seems worth documenting at least
[19:30:25] Oh! I see why. mysqld seems to have restarted.
Well, I'll make a ticket anyway
[19:35:40] Unfortunately, that means replication is stopped and few workloads, if any, have migrated to the new cluster :(
[20:13:41] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) I have started replication on the host and as soon as I did that, mysql crashed again with the same error, so it is not looking great. I will dig into this tomorrow....
[20:14:55] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) I have started a table check across all databases; that will likely take more than 12h to complete. Let's see what it reports tomorrow morning. MySQL is up but repl...
[20:20:02] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Bstorm) I'll depool the server since it isn't stable or up to date.
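The "table check across all databases" started on labsdb1009 is a serial sweep of CHECK TABLE over every table, which is why it can take 12h+ on a large host. A sketch that just generates such a sweep; the table inventory and the use of the EXTENDED option are assumptions for illustration (the actual check options used on labsdb1009 are not stated above):

```python
def check_table_statements(tables_by_db):
    """Generate CHECK TABLE statements for a full-host sweep.
    tables_by_db maps database name -> list of table names; EXTENDED
    is the slow, thorough variant and is an assumed choice here."""
    stmts = []
    for db, tables in sorted(tables_by_db.items()):
        for table in sorted(tables):
            stmts.append(f"CHECK TABLE `{db}`.`{table}` EXTENDED;")
    return stmts
```

In practice the inventory would come from `information_schema.tables`, and the statements would be fed to the server one at a time so a crash mid-sweep loses at most one table's progress.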