[04:45:45] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @Marostegui clouddb1021 has the analytics_multiinstance role applied, is configured to expect data on s1 and s3 only (https://gerrit.wikimedia.org/r/c/...
[05:14:50] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) Thanks @razzi. I have ack'ed the alerts on icinga and I will start working with these sections. If I successfully get them up and running today I...
[05:16:32] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) db2102 is getting some of its tables fixed (rebuilt)
[05:17:17] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) Checking tables on db1175 now
[05:55:05] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) s1 and s3 are now replicating on clouddb1021 (pending enabling GTID - will do it once replication is in sync). Host added to tendril and to zarcil...
[05:58:22] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) Changed clouddb1021 from planned to active on netbox.
[06:44:17] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) I have adjusted the buffer pool sizes a bit. They might need further changing but we'll only know once the sqoops run.
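As background for the buffer-pool adjustment above: on a multi-instance host, total memory has to be divided between the per-section mysqld instances. The sketch below is purely illustrative (the function name, the even split, and the reserve figure are all assumptions; the real clouddb1021 sizes were tuned by hand and likely weighted by section data size):

```python
def buffer_pool_sizes(host_mem_gb, sections, reserve_gb=16):
    """Split usable memory evenly across per-section MariaDB instances.

    Hypothetical helper: an even split after reserving memory for the
    OS and per-connection overhead. Not the actual WMF tuning.
    """
    usable = host_mem_gb - reserve_gb
    if usable <= 0:
        raise ValueError("not enough memory after reserve")
    per_section = usable // len(sections)
    return {section: per_section for section in sections}
```

For example, a 512 GB host carrying s1, s2, s3 and s7 with 32 GB reserved would give each instance a 120 GB pool under this toy scheme.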
[06:48:02] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[06:49:22] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[06:51:11] marostegui: hola hola
[06:51:18] thanks a lot for clouddb1021
[06:52:30] I am a bit confused about the firewall issue, in theory clouddb1021 should be in the private vlan
[06:54:06] so telnet -4 clouddb1021.eqiad.wmnet 3311 works
[06:54:30] I think that it tries with ipv6 first and hangs without it
[06:54:56] going to remove the IPv6 entries in netbox then
[06:55:25] I can do that if you like :)
[06:57:30] done, running netbox's cookbook :)
[06:58:42] fingers crossed
[07:03:23] marostegui: done
[07:03:40] should we leave the ipv6 interface on the host or do you also need that to be removed?
[07:04:09] The interface can stay, that's no problem
[07:04:15] Let's see what cumin does now!
[07:04:36] I guess it will take a few minutes for the change to fully spread
[07:05:01] yeah I was about to say, it still shows the AAAA record, the resolvers are still not up to date
[07:05:02] yep, still resolving the ipv6
[07:05:03] yeah
[07:05:06] # telnet clouddb1021.eqiad.wmnet 3311
[07:05:06] Trying 2620:0:861:101:10:64:0:118...
[07:05:51] When telnet falls back to ipv4 it does work: Trying 10.64.0.118...
[07:05:51] Connected to clouddb1021.eqiad.wmnet.
[07:05:51] Escape character is '^]'.
[07:05:55] so looking good for now
[07:06:01] let's wait then, thanks elukey
[07:06:15] while we wait, ignorant question - did you have to go through the usual boilerplate steps to bootstrap a mariadb instance (removing remote root, tables, etc..) or was that already "done" because of the copy?
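The hang discussed above is the classic "AAAA record resolves but IPv6 is unreachable" failure mode: the client tries the IPv6 address first and waits for a timeout before (or instead of) falling back to IPv4. A minimal sketch of address-order fallback, roughly what telnet does minus the long hang, assuming nothing about the actual WMF tooling:

```python
import socket

def connect_with_fallback(host, port, timeout=5.0):
    """Try each resolved address in getaddrinfo order (AAAA first when
    present), moving to the next address on failure instead of hanging
    on a silently dropped IPv6 route."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(addr)
            return sock
        except OSError as err:
            last_err = err
    raise last_err or OSError("no addresses resolved")
```

Removing the AAAA record from DNS, as done above, sidesteps the problem entirely because getaddrinfo then only returns the IPv4 address.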
[07:06:31] elukey: No, cause I am copying the data dir from another host
[07:06:40] But I do have to go thru configuring replication
[07:06:49] yes yes that makes sense
[07:11:50] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[08:01:33] marostegui: the stretch linux 4.9 update has now been released (albeit there's no DSA yet due to arm builds being ongoing), I'm running some tests and will install fixed kernels fleet-wide once done
[08:02:14] Oh nice!
[08:02:17] Excellent news
[08:16:56] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) s2 and s7 are up and running
[08:18:55] nice ---^
[09:17:27] m1 backups are alerting because of the bacula drop, I will ack them for a week
[09:17:53] thanks
[09:19:08] backup2002 is low on space, I will see if I can delete something
[09:23:00] I got the error "Could not read data from wikidatawiki.blobs_cluster26: Lost connection to MySQL server during query" again
[09:24:12] from both clients at the same time, so most likely backup host network
[09:27:37] something weird happened at 2:54 on the backup host
[09:33:25] marostegui: +1ed for s4 and s6
[09:33:33] thanks :*
[09:34:01] marostegui: works fine in my tests, I have deployed 4.9.258 fleet-wide, maybe reboot 1-2 other less critical mysqls before the master, though
[09:34:41] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui)
[09:34:55] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui) I have scheduled this for 23rd March at 06:00 AM UTC
[09:35:01] moritzm: will do, thank you!
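The "pending enabling GTID - will do it once replication is in sync" step mentioned earlier boils down to a short statement sequence per instance. A sketch of the MariaDB syntax involved, generated as strings here for illustration (the exact WMF procedure may differ, e.g. in how sync is verified):

```python
def enable_gtid_statements():
    """Statements to switch an already-in-sync replica from binlog
    coordinates to GTID replication (MariaDB syntax). Meant to be run
    per instance once Seconds_Behind_Master reaches 0; a sketch, not
    the verbatim WMF runbook."""
    return [
        "STOP SLAVE;",
        "CHANGE MASTER TO MASTER_USE_GTID = slave_pos;",
        "START SLAVE;",
    ]
```

`slave_pos` tells MariaDB to continue from the GTID recorded in `mysql.gtid_slave_pos`, which matches the position the replica had just reached under binlog-coordinate replication.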
[09:39:52] fyi, I am running es backups now, they have very limited concurrency so there should be no impact, but notifying here as normally they run over night
[09:40:08] also they are the codfw ones
[09:40:18] so no impact to production traffic either
[09:42:05] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui)
[09:42:51] PROBLEM - MariaDB sustained replica lag on es2025 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2025&var-port=9104
[09:43:17] oh, maybe there is some impact
[09:44:17] RECOVERY - MariaDB sustained replica lag on es2025 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2025&var-port=9104
[09:48:49] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui) After s6 I will go for s7, as we need to switch that master in two weeks; we might as well get this schema change fully done on the master too
[09:48:52] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui) After s6 I will go for s7, as we need to switch that master in two weeks; we might as well get this schema change fully done on the master too
[10:23:26] 10DBA, 10SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) When doing a reboot for T273280 I just ran into this issue with db2073. db2072 (which has a newer firmware) rebooted just fine.
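The alert strings above encode their thresholds inline: "(C)2 ge (W)1 ge 0" means critical when sustained lag is >= 2 seconds and warning when >= 1. A small mirror of that comparison logic, with the thresholds read off the alert text rather than taken from the actual check's configuration:

```python
def lag_state(lag_seconds, warn=1, crit=2):
    """Mirror of the '(C)2 ge (W)1' semantics in the icinga output:
    CRITICAL when sustained lag >= crit, WARNING when >= warn, else OK.
    Thresholds are assumptions inferred from the alert text."""
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"
```

So es2025's reported lag of 3 trips CRITICAL, and the later RECOVERY corresponds to lag falling back to 0.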
[10:25:02] 10DBA, 10ops-codfw: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Marostegui)
[10:25:21] 10DBA, 10ops-codfw: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Marostegui) p:05Triage→03Medium
[10:31:06] I'm running the drift checker from the abstract schema for the first time and it's a mess 🙃
[10:31:36] Lots of tickets will be created soon
[10:32:34] haha
[11:28:39] PROBLEM - MariaDB sustained replica lag on db1147 is CRITICAL: 17.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[11:29:49] RECOVERY - MariaDB sustained replica lag on db1147 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1147&var-port=9104
[12:34:59] moritzm: I have upgraded the stretch kernel on 6 codfw hosts, one of them didn't come back; as we spoke in private, most likely because of T216240. Let's see how the others do in the next few days
[12:35:00] T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240
[12:35:49] If they are ok, I will upgrade the candidate master for s7, as that is the switchover I have scheduled for 23rd March
[12:36:07] And I guess at some point we need the hosts listed in T216240 to get their firmware upgraded
[12:36:29] At least the candidate masters
[13:04:03] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) s6 eqiad progress: [] db1085 sanitarium master, api [] db1088 basic [] db1096:3316 [] db1098:3316 [] db1113:3316 dump,vslow [] db1125:3316 sanitarium [] db1131 master [] db1139 backup s...
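The drift checker mentioned above compares the abstract (expected) schema against what each host actually has. A toy version of that comparison at the table/column level, to show the shape of the output (the real checker also covers types, indexes, defaults, and more, and its interface is not shown here):

```python
def schema_drift(expected, live):
    """Toy drift check: expected and live map table name -> column
    list. Returns a report of missing tables and column differences.
    Illustrative only; not the actual WMF drift checker."""
    drift = {}
    for table, cols in expected.items():
        have = live.get(table)
        if have is None:
            drift[table] = "missing table"
        elif set(have) != set(cols):
            drift[table] = {
                "missing": sorted(set(cols) - set(have)),
                "extra": sorted(set(have) - set(cols)),
            }
    return drift
```

A host missing, say, a column that the abstract schema defines would show up as one entry per drifted table, which is why a first-ever run against a long-lived fleet produces "a mess" of tickets.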
[13:04:26] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat)
[13:06:46] kormat: ^ I am deploying a change on s6 too in eqiad, I am doing db1098 and db1096 in a bit, after that I will not do more hosts today
[13:09:41] marostegui: perfect. that'll make it double-efficient
[13:10:10] kormat: do you want me to ping you when I depool those two?
[13:10:52] marostegui: sounds good, thanks :)
[13:17:11] kormat: want to deploy your change on db1098:3316?
[13:17:47] I will deploy mine after yours
[13:17:54] db1098:3316 is depooled
[13:18:01] marostegui: your commit message said db1198. tricky.
[13:18:05] going now
[13:18:07] typooo
[13:18:36] marostegui: done
[13:18:52] ok, I will deploy mine now, and then start repooling, will take like 1h
[13:18:57] the whole process I mean
[13:20:56] 👍
[13:44:15] 10DBA, 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) s4 and s6 are now replicating
[13:44:55] \o/ --^
[13:45:34] \o\ |o| /o/
[13:48:42] great work, everybody involved
[13:49:51] the cpu usage profile on backup2001 and backup2002 is very different. I need to research why (misconfiguration or traffic-related?).
[13:50:06] sorry, I meant backup2002 vs backup1002
[13:51:31] ah, it is misconfiguration, and that is why lag was created before
[13:52:08] and why I may have connection errors
[13:52:32] what thing was misconfigured?
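The exchange above is a rolling schema change: each replica is depooled, altered (here by two people stacking their changes on the same depool), then repooled before moving on. A sketch of that loop with the depool/alter/repool steps as injected callables; the callables are hypothetical stand-ins, not the real dbctl or ALTER tooling:

```python
def run_schema_change(hosts, depool, alter, repool):
    """One-host-at-a-time rolling schema change. depool/alter/repool
    are injected callables (hypothetical stand-ins); repool runs even
    if the alter fails, so a host is never left depooled silently."""
    done = []
    for host in hosts:
        depool(host)
        try:
            alter(host)
        finally:
            repool(host)
        done.append(host)
    return done
```

Batching both changes inside one depool window, as done with db1098:3316 above, halves the number of depool/repool cycles, which is the "double-efficient" part.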
[13:52:50] I am going to send a patch and add you as a reviewer in 1 sec :-D
[13:53:00] \o/
[13:58:54] marostegui, https://gerrit.wikimedia.org/r/c/operations/puppet/+/670174
[13:59:03] yeah, just commented :)
[14:44:30] PROBLEM - MariaDB sustained replica lag on db2090 is CRITICAL: 28.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[15:06:23] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[15:17:34] RECOVERY - MariaDB sustained replica lag on db2090 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2090&var-port=9104
[17:39:26] 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate the need to generate and maintain zookeeper backups - https://phabricator.wikimedia.org/T274808 (10Ottomata) Hello! Why not eh? I don't think this is high priority for us, but it is certainly a good idea. Feel free to reach out to any Data/Analytics E...
[18:25:35] 10DBA, 10SRE, 10ops-codfw: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Papaul) 05Open→03Resolved Complete. Before: BIOS 2.4.3, iDRAC 2.40. After: BIOS 2.12, iDRAC 2.75
[18:25:38] 10DBA, 10SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Papaul)
[19:28:29] We were paged for systemd on labsdb1009 because wmf-pt-kill was disconnected. I'm going to make a ticket for it, but that's a deprecated server, and the database looks ok on a quick check. I'm not sure it's a huge deal, but it seems worth documenting at least
[19:30:25] Oh! I see why. mysqld seems to have restarted.
Well, I'll make a ticket anyway
[19:35:40] Unfortunately, that means replication is stopped and few workloads, if any, have migrated to the new cluster :(
[20:13:41] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) I have started replication on the host and as soon as I did that, mysql crashed again with the same error, so it is not looking great. I will dig into this tomorrow....
[20:14:55] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) I have started a table check across all databases; that will likely take more than 12h to complete. Let's see what it reports tomorrow morning. MySQL is up but repl...
[20:20:02] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Bstorm) I'll depool the server since it isn't stable or up to date.
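The "table check across all databases" started on labsdb1009 is a serial sweep of CHECK TABLE over every table, which is why it can take 12h+ on a large host. A sketch that just generates such a sweep; the table inventory and the use of the EXTENDED option are assumptions for illustration (the actual check options used on labsdb1009 are not stated above):

```python
def check_table_statements(tables_by_db):
    """Generate CHECK TABLE statements for a full-host sweep.
    tables_by_db maps database name -> list of table names; EXTENDED
    is the slow, thorough variant and is an assumed choice here."""
    stmts = []
    for db, tables in sorted(tables_by_db.items()):
        for table in sorted(tables):
            stmts.append(f"CHECK TABLE `{db}`.`{table}` EXTENDED;")
    return stmts
```

In practice the inventory would come from `information_schema.tables`, and the statements would be fed to the server one at a time so a crash mid-sweep loses at most one table's progress.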