[08:14:09] can we move db1063 and db1051?
[08:14:24] yep
[08:14:31] db1113 is doing fine, no errors, so go ahead
[08:14:40] I will take care of that
[08:14:44] thank you <3
[08:15:12] vslow are slow to depool
[08:49:53] You added "# to be moved to m5" to db1051, that is a misspelling of m1/m2, right?
[08:50:52] correct, sorry
[08:50:54] will amend
[08:51:00] don't worry
[08:51:05] I am working on it
[08:51:15] just wanted to be sure I was on the right track
[08:51:49] I like to double-check everything, sorry I am annoying
[08:52:09] where's the m5 on the commit?
[08:52:27] I am working on it, will ask for review
[08:53:06] you are not annoying at all, I don't trust myself, so it is perfectly possible I have misspelled things and all that :)
[09:00:41] 10DBA: Check data consistency across production shards - https://phabricator.wikimedia.org/T183735#4045776 (10Marostegui)
[09:00:48] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4045777 (10Marostegui)
[09:01:16] 10DBA, 10Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#4045772 (10Marostegui) 05Open>03Resolved The user table is done, so I consider this ticket resolved. _All_ the tables have been checked by using: - leftovers from pt-table-checksum - By running mydumper...
[09:01:18] 10DBA: Check data consistency across production shards - https://phabricator.wikimedia.org/T183735#3862188 (10Marostegui)
[09:01:32] 10DBA: Check data consistency across production shards - https://phabricator.wikimedia.org/T183735#3862188 (10Marostegui) All the sections have been checked. Some of them were checked quite some time ago, so I am going to run some more checks before considering this resolved.
[09:38:01] 10DBA, 10Goal, 10Patch-For-Review: Decommission database hosts <= db2031 (tracking) - https://phabricator.wikimedia.org/T176243#3618669 (10Marostegui) The DBA side is all done - only pending the DC Ops steps for: T187886 T187543 T187768
[09:48:12] 10DBA, 10Analytics, 10Collaboration-Team-Triage, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623#2080811 (10Marostegui) @Milimetric with all the clean-up work done on the EL servers, is this still a valid task?
[10:11:56] about to trash db1051 and db1063
[10:12:14] just checking there is nothing to keep there
[10:18:47] good to go
[10:25:57] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4045920 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1063.eqiad.wmnet'] ``` The log...
[10:26:39] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4045922 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1051.eqiad.wmnet'] ``` The log...
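For context on the consistency-check tickets above: a minimal sketch of the kind of single-table spot check T162807 describes, run against two placeholder hosts. The actual work used pt-table-checksum and mydumper; the host names below are illustrative, not the real s1 hosts.

```
# Minimal sketch: compare one table's checksum across two hosts and diff the
# output by eye. Host names are placeholders. Note that CHECKSUM TABLE reads
# the whole table, so run it off-peak on large tables.
for host in db1067.eqiad.wmnet db1080.eqiad.wmnet; do
  echo "== $host =="
  mysql -h "$host" -e "CHECKSUM TABLE enwiki.user;"
done
```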
[10:30:29] I will try to do the proxy failover at the same time as the database one
[10:30:41] ah, nice
[10:43:34] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Epic: Make wb_terms table fancy - https://phabricator.wikimedia.org/T188992#4045983 (10Lucas_Werkmeister_WMDE)
[10:45:20] 10DBA: 66 rows from external storage (dewiki) gave duplicate key errors on master failover - https://phabricator.wikimedia.org/T152385#4045991 (10Marostegui) 05Open>03Resolved a:03Marostegui As I am in the "checksumming" state of mind, I have also taken care of this. The table is consistent across all the...
[11:26:50] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046125 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1063.eqiad.wmnet'] ``` The log...
[11:27:10] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046130 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1051.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1051.eqiad.wmnet'] ```
[11:31:21] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046150 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1051.eqiad.wmnet'] ``` The log...
[11:33:39] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046158 (10jcrespo) Installing...{F15259084}
[11:33:58] pfff
[11:45:46] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4046220 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` and were **ALL** successful.
[11:45:57] \o/
[11:50:14] can you reconnect db1009 as a replica, so we can run some pt-table-checksum on all the m* hosts?
[11:50:27] (assuming they use statement)
[11:50:52] what?
[11:51:24] db1009, can it be reconnected to db1073?
[11:51:33] it is connected already
[11:51:36] ok
[11:51:44] it is replicating, but only listening on 127.0.0.1
[11:51:49] I did that on purpose
[11:52:02] do you know if replication is statement or row?
[11:52:09] I think it is statement
[11:52:12] let me check
[11:52:27] mixed
[11:52:53] I would like to run pt-table-checksum on the current master (we don't care if it creates lag)
[11:53:12] not only on that one, mostly for m2 and m1
[11:53:18] sure, do you want me to restart db1009 so it listens on 0.0.0.0?
[11:53:54] no need, I can query the replica once it finishes
[11:53:56] Not sure if it is worth doing it for m5 really, but if you want to include it in your checks, fine by me
[11:54:06] I want it mostly for otrs
[11:54:10] 10DBA, 10Analytics, 10Collaboration-Team-Triage, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623#4046264 (10Milimetric) It looks to me like those tables still exist and there's still data on the box that analytics-slave points to, so yeah, I think they...
[11:58:00] I was going to copy db2044's contents, and then I thought: a codfw host that has never been actively used?
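The binlog-format check and the checksum run discussed above could look roughly like this. This is a sketch only: the database name is illustrative and credentials are assumed to be in ~/.my.cnf. MIXED, which the chat found, is workable because pt-table-checksum switches its own session to STATEMENT.

```
# Confirm the replication format on the m2 master (the chat found MIXED):
mysql -h db1073.eqiad.wmnet -e "SELECT @@GLOBAL.binlog_format;"

# Run at full speed without waiting for any replica, as described above:
# --recursion-method=none skips replica discovery (and thus lag throttling),
# and results land in the default percona.checksums table, which can be
# queried from db1009 once the run finishes.
pt-table-checksum h=db1073.eqiad.wmnet \
  --no-check-binlog-format \
  --recursion-method=none \
  --databases=otrs
```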
Probably it drifted a long time ago
[11:58:02] anyway, to make it listen on 0.0.0.0, just restart mysql; I added it to my.cnf and then puppet wiped it, so it should be good if you just restart :)
[11:58:26] no need, I will run it at full speed without waiting for any replica
[11:58:51] haha cool
[12:37:43] 10DBA, 10Community-Tech, 10MediaWiki-extensions-GlobalPreferences, 10Patch-For-Review, 10Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4046393 (10mark) >>! In T184666#4033042, @kaldari wrote: > @jcrespo: One thing that hasn't been mentioned is that Glo...
[13:52:02] 10DBA, 10Analytics, 10Collaboration-Team-Triage, 10Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623#4046608 (10mforns) The Echo schema is present in EventLogging's purging white-list, see: https://github.com/wikimedia/puppet/blob/production/modules/profile...
[13:52:42] 10DBA, 10Data-Services, 10cloud-services-team, 10wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#4046610 (10chasemp) 05Open>03stalled p:05Normal>03Lowest
[15:42:48] I am thinking of a strategy for the proxy failover, but I want to know your opinion
[15:43:20] I am thinking of failing over the dns to point to the new and updated ones
[15:43:22] shoot
[15:43:39] ah, and then working on the proxies without any issues
[15:43:46] and even if dns takes some time to propagate or connections don't get immediately changed
[15:43:50] yeah
[15:43:59] both proxies point to the same server
[15:44:05] indeed
[15:44:07] if after a day that is still an issue
[15:44:23] we force a reload before/after the master failover
[15:44:34] you can also change dns to point to the db itself
[15:44:37] which we will do with strict read-write-only-for one
[15:44:56] why?
[15:45:03] it is just another option
[15:45:13] the problem here is not the proxies themselves
[15:45:21] but the apps not reloading the connections often enough
[15:45:58] and we already have redundant proxies; the problem is the migration
[15:46:31] then tomorrow we can check if there are connections coming from the wrong proxy and reload them
[15:46:38] sounds good to me
[15:46:50] and maybe aim for a Thursday double failover?
[15:46:52] you think they will last for hours?
[15:46:59] I don't know
[15:47:06] dns changes in seconds/minutes
[15:47:14] but apps can have any logic
[15:47:29] from never reconnecting to doing it every few seconds
[15:47:34] I would do one failover per day instead of both of them
[15:47:49] then tomorrow and Thursday?
[15:47:54] sounds good
[15:47:57] Thursday and Friday?
[15:48:08] maybe tomorrow if it is ready, to avoid changes on Friday
[15:48:13] if not, Thursday and Friday look good
[15:48:21] one is ready already
[15:48:33] but if we need help
[15:48:42] we may need a longer pre-warning window
[15:48:59] let's see if akosiaris can do it tomorrow?
[15:49:42] ?
[15:49:58] * akosiaris reading backlog
[15:50:11] akosiaris: how does a misc failover tomorrow sound to you? :p
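The verification step for the DNS-based failover sketched above could be done along these lines. The record name is illustrative, and the processlist query assumes an account privileged enough to see all connections.

```
# Check what the service record resolves to after the change
# (the record name below is an assumption, not the real one):
dig +short m2-master.eqiad.wmnet

# On the backend database, group current connections by client address to
# find apps still coming in through the old proxy:
mysql -h db1073.eqiad.wmnet -e \
  "SELECT SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
     FROM information_schema.processlist GROUP BY client;"
```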
[15:50:15] it is what we mentioned this morning
[15:51:08] I don't know what the SLA of otrs is
[15:51:17] in theory this should be seconds of downtime
[15:51:31] same for gerrit
[15:52:39] not even downtime, just read-only
[15:52:46] yeah
[15:55:24] and to be fair, for m1 we need more people
[15:55:33] network for librenms
[15:55:52] the generic apache services nobody really owns
[15:56:27] last time we enrolled a few people: https://wikitech.wikimedia.org/wiki/MariaDB/misc
[15:56:43] maybe for m1 it is better to send an email to ops
[15:56:48] to gather some people
[15:56:54] for example, I would wait to see when bacula doesn't have anything running
[15:57:05] we can send an email with the list of databases and see who speaks up :)
[15:58:02] nobody will speak up, it needs a finger
[15:58:52] vgutierrez is on duty this week, this is a very formative experience!
[15:59:20] so, m2 is the one that can be done tomorrow more or less easily?
[15:59:47] m2 is mostly otrs and gerrit
[15:59:59] I don't know about the others
[16:00:37] yeah, I think we can do m2 tomorrow, and for m1 I would send an email first, to see if there are some volunteers to help, and if not, we can point at people :-)
[16:03:22] let me finish the otrs cloning and I will send an email with a proposal
[16:03:28] sounds good :)
[16:03:35] thanks
[16:04:45] I would also try the dns change now
[16:04:55] sure
[16:05:01] once I finish updating the proxies
[16:12:04] https://gerrit.wikimedia.org/r/419216
[16:12:20] I've just upgraded and restarted 1001 and 1007
[16:12:38] and they should have the same config as 1006 and 1002
[16:12:44] (the current active ones)
[16:12:56] right
[16:13:02] I was actually checking the config XD
[16:13:06] not because I don't trust you
[16:13:12] oh, you were doing good
[16:13:13] just to do a proper review :)
[16:13:19] yes, please do
[16:13:33] I also double-checked the active ones with netstat
[16:13:45] the bad thing that can happen
[16:13:52] looks good to me
[16:13:57] is that the master fails, and only 1 detects it
[16:14:14] m1 and m2 have never done that
[16:14:21] only m5 misbehaved
[16:15:40] so, today you upgraded 1001 and 1007 only?
[16:15:50] only, as in: those are the ones you did?
[16:16:19] yes
[16:16:20] I was updating the paste m0ritzm has to track the affected kernels
[16:16:22] cool
[16:16:41] I think the replica of m2 may be complaining, right?
[16:16:47] because I have not put it up yet?
[16:17:01] on a proxy level, you mean, no?
[16:17:05] yes
[16:17:08] yeah, it will
[16:17:35] I am going to continue
[16:17:40] sure
[16:19:41] for now, literally no change has happened
[16:20:40] 2 new connections through 1001
[16:21:22] more through dbproxy1007
[16:21:55] \o/
[18:50:53] 10DBA, 10Operations, 10hardware-requests, 10ops-codfw, 10Patch-For-Review: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047594 (10RobH)
[18:58:59] 10DBA, 10Operations, 10hardware-requests, 10ops-codfw: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047627 (10RobH)
[18:59:14] 10DBA, 10Operations, 10hardware-requests, 10ops-codfw: Decommission db2030 - https://phabricator.wikimedia.org/T187768#3984964 (10RobH) a:05RobH>03Papaul @papaul: ready for onsite disk wipe
[19:26:48] 10DBA, 10Operations, 10hardware-requests, 10ops-codfw: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4047703 (10Papaul) @RobH thanks
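Going back to the netstat double-check of the active proxies mentioned at 16:13 and the connection counts at 16:20: a sketch of how connections per dbproxy could be counted, assuming the proxies accept clients on the standard MySQL port (an assumption; the real listener port may differ).

```
# On each dbproxy host, count established connections on port 3306 to confirm
# traffic is arriving through the expected proxy (port is an assumption):
sudo netstat -tn | awk '$6 == "ESTABLISHED" && $4 ~ /:3306$/ {count++} END {print count+0}'
```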