[05:04:03] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Marostegui) p:05Triage>03High a:03Cmjohnson @Cmjohnson can we replace this as soon as possible? This is enwiki primary master
[05:08:02] banyek|away: so what happened with the semisync check in the end?
[07:30:23] good morning
[07:30:43] What happened: I'll start with that now
[07:30:50] great
[07:31:21] Let's also try to get the schema change plan first draft ready during the week, so we can start discussing it too
[07:31:40] ok
[07:32:04] as I expect it to take a long time until we can actually start deploying it ;)
[07:43:28] I am checking this semi-sync replication thing, and I have to ask: was that set up in the past at all?
[07:43:35] I mean a lot of servers are missing it
[07:44:00] I mean a lot of servers are missing it - that is on purpose - see puppet
[07:44:14] ok, cool
[07:44:34] I am just trying to check it with the 'slaves' on tendril
[07:44:46] I have to read (later) about this
[07:45:07] I am using this: `SHOW VARIABLES LIKE 'rpl_semi_sync_%_enabled'`
[07:45:34] and the funny part is that there are servers where the result is nothing - that's ok, those servers are old, like dbstore1002
[07:45:44] most of the servers have both (master|slave)
[07:45:48] which is also ok
[07:46:07] but there are servers where only one of those variables exists
[07:46:33] ```
db1124
rpl_semi_sync_master_enabled OFF
db1119
rpl_semi_sync_slave_enabled ON
db1118
rpl_semi_sync_master_enabled OFF
rpl_semi_sync_slave_enabled ON
```
[07:46:39] like those ^
[07:47:40] you are comparing apples and oranges
[07:47:47] db1118 is not a core host
[07:48:07] and db1124 is sanitarium
[07:48:17] so neither
[07:48:57] you probably want to use cumin aliases to narrow things down
[07:49:12] yeah, probably
[07:49:32] but now my problem was just this: why the output there wasn't like
[07:49:38] ```
db1124
rpl_semi_sync_master_enabled OFF
rpl_semi_sync_slave_enabled OFF
db1119
rpl_semi_sync_master_enabled OFF
rpl_semi_sync_slave_enabled ON
db1118
rpl_semi_sync_master_enabled OFF
rpl_semi_sync_slave_enabled ON
db1114
rpl_semi_sync_master_enabled OFF
rpl_semi_sync_slave_enabled ON
```
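A minimal sketch of the per-host check being discussed above, assuming the host list has already been narrowed down (by hand or via a cumin alias) to core replicas only; the host names are purely illustrative and mysql.py is the wrapper that appears later in the log:

```
#!/bin/bash
# Check both semi-sync variables on a set of core replicas.
# The host list is a placeholder; in practice it would come from a cumin
# alias or the section tool, filtered down to core hosts.
HOSTS="db1080 db1083 db1089 db1106 db1114 db1119"

for host in $HOSTS; do
    echo "== $host"
    # -BN strips headers so the output is easy to compare across hosts
    /usr/local/sbin/mysql.py -BN -h "$host" \
        -e "SHOW VARIABLES LIKE 'rpl_semi_sync_%_enabled'"
done
```

Checking both the master and slave variables in one pass makes role mismatches (a host exposing only one of the two) stand out immediately.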
[07:50:05] check the default values for those roles in puppet
[07:50:11] (actually I am playing with the output of ./software/dbtools/section)
[07:52:59] yeah, but that includes all the hosts in a section, no matter if they are sanitarium or labs or test hosts
[07:54:40] ```for i in 080 083 089 106 114 119; do /usr/local/sbin/mysql.py -BN -h db1$i -e "SHOW VARIABLES LIKE 'rpl_semi_sync_slave_enabled'"; done```
[07:54:49] that works
[07:56:40] "that includes all the hosts in a section" more like, that includes a full replica set, independently of the role
[07:56:50] it was done with schema changes in mind
[07:57:09] I can add "groups" of servers if that is helpful
[07:57:21] it could be
[07:57:46] for example, s1 will include labsdb1009/10/11
[07:57:53] that is exactly what I said ;)
[07:58:08] which should have a different configuration
[07:58:14] All hosts == all replicas
[07:59:02] so s1 is the name of a core section, but also the replica set of all servers replicating enwiki, even those outside of core
[08:00:03] We are talking about the same thing anyway, it is a wording thing
[08:09:14] I am done; from s1 to s8 all semi-sync slaves are good there
[08:11:11] great!
[08:11:47] changing subject, what's the plan with the logrotate for pt-kill?
[08:13:27] 1, I'll 'ack' the string 'logrotate' in the puppet repo
[08:13:44] 2, I'll 'find . | grep logrotate' in the puppet repo
[08:14:02] probably we have a custom logrotate setup
[08:14:35] then I'll use my vm to check that logrotate works with wmf-pt-kill (reloads the daemon, things like that)
[08:15:27] then I start implementing it based on a previously created logrotate config (maybe that will be the moment when I'll check whether it wouldn't be better to package the logrotate.d/wmf-pt-kill file next to the debian package)
[08:15:32] 3, ??????
[08:15:34] 4, profit
[08:15:58] something like that marostegui: ^
[08:16:03] create a ticket for that if you start working on that
[08:16:10] yeah
[08:16:12] ok I'll do it
[08:16:15] so you can track work and share things
[08:16:16] and close the other one
[08:16:28] that's why I didn't close the old one yet - not until the new ticket is ready
[08:16:35] what is the status of the pt-kill puppetization
[08:16:47] is it done, except for polish like that?
[08:16:59] yes
[08:17:08] cool, thank you
[08:17:27] it was really fun to build it actually, so np :)
[08:17:43] we may do a productization for other servers at a later time
[08:17:46] but nothing planned
[08:19:53] what's the supposed outcome of the non-
[08:20:11] non-'s' sections with the semi-sync?
[08:21:00] banyek: if you will work on the background on the pt-kill, can you coordinate with papaul for: https://phabricator.wikimedia.org/T205257#4637549 ? We'd need to know if that BBU is usable or what's the deal with it
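A minimal sketch of what the logrotate.d entry for wmf-pt-kill discussed above might end up looking like; the log path, the rotation policy and the systemd unit name are all assumptions, not taken from the actual puppetization or the debian package:

```
# Hypothetical /etc/logrotate.d/wmf-pt-kill - path and values are assumptions
/var/log/wmf-pt-kill/wmf-pt-kill.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # make the daemon reopen its log file; assumes a systemd unit
        # named wmf-pt-kill that supports reload
        systemctl reload wmf-pt-kill >/dev/null 2>&1 || true
    endscript
}
```

Whether the file ships with the debian package or is managed as a puppet file resource is exactly the open question from the plan above.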
[08:21:19] banyek: Take a look and let me know if you have doubts about the plan
[08:21:24] banyek: check puppet - as far as I know, no servers other than core use semisync
[08:32:51] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Banyek) a:03Banyek @Papaul or you can coordinate with me, I'll be here all week
[08:33:55] banyek: you got what I wanted to do?
[09:02:18] marostegui: to put the spare BBU into db2042 and see what happens, whether it will work or not
[09:02:38] yep, that's it
[09:02:46] have you checked the role for db2042?
[09:02:47] I spent the last 30 minutes on the UPS site, it seems I won't get my work computer today
[09:03:00] the address is wrong, but I don't have the InfoNotice to change it
[09:03:19] so we wait until the globe turns a bit
[09:03:20] :)
[09:03:22] I guess they will leave it at a UPS pickup point?
[09:05:03] I am not sure if there's any
[09:06:13] I registered, and if I get that code I can change the address, so no worries - until I get an adapter for the monitor I can't use it anyway
[09:06:26] yeah
[09:06:27] too bad
[09:07:45] so, be aware of db2042's role
[09:17:56] I don't see it in the config anywhere
[09:19:30] Do some research about it
[09:19:47] As you will be powering it down, you need to understand what the implications and possible impact are
[09:21:05] the host itself says 'm3' (phabricator)
[09:21:35] and as I see db2078 replicates from it
[09:21:58] right, it has an active replica
[09:22:31] is it active? can it be powered down any time?
[09:23:42] hmm... reading puppet I'd say 'it can be powered down after failing back to eqiad, and after backups are complete'
[09:24:02] failing what back to eqiad?
[09:24:50] backups: correct, you need to wait for the backups, or run them manually if you power it down while they are supposed to run
[09:26:34] I mean dc switchover
[09:26:44] is db2042 active?
[09:28:32] well it acts as a master for db2078's m3 instance
[09:29:09] Yeah, but m3 also has stuff in eqiad, no?
[09:29:17] Which is the active master for m3?
[09:29:48] but as I see db1117 has phabricator queries
[09:29:54] yep
[09:30:12] db1117 is the active one, so we only need to care about backups
[09:30:18] what does db2078 do then?
[09:30:30] db1117 is active? for what?
[09:32:37] how could you check which is the active master for a misc service?
[09:33:16] I am now checking the dbproxy hieras and there it seems like it's a secondary master
[09:33:32] so which is the active master for m3?
[09:34:05] db1063 based on the proxy config, but I am not sure b/c it doesn't show that way on tendril
[09:34:20] db1063? for m3?
[09:34:36] check again ;)
[09:36:47] https://www.irccloud.com/pastebin/LU6odS5Z/
[09:36:56] BUT I think the port number is the key
[09:37:07] why are you checking dbproxy1001?
[09:38:10] How did you end up checking dbproxy1001 is more the question
[09:38:22] it was the first one I checked from the output of 'ack'-ing db1117 - but now I see that I have to find the correct port number for the m3 instance on db1117 and check that proxy
[09:38:25] where it appears
[09:38:41] So, let's rewind a bit
[09:38:47] stop
[09:38:50] hammertime
[09:38:54] give me a sec
[09:39:04] Do you know how to check which is the active master for a misc service?
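As an aside to the walkthrough that continues below, a couple of quick checks one could run against the host itself (from a maintenance host) while working out whether it serves live traffic; the excluded user names are purely illustrative, and none of this replaces checking puppet, the proxies and tendril:

```
# Who is connected to db2042, and do any sessions look like application
# traffic? (the user filter below is only an example)
/usr/local/sbin/mysql.py -h db2042 -e "SELECT user, host, db, command FROM information_schema.processlist WHERE user NOT IN ('system user', 'root')"

# Which hosts replicate from it? db2078's m3 instance should show up here.
/usr/local/sbin/mysql.py -h db2042 -e "SHOW SLAVE HOSTS"
```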
[09:39:21] I said wait a bit: I am on the trail
[09:40:08] What I am trying to do is understand how you are reasoning, to help you achieve it by yourself instead of me telling you exactly how to do it
[09:40:38] SO
[09:42:27] after checking that the m3 instance of db1117 (/run/mysqld/mysqld.m3.sock) is bound to port 3323, I checked db1117:3323 in puppet. It seems that dbproxy1003 and dbproxy1008 handle it, and their config says that the primary is db1072. I am pretty sure btw that the '3' in 'dbproxy1003' represents m3
[09:42:56] correct
[09:43:01] so why were you checking dbproxy1001?
[09:44:54] b/c at first I didn't remember that the 'M' sections are behind proxies, and that was the first element in the list which came up when searching for the string db1117 in the puppet config - puppet is the source of truth, and if I don't know exactly what I am looking for, I have to check file-by-file first
[09:45:22] Do you know the CNAMEs m1-master, m2-master, m3-master etc?
[09:45:49] root@neodymium:~# host m3-master
[09:45:49] m3-master.eqiad.wmnet is an alias for dbproxy1003.eqiad.wmnet.
[09:46:11] So we are trying to determine if db2042 is the active master or which host is the active master
[09:46:14] for m3
[09:46:16] right?
[09:47:35] now as you ask I remember them, but they weren't in my head
[09:47:54] right
[09:47:57] What does tendril say for m3?
[09:48:01] Did you check the tree?
[09:49:57] I checked but the tree does not show the proxy part
[09:50:12] or does it?
[09:50:20] so, according to tendril which is the master for m3?
[09:50:47] db1072
[09:50:55] does that match what the proxy says?
[09:51:12] yes
[09:51:31] then db1072 is the master for m3?
[09:51:35] but the secondary master is db1117:3323 according to the proxy
[09:51:47] that's ok, no? as we are not touching it
[09:52:30] that's ok in this case for us
[09:52:56] so, what's the only thing you need to be aware of when deciding when to power off db2042 then?
[09:53:48] whether the backup is running or not (and when it last ran), I guess. But I am still not sure about the role of db2078 then
[09:54:01] which is replicating from db2042
[09:54:09] for god knows why
[09:54:16] (actually puppet probably knows)
[09:54:19] why do you think it replicates from db2042?
[09:55:33] because that's how it looks on tendril
[09:55:46] and I see that there's a host replicating from db2042
[09:56:01] yeah, but what's the purpose, you think?
[09:56:07] with server_id 180367445
[09:56:33] yes, db2078 replicates from db2042, but why do you think it does?
[09:56:35] if I didn't know that the backups are running from here I'd guess backups
[09:56:53] right, backups don't run there, so why do we have it?
[09:57:48] some kind of archive, maybe?
[09:58:13] What if db2042 dies while you are doing the BBU operation?
[10:00:12] setting up db2078's m3 instance as a replica of db1072.eqiad.wmnet and moving the backups there?
[10:00:32] yep, so why do we have db2078 there then? :)
[10:00:38] as well as the db proxies
[10:00:50] as a 'safety net' host
[10:00:56] There you go
[10:01:00] I wish I had a better word for it
[10:01:00] If eqiad dies completely
[10:01:11] You want to have redundancy in codfw, no? :-)
[10:01:17] yep
[10:01:27] So that is the idea of having db2078 there as a slave
[10:01:31] To replicate what we have in eqiad
[10:02:05] Wasn't this long exercise better than just me telling you that db2042 is not active and it is only used for backups?
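Condensing the walkthrough above into commands, a rough sketch of how the active master for a misc section can be traced; the haproxy config path and the grep pattern are assumptions about the dbproxy setup, not verified against it:

```
# 1. The mN-master CNAME points at the proxy currently serving the section:
host m3-master.eqiad.wmnet
# -> m3-master.eqiad.wmnet is an alias for dbproxy1003.eqiad.wmnet.

# 2. On that proxy, the backend definition names the primary (and the
#    secondary) server; stock haproxy config path assumed here:
ssh dbproxy1003.eqiad.wmnet "grep -E '^\s*server' /etc/haproxy/haproxy.cfg"

# 3. Cross-check against the replication tree on tendril; both should agree
#    (db1072 as the m3 primary in this case).
```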
[10:02:26] yea
[10:02:41] it was
[10:02:53] So now you have all the information you need to coordinate with papaul and to keep in mind what to do if db2042 never comes back
[10:03:18] And if you still have doubts, ask
[10:03:30] ok
[10:04:04] I guess I don't really have to worry about the binlog file/pos, because of the gtid
[10:04:17] Do you have this repo cloned? https://gerrit.wikimedia.org/r/#/admin/projects/operations/dns
[10:04:23] it can be useful to look for things too
[10:05:01] now I have it
[10:05:09] If I were you, I would stop replication on db2042, grab binlog coordinates in case you need them for db2078
[10:05:36] Before shutting it down, I mean
[10:12:06] 👍
[10:36:35] 10DBA, 10User-Banyek: Solve logrotating on wmf-pt-kill - https://phabricator.wikimedia.org/T206521 (10Banyek)
[10:37:54] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807 (10Banyek)
[10:37:57] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Banyek) 05Open>03Resolved
[10:39:50] I am going to have some lunch
[10:42:24] quick question, is it ok to run the warmup caches script in eqiad right now or is there something going on that you'd prefer not to?
[10:43:00] marostegui, jynus ^^^
[10:43:01] volans: it is ok
[10:43:09] perfect, thanks!
[10:43:24] thank you!
[10:45:11] we were waiting for you to actually do it!
[10:45:18] :)
[11:45:44] 10DBA, 10User-Banyek: Solve logrotating on wmf-pt-kill - https://phabricator.wikimedia.org/T206521 (10Banyek) p:05Triage>03Normal
[13:22:04] I have completed the maintenance steps before and after the switch on the etherpad
[13:23:19] for the wikis move?
[13:24:12] completed as in, filled in the steps and corrected them
[13:24:32] lines 19-25
[13:24:45] checking
[13:24:48] banyek: ^
[13:26:27] jynus: shouldn't we do line 25 before 24?
[13:26:41] sure, I added that
[13:26:54] it wasn't before
[13:26:57] Ah sorry :)
[13:28:18] mmm line 22, that patch is no longer scheduled, at least I cannot see it on deployments
[13:29:00] should we deploy that ourselves after the failover?
[13:29:25] or before, I don't mind
[13:36:26] jynus: do you want me to create a calendar event to grab db1070's position right before the switch so we don't forget?
[13:37:00] no worries, we'll see it in the binlogs
[13:37:37] yeah, I know, I just thought running show master status\G would be easier than digging through the binlogs
[14:25:26] Guys, I have copied our etherpad stuff to the SRE one
[14:25:38] thanks
[14:44:03] heads up: I am finishing the day now and going to get the kids. See you tomorrow, but I'll only show up between 10:30 and 11:00. (If the roof is on fire you can reach me.)
[14:45:33] see you
[15:04:19] 10DBA, 10MediaWiki-Watchlist, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Wikimedia-production-error: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit - https://phabricator.wikimedia.org/T171898 (10kostajh) > That seems like a bug not a feature: a batched DELE...
[15:22:15] 10DBA: Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) - https://phabricator.wikimedia.org/T206204 (10jcrespo) Testing mariabackup, got it compiled so far for buster: ``` # /opt/wmf-mariadb103/bin/mariabackup --help /opt/wmf-mariadb103/bin/mariabackup based on MariaDB ser...
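A sketch of the "stop replication and grab the coordinates before shutting it down" advice above, run against db2042 via the mysql.py wrapper used elsewhere in the log; the output paths are arbitrary, and with GTID in use the file/position pair is mainly a safety net:

```
# Stop replication so the coordinates are stable, then record them.
/usr/local/sbin/mysql.py -h db2042 -e "STOP SLAVE"
/usr/local/sbin/mysql.py -h db2042 -e "SHOW SLAVE STATUS\G" > /tmp/db2042_slave_status.txt

# Its own binlog position and GTID state, in case db2078's m3 instance
# ever needs to be repointed to another master.
/usr/local/sbin/mysql.py -h db2042 -e "SHOW MASTER STATUS\G" > /tmp/db2042_master_status.txt
/usr/local/sbin/mysql.py -h db2042 -e "SELECT @@gtid_current_pos" >> /tmp/db2042_master_status.txt
```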
[15:25:00] 10DBA, 10Operations, 10ops-codfw: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul)
[15:25:26] 10DBA, 10Operations, 10ops-codfw: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul)
[15:41:01] 10DBA: Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) - https://phabricator.wikimedia.org/T206204 (10jcrespo) Basic test (no data) works as xtrabackup, but being linked with the mariadb server, it doesn't crash: ```lines=10 jynus@sangai:/srv/tmp$ sudo /opt/wmf-mariadb103/bi...
[15:45:47] 10DBA: Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) - https://phabricator.wikimedia.org/T206204 (10jcrespo) Basic streaming works, too: ```lines=10 root@sangai:/srv/tmp# /opt/wmf-mariadb103/bin/mariabackup --backup --user=root --stream=xbstream | pigz -c | pv > backu...
[16:17:51] 10DBA, 10Operations, 10ops-codfw: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul)
[16:29:54] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul)
[16:39:39] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) Server is racked in B4 switch port information : asw-b4-codfw ge-4/0/0 IP address: 10.192.16.34
[16:45:22] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) ``` papaul@asw-b-codfw> show interfaces ge-4/0/0 descriptions Interface Admin Link Description ge-4/0/0 up up db2096 ``` ``` in...
[16:47:52] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul)
[17:26:16] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Cmjohnson) @Marostegui disk swapped
[17:32:14] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Cmjohnson) The battery was sent to our old office address in San Francisco, they are shipping a new battery...because it's a battery it has to go ground and will...
[17:38:50] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Cmjohnson) it is a new disk...trying it again
[17:39:15] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Cmjohnson) new disk...trying it again
[18:01:14] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui)
[18:15:52] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Marostegui) Failed: ``` PD: 1 Information Enclosure Device ID: 32 Slot Number: 7 Drive's position: DiskGroup: 0, Span: 1, Arm: 1 Enclosure position: 1 Device Id: 7 WWN: 5000C50070CACB6C Sequence Number: 2...
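For context on the streamed mariabackup test quoted above, a sketch of what the receiving/restore side could look like; the directory and file names are assumptions, and the flags are the standard mariabackup/mbstream ones rather than anything taken from the task:

```
# Unpack the compressed xbstream into an empty target directory.
mkdir -p /srv/restore
pigz -dc backup.gz | mbstream -x -C /srv/restore

# Apply the redo log so the datadir is consistent before it can be used.
mariabackup --prepare --target-dir=/srv/restore
```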
[18:30:00] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul)
[22:28:14] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul)
[22:33:00] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Papaul) a:05Papaul>03Marostegui @Marostegui All yours.
[22:41:45] 10DBA, 10JADE, 10Operations, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) The fully implemented secondary schema is ready for #techcom review: https://gerrit.wikimedia.org...