[05:11:30] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui)
[05:19:41] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10Marostegui) This got fixed upstream. ``` root@db1089:~# mysqlbinlog --help | grep ssl --ssl Enable SSL for connection (automatically enabled with --ssl-ca=name C...
[05:22:03] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10Marostegui) There is no need to use `--defaults-config=/etc/.my.cnf ` or ` --no-defaults` since 10.0.27.
[06:42:43] 10DBA, 10Growth-Team, 10Operations, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) > would you have time to chat on IRC some time today / this week / next week (or the week after Let's...
[06:45:48] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10jcrespo) What about 10.1, do you know when it was fixed there?
[06:55:38] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10jcrespo) 05Open>03Resolved a:03Marostegui In any case, it is fixed on the latest versions: ``` root@neodymium:~$ /opt/wmf-mariadb101-client/bin/mysqlbinlog --help | grep ssl --...
[06:56:43] no idea when it was fixed in 10.1
[07:03:12] should we do the maintenance?
[07:04:05] sure!
[07:04:10] downtime will be active until 10UTC
[07:04:29] and no more lag on dbstore1002 (I think)
[07:05:03] no more lag, confirmed
[07:05:06] s3, s5 and x1 are good
[07:08:02] do I do it or do you, and the other checks?
[07:08:15] up to you
[07:08:36] I can do the checks if you want
[07:08:40] And be the other pair of eyes
[07:08:44] I don't mind either way
[07:09:11] ok, so the plan is to stop s3, s5 and x1 on eqiad
[07:10:10] move the master
[07:10:38] yep
[07:10:54] should we restart replication at that point and only stop the sanitariums and dbstore1002
[07:11:06] or keep everything stopped
[07:11:17] ah, we need x1 stopped too for dbstore1002
[07:11:46] so, let's do line 7,8 and 9 no?
[07:12:14] yes, but after 9 we could maybe start replication on masters?
[07:12:22] I am asking
[07:12:38] as we can stop only certain replicas in sync
[07:12:40] but we have to remove the labs filters first
[07:12:43] yes
[07:12:58] do we?
[07:13:01] I would start replication at the end
[07:13:05] ok
[07:13:09] Just to be safe
[07:13:10] No?
[07:13:27] I don't know, but I agree with your suggestion, so let's do it like that
[07:13:31] :-)
[07:13:35] less complication
[07:13:42] yeah, we start replication at line 15
[07:13:45] according to that plan
[07:14:23] ok, I will stop replication when logged
[07:14:33] and then help me check everything is in sync
[07:14:55] good
[07:15:25] what is the sanitarium name?
[07:15:32] 1124?
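The plan above leans on MariaDB multi-source replication, where each section (s3, s5, x1) is a named connection that can be stopped and inspected independently. A minimal sketch of the kind of statements involved, assuming the connection names match the section names; hosts and the actual checklist steps are not shown here:

```sql
-- Stop only the named multi-source connections, leaving the rest running.
STOP SLAVE 's3';
STOP SLAVE 's5';
STOP SLAVE 'x1';

-- Check each connection; Relay_Master_Log_File / Exec_Master_Log_Pos are the
-- coordinates to compare across replicas that have to stop "in sync".
SHOW SLAVE 's3' STATUS\G
SHOW SLAVE 's5' STATUS\G
SHOW SLAVE 'x1' STATUS\G
```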
[07:15:33] db1124
[07:15:50] it has s3 and s5
[07:16:03] cool, starting
[07:16:06] great
[07:17:03] and my connection dropping before starting :-)
[07:17:07] haha
[07:17:37] I have 5 seconds of lag
[07:17:58] now I am back
[07:18:06] watch -operations
[07:18:09] ok
[07:20:47] so I have not stopped s3 on db1070 yet
[07:21:09] (as I had to stop db1075 first)
[07:21:21] yep
[07:21:24] makes sense
[07:22:23] they are in sync, but obviously, heartbeat keeps replicating
[07:23:07] can you check sanitarium is also "stopped"
[07:23:08] cool, so now we move db1070:s3 under s3:codfw?
[07:23:09] yes
[07:23:11] let me check
[07:25:57] I can kill heartbeat if that is easier
[07:26:06] No need to
[07:26:08] I am checking binlog
[07:26:10] give me a sec
[07:26:36] s3 is clean, only heartbeat
[07:27:34] s5 is also clean, only heartbeat
[07:27:34] ok, will stop s3 on s5, make the change but not start, only wait for your ok
[07:27:43] cool
[07:27:55] so stopping db1070:s3 no?
[07:28:05] yep
[07:28:49] cool
[07:28:50] stopped
[07:28:58] I made sure they stopped at the same position
[07:29:06] although heartbeat, etc.
[07:29:09] yeah
[07:29:29] but I verified they stopped on the same position on my screen
[07:30:07] So, let's move db1070:s3 under codfw but without starting replication?
[07:30:13] db2043-bin.004605:70533752
[07:30:16] yep
[07:30:18] there^
[07:30:30] cool
[07:30:50] making sure I don't touch the main replication
[07:30:55] yep
[07:31:03] I can check when you are done with it
[07:31:13] just to have 2 more eyes looking at it
[07:31:39] ofc
[07:33:10] I see the changes and the filters are still in place, so all good I think
[07:33:13] check connection, filters and position
[07:33:20] I didn't rest
[07:33:22] *reset
[07:33:22] connection is wrong
[07:33:26] db2043.eqiad.wmnet
[07:33:29] should be codfw
[07:33:31] arg
[07:33:34] thanks
[07:33:40] see why I need someone to check?
[07:33:49] filters are good!
[07:33:59] check again
[07:34:22] looks good now
[07:34:56] so we leave replication stopped now
[07:35:00] next step
[07:35:02] and now stop s3, s5 and x1 on labs
[07:35:04] and dbstore1002
[07:35:27] s3, s5 on labs
[07:35:33] s5, x1 on dbstore1002
[07:35:39] doing
[07:35:41] and s3 on dbstore1002?
[07:35:54] no need?
[07:36:04] or do we?
[07:36:12] I am trying to think why not
[07:36:19] should be the same case as labs no?
[07:36:24] dbstore1002 is replicating from s5 already
[07:36:32] it was imported there
[07:36:37] ah true!
[07:36:38] yes
[07:36:41] we only need to fix x1?
[07:36:43] yep
[07:36:45] correct
[07:36:49] forgot it was reimported there too
[07:37:56] I have stopped s3 and s5 on 3 labsdbs
[07:38:39] and s5 and x1 on dbstore1002
[07:39:09] confirmed stopped
[07:39:20] we can skip for now step 10
[07:39:26] yeah
[07:39:28] can be done at any time
[07:39:34] let's do the breaking things first
[07:39:35] yep, let's go for 12 and 13
[07:39:41] which are the hard ones
[07:40:22] so we need to remove the filters from s5
[07:40:27] and put them on s3, right?
[07:40:36] just that?
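Repointing db1070:s3 under the codfw master while leaving it stopped is a CHANGE MASTER on the named connection. A sketch using the coordinates quoted above; the host name suffix and the remaining connection options (credentials, SSL) are assumptions and omitted:

```sql
-- On db1070: repoint the stopped 's3' connection at the codfw master,
-- at the position both hosts were verified to have stopped on.
STOP SLAVE 's3';
CHANGE MASTER 's3' TO
  MASTER_HOST     = 'db2043.codfw.wmnet',
  MASTER_LOG_FILE = 'db2043-bin.004605',
  MASTER_LOG_POS  = 70533752;

-- Deliberately not started yet: connection, filters and position get a
-- second pair of eyes first.
SHOW SLAVE 's3' STATUS\G
```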
[07:40:51] I believe so
[07:41:22] we need to ignore the changes coming from s3
[07:41:27] and let the ones coming from s5 replicate
[07:41:48] so we have to move the current filters from s5 to s3
[07:41:58] just did that on labsdb1009
[07:42:07] checking
[07:42:27] looks good, s5 is clean and s3 has: Replicate_Wild_Ignore_Table: enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%
[07:42:38] which are the imported wikis
[07:42:46] if it looks good, I will do the same on the other 2
[07:42:51] it looks good
[07:43:39] done
[07:43:49] checking
[07:44:12] looks good
[07:44:24] dbstore1002 remains untouched?
[07:44:33] for now
[07:44:40] dbstore1002 has an ignore on s3
[07:44:42] and s5 clean
[07:44:48] so we are good
[07:44:52] (apart from x1)
[07:45:00] yes, but we need to stop an x1 host in sync
[07:45:06] yeah, we can do that later
[07:45:13] and we could import without everything stopped
[07:45:17] yep
[07:45:47] we can stop dbstore1001?
[07:45:54] for x1?
[07:45:56] and restart replication generally
[07:45:57] yes
[07:46:00] yeah
[07:46:01] dbstore1001:x1
[07:46:07] but we can do that later if you want
[07:46:07] so no production impact
[07:46:20] ah sure
[07:46:20] well, we have to do it before restarting replication
[07:46:24] Yeah, I know what you mean
[07:46:24] yep
[07:46:26] let's do that
[07:46:31] or syncing is harder
[07:46:31] so we can restart it yes
[07:46:32] good idea
[07:46:40] that was my initial suggestion
[07:46:47] not have everything stopped
[07:46:49] Yep, good idea
[07:46:51] Let's do that
[07:46:51] for a long time
[07:46:54] and testing
[07:46:56] all changes
[07:47:26] there are 2 things on dbstore1002
[07:47:34] importing and removing the x1 filters
[07:47:47] I will stop dbstore1001:x1
[07:47:53] good
[07:47:59] start labsdbs
[07:48:08] and finally start all the masters again
[07:48:12] correct
[07:48:21] leaving only dbstore1002 and dbstore1001 stopped partially
[07:48:25] yep
[07:48:34] starting labsdbs
[07:48:40] good
[07:48:41] (s3, s5)
[07:49:34] I see it started on 1009
[07:49:41] nothing broken so far?
[07:49:45] should we maybe start it only on 1009?
[07:49:50] yeah
[07:49:52] and then the masters
[07:49:57] and see what breaks
[07:49:57] we can do that
[07:50:02] let me stop x1 on dbstore1001
[07:50:04] so we can reimport if needed
[07:50:05] cool
[07:51:23] so now I will start all 3 replications @ masters, in any order
[07:51:28] cool
[07:52:02] so far so good on 1009
[07:52:26] we need to disable gtid
[07:52:40] yeah
[07:52:59] I can see cebwiki.revision getting new entries on 1009
[07:53:25] 1009 caught up
[07:53:39] cebwiki.revision table getting new entries
[07:54:42] 1009:s3 acting a bit weird, I guess heartbeat related, sometimes show slave status shows 1200 seconds delay, other times 0
[07:54:46] but so far, replication is good
[07:54:51] any other issue?
[07:54:54] like, it is not broken
[07:54:58] no, just that so far
[07:55:06] yeah, that is the multi-source
[07:55:17] I realized that depending on where it is replicating
[07:55:19] let's leave it a few more minutes I would say
[07:55:29] gtid says one or the other
[07:55:40] yeah, not worrying
[07:55:46] and in this case s3 may catch up faster
[07:55:51] as it is only a partial replication
[07:56:01] so that is why it flaps
[07:56:42] db1070 s5 broken repl
[07:56:57] not recovering the alert?
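Moving the filters from the s5 connection to the s3 connection on the labsdb hosts is a per-connection replication-filter change. A sketch assuming MariaDB's connection-prefixed filter variables (connection_name.replicate_wild_ignore_table), with the wiki list taken from the log above:

```sql
-- On each labsdb host, with both connections stopped:
STOP SLAVE 's3';
STOP SLAVE 's5';

-- Clear the filter on s5 so the moved wikis now replicate from there...
SET GLOBAL s5.replicate_wild_ignore_table = '';
-- ...and ignore them on s3 instead, where they would otherwise arrive twice.
SET GLOBAL s3.replicate_wild_ignore_table =
  'enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%';

-- Verify before anything is restarted.
SHOW SLAVE 's3' STATUS\G
SHOW SLAVE 's5' STATUS\G
```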
[07:57:30] I don't see it live
[07:57:37] ah could be
[07:57:55] let me re-force the recheck
[07:58:18] yeah, it is fine
[07:58:21] yeah, it was the old gtid complaint
[07:58:25] so it was the gtid multisource bug, no?
[07:58:32] not a "bug"
[07:58:41] according to them it is a feature
[07:58:49] but yes, that issue
[07:59:01] we had to disable gtid anyway
[07:59:06] for all masters
[07:59:15] so it's not that we lose anything
[07:59:42] the important question is, is labsdb broken?
[08:00:15] I will do data checks on both s3 and s5 to be 100% sure there was no issue
[08:00:49] oh, I didn't start x1
[08:01:22] nope labs is good
[08:01:28] I am checking revision tables across the imported wikis
[08:01:31] and they are advancing
[08:01:43] I will do a full compare.py with codfw
[08:01:47] anyway
[08:02:01] let's leave it for 30 minutes or something
[08:02:01] we have to be 100% sure, even if I am almost already sure
[08:02:04] of course
[08:02:07] before starting the others
[08:02:08] let me start x1
[08:02:11] good
[08:03:20] I am also checking filters
[08:03:27] triggers
[08:03:36] checking alerts everywhere
[08:03:50] banyek: can you login to cebwiki so I can check if your user gets filtered correctly?
[08:03:52] in case they were downtimed and we didn't notice
[08:04:03] banyek: you can help me with ^
[08:04:27] well, one of the 2 :-)
[08:04:29] sure
[08:05:02] I have to read back, because we were talking with volans, and I don't know what we are talking about
[08:05:28] no problem if you are busy
[08:05:35] just in case you weren't
[08:05:46] we can continue later :)
[08:05:50] volans does the actual work, so I can do this :)
[08:06:31] banyek: no worries, I just created a user and tested it.
[08:06:48] jynus: triggers working fine on cebwiki
[08:06:57] jynus: going to do a data check on s5 on db1124 just to be sure
[08:06:57] alerts are ok
[08:07:10] I am going to reimport x1 for these wikis into dbstore1002
[08:07:52] ok, running a private data check on db1124:s5
[08:08:04] while that runs, I will get a quick tea
[08:08:43] we are technically finished
[08:08:48] just some cleanup
[08:09:39] I checked the user table, and it seems ok
[08:09:45] (labsdb1009)
[08:10:56] cebwiki is not small in x1, 5GB, the others are very small
[08:25:08] The data check finished correctly on s5
[08:25:14] So triggers working well
[08:25:30] I am importing the tables
[08:25:33] will take some time
[08:25:36] cool
[08:25:37] also running compare.py
[08:32:44] I am reviewing db-eqiad.php
[08:32:52] There is nothing really to change I think
[08:32:54] apart from db1092
[08:33:09] Which I merged: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/464753/ earlier today
[08:33:17] what about https://gerrit.wikimedia.org/r/463935 ?
[08:33:28] Yeah, I meant the weights and all that
[08:33:35] I +1ed that one I think already, didn't I?
[08:33:49] let's merge that on monday?
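Disabling GTID on the affected connections (to avoid the multi-source GTID behaviour discussed above) is done per connection with MASTER_USE_GTID. A minimal sketch for one connection; the same would be repeated for the others:

```sql
-- Switch the named connection from GTID back to plain binlog coordinates.
STOP SLAVE 's5';
CHANGE MASTER 's5' TO MASTER_USE_GTID = no;
START SLAVE 's5';

-- Confirm: Using_Gtid should now read "No".
SHOW SLAVE 's5' STATUS\G
```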
[08:33:50] deploying then
[08:34:01] I wanted to leave it during the weekend
[08:34:04] ah sure
[08:34:05] assuming no errors
[08:34:07] to get those
[08:34:09] it should be fine yeah
[08:34:11] worst case scenario
[08:34:19] errors on the read only passive dc
[08:34:22] labsdb1009 replication still good
[08:36:00] volans had a good idea to refactor the wmf-pt-kill from "profile::labs::db::kill_long_running_queries" to "profile::mariadb::kill_long_running_queries"
[08:36:12] it feels right to me
[08:36:25] I suggested the shorter ::wmf_pt_kill, but up to you :)
[08:36:29] it should be in reality in a module
[08:36:32] a real one
[08:36:44] but I left it for now inside labsdb
[08:36:51] until tested it can be added somewhere else
[08:37:10] so it should be inside mariadb or on its own module
[08:37:24] as we shouldn't add profiles inside other profiles
[08:37:47] jynus: it will be its own profile, not inside another one
[08:38:00] no, its own, top-level module
[08:38:23] at the moment we are using ::profile inside a profile, which is not ok
[08:38:45] I didn't get your last line
[08:38:48] modules/wmf-pt-kill
[08:38:50] or
[08:39:03] modules/mariadb/manifests/wmf-pt-kill
[08:39:28] .pp
[08:40:01] modules/profile/foo is perfectly valid and part of the role/profile paradigm
[08:40:03] then a very simple profile
[08:40:10] it is not ok, see the contents
[08:40:18] (I know because I wrote it)
[08:40:25] :-P
[08:41:49] now I'm totally lost :)
[08:42:30] so profile::mariadb::kill_long_running_queries has to happen
[08:42:50] but not now, once we check the functionality works outside of labs (and it is useful)
[08:43:30] but once it happens, a top level module pt-wmf-kill or part of a module (mariadb) has to contain the main, generic functionality
[08:43:43] what's the problem of giving the proper name to a module based on what it does? where you use it is up to you
[08:43:49] that thing is already generic
[08:44:00] not yet
[08:44:00] (or will be with the next patch)
[08:44:09] reads variables from hiera and uses them for the config file
[08:44:13] has nothing labs-specific
[08:44:14] for example, the package
[08:44:22] has labs specific uses
[08:44:31] it cannot be used at the moment outside of labs
[08:44:34] and that is ok
[08:44:40] for the scope wanted
[08:44:58] what do you mean the package has labs-specific things?
[08:45:15] imagine I wanted to use that on production?
[08:45:25] that package would have to be rebuilt
[08:45:32] it is not helpful there
[08:45:35] at the moment
[08:45:49] and that is ok, it is not in scope to create a generic killer
[08:46:01] hence the profile::labs::db
[08:46:16] it is a labsdb specific solution
[08:47:26] in other words, at the moment it is a labsdb-killer
[08:49:34] banyek: will you do the labsdb1011 depool today too for brooke?
[08:49:43] yes
[08:49:48] cool thanks
[08:49:51] np
[08:50:52] jynus: we were talking about the pt-heartbeat package for the future, do we have a ticket for that (a yes/no is enough, I can create it or search for it)?
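For context on what wmf-pt-kill targets, a hedged sketch of how long-running queries can be spotted from the processlist; the threshold and user filter here are illustrative, not the real configuration:

```sql
-- Queries running longer than 300 seconds from non-system accounts.
SELECT id, user, host, db, time, LEFT(info, 120) AS query_snippet
FROM information_schema.PROCESSLIST
WHERE command = 'Query'
  AND time > 300
  AND user NOT IN ('root', 'repl', 'system user')
ORDER BY time DESC;

-- pt-kill then issues KILL <id> (or KILL QUERY <id>) for the matching threads.
```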
[08:51:07] not one for productionization
[08:51:18] but search for pt-heartbeat-wikimedia
[08:51:26] and see the context on how it was created
[08:51:26] ok
[08:51:31] jynus: personally don't agree, the puppet code is generic IMHO, but name it as you want
[08:51:33] and the gerrit patch
[08:51:38] 👍
[08:52:27] volans: I find it sad that you don't believe me when I am saying it is not generic :'-(
[08:52:50] when I am pointing to my own code failures
[08:52:58] I trust you that the package is not generic, I'm saying that the puppet side it is
[08:53:04] so it could be called with a generic name
[08:53:10] ok
[08:53:14] and you know that you have to fix other things before using it generically
[08:53:16] so this is the rationale
[08:53:22] but if you think this might be confusing
[08:53:31] the moment it goes to generic killer
[08:53:31] keep the labs-specific name for now
[08:53:46] analytics (this is an example) will say, hey, let's use it
[08:53:55] and other random people
[08:54:00] incl labs
[08:54:14] it will be confusing
[08:54:19] ok
[08:54:21] it will be made generic
[08:54:26] that is for sure
[08:54:33] but not until the package supports it
[08:55:03] 10DBA, 10Growth-Team, 10Operations, 10StructuredDiscussions, and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Banyek) I'd happily get involved in this
[08:55:20] ack
[08:55:53] it is just a name :-)
[08:56:00] but the current one was on purpose
[08:56:17] "this is for labsdb only"
[08:56:35] don't use outside of it
[08:57:12] I suffer from the context of the code thing you may not :'-(
[09:00:40] 10DBA, 10User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (10Banyek) the compression of the s2 tables finally finished, I'll check the others
[09:02:58] marostegui: restarting dbstores replication
[09:03:06] \o/
[09:03:09] I've removed the x1 filter
[09:03:13] yeah, I was going to ask
[09:03:15] great!
[09:03:29] and you would be ok asking
[09:03:39] I have updated the etherpad
[09:03:43] to mark those as done
[09:05:20] tendril tree view looks funny now
[09:05:45] hahah indeed
[09:06:00] s3 never had so many slaves!
[09:07:01] banyek: after maintenance finishes we can talk if you want about pt-heartbeat-wikimedia context
[09:07:16] but I would put that on a secondary level of priority
[09:07:22] yeah, agreed
[09:07:27] (nice to have, but nothing is on fire)
[09:07:35] but it relates to your work on pt-kill
[09:07:49] I wanted to '
[09:08:34] work on it when everything is quiet, I just wanted to put it on my board, etc. No rush, or something like that
[09:08:42] yes, that is cool
[09:08:52] just setting expectations :-)
[09:09:05] also note that if you wait "when everything is quiet"
[09:09:14] you may never work on it :-)
[09:11:57] we need to check the next steps for s3 after the deploy
[09:15:15] what do you mean?
[09:15:21] like rename tables or long term plans for s3?
[09:16:43] yeah, rename, filters, etc.
[09:16:57] the steps you didn't understand :-)
[09:17:03] hehe
[09:17:39] how is compare.py going?
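Removing the x1 filter on dbstore1002 and restarting its connections follows the same per-connection pattern as the labsdb filter move. A sketch, assuming the filter was a wild-ignore rule on the x1 connection (the exact filter value is not shown in the log):

```sql
-- On dbstore1002, after the x1 tables for the moved wikis have been reimported:
STOP SLAVE 'x1';
SET GLOBAL x1.replicate_wild_ignore_table = '';
START SLAVE 'x1';
START SLAVE 's5';

-- All connections should come back with both threads running.
SHOW ALL SLAVES STATUS\G
```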
[09:17:51] let me see
[09:18:19] I tested page on all dbs
[09:18:25] and finishing revision
[09:18:36] great
[09:18:37] all == the 5 dbs of s3->s5
[09:18:45] then there is the check for dewiki with codfw
[09:18:51] I haven't checked that yet
[09:19:09] cool, labsdb1009 still good
[09:19:16] the package is now completed, before I upload & install it I now create the user for it: https://github.com/wikimedia/puppet/blob/production/modules/role/templates/mariadb/grants/wiki-replicas.sql#L43
[09:19:21] let's wait for those checks to finish before going for 1010 and 1011
[09:19:41] revision also checked on 3 dbs
[09:19:46] all with no differences
[09:20:14] banyek: when creating the user make sure not to replicate that to the binlog, I am just scared of gtid and multisource (even if we don't use gtid)
[09:20:56] `SET SQL_LOG_BIN=OFF` aye
[09:21:16] set session
[09:22:09] it's session by default, but ok, no harm to write it too
[09:22:34] oh
[09:22:53] I tend to play very safe with labsdb hosts, I don't want to end up rebuilding 8T XDD
[09:23:15] true! ok, SET SESSION
[09:23:17] banyek: labsdbs are like dbstores, no hurry, better check and go slow
[09:24:10] see how we double checked every command before
[09:24:31] because humans make mistakes (and not everything can be or should be automated)
[09:24:55] Yeah, and dbstore or labs hosts are quite painful to reclone and/or fix
[09:25:25] and my motto is you break it you fix it :-D
[09:37:22] I disable puppet on the labsdb hosts
[09:37:40] banyek: don't start replication on 1010 and 1011
[09:38:42] I did not want to, just installing the package on one host first (as I have to test it, and if it works, stop the killers running in screen before that)
[09:38:56] sure, just in case
[09:39:14] ok, noted, will. not. touch. replication. :)
[10:24:16] the only errors I saw so far are "Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode"
[10:24:22] which is "normal"
[10:24:50] happens on an active dc
[10:25:04] and more so on the passive due to no cross-dc lag check
[10:26:14] Yeah, so far so good navigating with those wikis
[10:26:16] nothing weird
[10:26:36] I will run the warmup script later
[10:26:39] or on monday
[10:27:33] compare.py no differences, only user on some wikis missing
[10:28:39] \o/
[10:29:06] I am going to run an errand + lunch
[10:29:10] labsdb1009 still all good
[10:29:24] So if you are confident, I am fine with enabling 1010 and 1011 after lunch
[10:29:39] yes
[10:29:46] let me finish user checking
[10:29:53] so I can start checking dewiki
[10:30:11] we could even do it now
[10:30:24] I don't want to leave those hosts lagging too much
[10:30:35] although it is only s5?
[10:30:40] or both?
[10:30:52] both
[10:30:56] s3 and s5 are stopped
[10:30:56] s3 and s5
[10:30:59] lots of wikis
[10:31:11] on 1010 and 1011
[10:31:30] about to finish user
[10:31:34] Great
[10:31:57] I think we are looking good, feel free to start replication if nothing arises
[10:32:20] these wikis have a large text, but a very small user table
[10:32:30] and user is the same, too
[10:32:41] will start them
[10:32:43] great!
[10:32:44] yeah
[10:33:25] Errand+lunch time for me
[10:37:28] banyek: does starting replication on labsdb10/11 affect you in your work?
[10:37:53] no, not at all
[10:38:20] ok, so I will log and start it
[10:42:55] ok then, the changes are all right with wmf-pt-kill, I tested them, and they work.
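Creating the wmf-pt-kill user without writing it to the binlog, as discussed above, only needs the session-scoped sql_log_bin toggle. The user name, password and grants below are placeholders; the real definition lives in the linked wiki-replicas.sql grants template:

```sql
-- Session-scoped, so only this connection skips binary logging.
SET SESSION sql_log_bin = 0;

-- Placeholder user and grants, for illustration only.
CREATE USER 'wmf-pt-kill'@'localhost' IDENTIFIED BY '********';
GRANT PROCESS, SUPER ON *.* TO 'wmf-pt-kill'@'localhost';

SET SESSION sql_log_bin = 1;
```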
[10:43:11] Now I deploy them on all labsdb hosts, but I keep the service stopped
[10:43:48] and let the killers run in screen
[10:44:27] yeah, the bug on pt-kill was subtle
[10:44:30] I'll resume with the service after lunch
[10:44:51] (not killing prepared statements)
[10:44:58] so better be careful
[10:45:02] thanks, banyek!
[10:46:22] ok I will be!
[10:46:41] replication looking good on labsdb1010
[10:50:37] I go now
[10:50:42] will be back!
[11:12:21] db1072 and db1073 now have failed disks
[11:17:18] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) So the wikis have been loaded into s5, and they are the primary place to read them (and eventually, write them), the only think pending is, some...
[11:25:27] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) This has to be done https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replica_DNS **after **the dblists are updated (without an...
[12:26:47] jynus: Yeah, I was kinda expecting that after a reboot
[12:26:55] really?
[12:26:56] I am going to check if they have more disks about to fail
[12:26:58] it explains it
[12:27:05] I did for 73
[12:27:12] Aaaand?
[12:27:29] all others clean except 1 with 3 media errors
[12:27:36] the failed one has 27 media errors
[12:27:44] I will check db1072
[12:27:47] the one with three, not smart-errored
[12:27:52] sorry
[12:27:55] I meant 72
[12:27:57] ah
[12:27:57] I checked 72
[12:27:57] haha
[12:28:02] I will check 73
[12:28:13] they are not confusing
[12:28:44] 73 is clean
[12:28:59] ah no
[12:29:00] I did `megacli -PDList -aALL | grep rro`
[12:29:01] not that clean
[12:29:28] https://phabricator.wikimedia.org/P7641
[12:29:36] we changed 67 recently
[12:29:46] the one with 25 is the one that failed
[12:29:54] ok
[12:30:10] Right now we have 72, 73 and 64 with failed disks
[12:31:07] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui)
[12:31:41] I guess we'd need to buy more 600GB disks
[12:31:47] Let's ask chris
[12:31:59] I mention that on the task my passing
[12:32:02] *mentioned
[12:32:05] *by
[12:32:12] let's see what he says
[12:32:28] Ah sorry
[12:32:29] but maybe we should accelerate the replacement
[12:32:30] I missed that
[12:32:31] Sorry
[13:23:11] for the switchdc, where is tendril now? I see that role::mariadb::misc::tendril returns only db2093
[13:23:58] because db1115 I believe also got another role
[13:25:02] node 'db1115.eqiad.wmnet' {
[13:25:03] role(mariadb::misc::tendril_and_zarcillo)
[13:25:13] oh
[13:25:18] I broke that
[13:26:47] volans: can we do a dry run, we had some doubts about the mysql read only steps being correctly identified
[13:26:54] and that^
[13:26:59] no prob at all, just tell me the correct one for tendril and I'll update it
[13:27:08] I was running dry-runs now and found it
[13:27:09] well, it is db1115
[13:27:23] but it is based on a hiera key, I guess
[13:28:22] did you search for a particular role?
[13:28:31] maybe that should switch to search for a particular profile
[13:28:42] the previous one was 'P{O:mariadb::misc::tendril} and A:eqiad'
[13:28:48] so already not perfect
[13:28:58] because tendril was not to be migrated to codfw etc...
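A quick way to confirm a multi-source replica such as labsdb1010 is healthy, and to read per-section lag without the Seconds_Behind_Master flapping mentioned earlier, is to look at all connections plus the heartbeat table. The heartbeat query assumes the pt-heartbeat-wikimedia table has a shard column, which may differ from the actual schema:

```sql
-- Every connection should show Slave_IO_Running = Yes and Slave_SQL_Running = Yes.
SHOW ALL SLAVES STATUS\G

-- Last heartbeat received per section; comparing it with the current time
-- gives a steadier lag figure than Seconds_Behind_Master on multi-source hosts.
SELECT shard, MAX(ts) AS last_heartbeat
FROM heartbeat.heartbeat
GROUP BY shard;
```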
[13:29:12] profile::mariadb::misc::tendril
[13:29:34] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-update-tendril.py#19
[13:29:47] volans: the problem is tendril functionality was not built to be multi-dc
[13:29:52] zarcillo will be
[13:29:59] I know :)
[13:30:02] and actually it knows the right master on both dbs
[13:30:07] I'll now enable the wmf-pt-kill daemons
[13:30:13] (they're tested)
[13:30:16] can you search for profile::mariadb::misc::tendril and eqiad?
[13:30:31] sure
[13:30:48] or we can set up a CNAME
[13:30:54] whatever is easier
[13:31:19] profile and eqiad is ok for now, I'll send the CR in a minute
[13:31:28] thanks, volans
[13:32:55] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/464818
[13:33:35] and now (or when you can) we can check the read-only steps
[13:33:39] you mentioned above
[13:41:41] can I give constructive feedback?
[13:41:48] not actionable for now
[13:42:21] sure
[13:42:35] 'P{P:' and A: are not legible
[13:42:41] maybe nice to write
[13:43:09] but I would prefer some Puppet/profile, and you can still keep those as aliases
[13:43:24] but not something for now
[13:43:26] I +1
[13:43:45] maybe for you, who are working on that every day, it is easy
[13:43:48] we could add a longer version in cumin, I agree that P{P: is a bit of an unfortunate combination
[13:43:58] but I am just assuming those are right, but I really don't know
[13:44:12] and if it was 1 letter
[13:44:18] but I guess there are a lot of those
[13:44:40] I don't mind having those
[13:44:48] but in code it is not clear
[13:44:51] P{} is for the puppet backend, P: is for profile:: basically
[13:44:55] agree
[13:44:58] not confusing :-)
[13:45:22] e.g. how much we save with profile
[13:45:27] I opened a change for enabling the wmf-pt-kill, but even if you +1 it today, I'll only merge it on Monday, just to make sure it will be tracked
[13:45:31] vs writing the whole stuff
[13:46:01] banyek: we trust you with that, being responsible, the actual enabling is ok
[13:46:14] next week I mean
[13:46:15] banyek: I commented on the patch
[13:46:29] did you leave puppet on, or is it disabled?
[13:49:21] I think puppet is enabled, which was my only concern
[13:49:25] puppet is on now, but the service is in the `ensure => stopped` state now
[13:49:31] cool
[13:49:44] so not to leave puppet disabled over a long time
[13:49:48] thanks, banyek!
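The read-only steps the switchdc dry-run needs to identify roughly boil down to checking and flipping the global read_only flag on each core master; the cookbook automates this, the statements below are just the manual equivalent:

```sql
-- Current state on a master.
SELECT @@global.read_only;

-- Masters in the DC being switched away from are set read-only first...
SET GLOBAL read_only = 1;
-- ...and the masters in the newly active DC are made writable afterwards.
SET GLOBAL read_only = 0;
```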
[13:49:58] (^_^) :)
[13:50:25] I go now for a 1-on-1 with gehel, b/c next week I'll skip a puppet talk
[13:50:45] ah, cool
[14:05:09] lol volans CORE_SECTIONS = ('s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 'x1', 'es2', 'es3')
[14:05:38] consejos vendo y para mí no tengo :-D
[14:05:51] XDDDDDDDDDD
[14:07:08] to add the checks that we're getting the right number of hosts and such at least I need to know the number beforehand, not from the puppetdb query ;)
[14:07:32] also needed for checking the lag
[14:07:59] :-P
[14:08:04] I think it will work
[14:08:16] we made some small topology changes
[14:08:24] but as it works with heartbeat
[14:08:27] it will just work
[14:08:33] so it is a good thing we changed it
[14:08:39] less work now
[14:08:48] despite pressuring at the time
[14:08:51] :)
[14:08:59] it paid off
[14:09:09] tell what you want to check or if you need any output to verify from me
[14:09:13] *tell me
[14:09:21] no, the logic looks sane
[14:09:39] if you are going to do another test to migrate to codfw that is ok
[14:09:47] if not, that is ok too
[14:10:00] I wasn't sure what was the logic for sync
[14:10:01] yes I was planning on monday morning given it's friday afternoon :)
[14:10:08] yeah, that would be enough
[14:10:19] to do the confusing-name-inverted-live-test :D
[14:10:20] I would only have one request
[14:10:23] sure
[14:10:40] actually, forget it
[14:10:47] no need for anything from you
[14:11:03] * volans stack.pop() :)
[14:11:04] we just need to prepare for a potential switch back aside
[14:11:25] we will talk about that on monday, marostegui
[14:11:38] renames + filters
[14:11:42] ep
[14:11:44] yep
[14:11:49] but that is not switchdc related
[14:11:58] it will happen a day or some hours before
[14:12:24] and we promise to alex to have the switchback prepared too
[14:12:36] *promised
[14:12:39] well, I did
[14:13:50] jynus: I have updated the checklist on the etherpad, I mean all the DONE points
[14:13:53] Give it a look later
[14:14:36] thanks
[14:15:04] 10 and 11 done
[15:03:16] I leave now for about 1-1.5 hours, and then pool back the labsdb1011 host for bstorm
[15:03:28] (If she'll complete then)
[16:31:58] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Cmjohnson) Failed disk has been swapped out
[16:35:31] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Cmjohnson) Failed disk has been swapped out
[16:40:12] 10DBA, 10Operations, 10ops-eqiad: db1064 has disk smart error - https://phabricator.wikimedia.org/T206245 (10Cmjohnson) Swapped the failed disk
[16:46:55] it's weird
[16:47:13] we have depooled labsdb1011 from dbproxy1010
[16:47:26] but connections are showing up
[16:47:32] not a lot
[16:47:37] but rarely one
[16:49:31] https://www.irccloud.com/pastebin/x5ZngjRV/
[20:07:55] all the labsdb hosts are back to action, now I leave. bye
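The lingering connections on labsdb1011 after the dbproxy1010 depool can be inspected from the processlist; a depool only stops new connections through the proxy, while sessions opened before it stay until they disconnect. A sketch, grouping by client address (for proxied traffic that address would be the proxy's own IP):

```sql
-- On labsdb1011: who is still connected, and from where.
SELECT user,
       SUBSTRING_INDEX(host, ':', 1) AS client,
       COUNT(*)                      AS connections
FROM information_schema.PROCESSLIST
GROUP BY user, client
ORDER BY connections DESC;
```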