[05:11:30] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui)
[05:19:41] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10Marostegui) This got fixed upstream. ``` root@db1089:~# mysqlbinlog --help | grep ssl --ssl Enable SSL for connection (automatically enabled with --ssl-ca=name C...
[05:22:03] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10Marostegui) There is no need to use `--defaults-config=/etc/.my.cnf ` or ` --no-defaults` since 10.0.27.
[06:42:43] 10DBA, 10Growth-Team, 10Operations, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) > would you have time to chat on IRC some time today / this week / next week (or the week after Let's...
[06:45:48] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10jcrespo) What about 10.1, do you know when it was fixed there?
[06:55:38] 10DBA, 10Upstream: mysqlbinlog doesn't recognize the ssl-* options in [client] - https://phabricator.wikimedia.org/T127363 (10jcrespo) 05Open>03Resolved a:03Marostegui In any case, it is fixed on the latest versions: ``` root@neodymium:~$ /opt/wmf-mariadb101-client/bin/mysqlbinlog --help | grep ssl --...
[06:56:43] no idea when it was fixed in 10.1
[07:03:12] should we do the maintenance?
[07:04:05] sure!
[07:04:10] downtime will be active until 10UTC
[07:04:29] and no more lag on dbstore1002 (I think)
[07:05:03] no more lag, confirmed
[07:05:06] s3, s5 and x1 are good
[07:08:02] do I do it or do you, and the other checks?
[07:08:15] up to you
[07:08:36] I can do the checks if you want
[07:08:40] And be the other pair of eyes
[07:08:44] I don't mind either way
[07:09:11] ok, so the plan is to stop s3, s5 and x1 on eqiad
[07:10:10] move the master
[07:10:38] yep
[07:10:54] should we restart replication at that point and only stop the sanitariums and dbstore1002
[07:11:06] or keep everything stopped
[07:11:17] ah, we need x1 stopped too for dbstore1002
[07:11:46] so, let's do line 7,8 and 9 no?
[07:12:14] yes, but after 9 we could maybe start replication on masters?
[07:12:22] I am asking
[07:12:38] as we can stop only certain replicas in sync
[07:12:40] but we have to remove the labs filters first
[07:12:43] yes
[07:12:58] do we?
[07:13:01] I would start replication at the end
[07:13:05] ok
[07:13:09] Just to be safe
[07:13:10] No?
[07:13:27] I don't know, but I agree with your suggestion, so let's do it like that
[07:13:31] :-)
[07:13:35] less complication
[07:13:42] yeah, we start replication at line 15
[07:13:45] according to that plan
[07:14:23] ok, I will stop replication when logged
[07:14:33] and then help me check everything is in sync
[07:14:55] good
[07:15:25] what is the sanitarium name?
[07:15:32] 1124?
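The plan above leans on MariaDB multi-source replication, where each section (s3, s5, x1) is a named connection that can be stopped and inspected independently. A minimal sketch of the kind of statements involved, assuming the connection names match the section names; hosts and the actual checklist steps are not shown here:

```sql
-- Stop only the named multi-source connections, leaving the rest running.
STOP SLAVE 's3';
STOP SLAVE 's5';
STOP SLAVE 'x1';

-- Check each connection; Relay_Master_Log_File / Exec_Master_Log_Pos are the
-- coordinates to compare across replicas that have to stop "in sync".
SHOW SLAVE 's3' STATUS\G
SHOW SLAVE 's5' STATUS\G
SHOW SLAVE 'x1' STATUS\G
```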
[07:15:33] db1124
[07:15:50] it has s3 and s5
[07:16:03] cool, starting
[07:16:06] great
[07:17:03] and my connection dropping before starting :-)
[07:17:07] haha
[07:17:37] I have 5 seconds of lag
[07:17:58] now I am back
[07:18:06] watch -operations
[07:18:09] ok
[07:20:47] so I have not stopped s3 on db1070 yet
[07:21:09] (as I had to stop db1075 first)
[07:21:21] yep
[07:21:24] makes sense
[07:22:23] they are in sync, but obviously, heartbeat keeps replicating
[07:23:07] can you check sanitarium is also "stopped"
[07:23:08] cool, so now we move db1070:s3 under s3:codfw?
[07:23:09] yes
[07:23:11] let me check
[07:25:57] I can kill heartbeat if that is easier
[07:26:06] No need to
[07:26:08] I am checking binlog
[07:26:10] give me a sec
[07:26:36] s3 is clean, only heartbeat
[07:27:34] s5 is also clean, only heartbeat
[07:27:34] ok, will stop s3 on s5, make the change but not start, only wait for your ok
[07:27:43] cool
[07:27:55] so stopping db1070:s3 no?
[07:28:05] yep
[07:28:49] cool
[07:28:50] stopped
[07:28:58] I made sure they stopped at the same position
[07:29:06] although heartbeat, etc.
[07:29:09] yeah
[07:29:29] but I verified they stopped on the same position on my screen
[07:30:07] So, let's move db1070:s3 under codfw but without starting replication?
[07:30:13] db2043-bin.004605:70533752
[07:30:16] yep
[07:30:18] there^
[07:30:30] cool
[07:30:50] making sure I don't touch the main replication
[07:30:55] yep
[07:31:03] I can check when you are done with it
[07:31:13] just to have 2 more eyes looking at it
[07:31:39] ofc
[07:33:10] I see the changes and the filters are still in place, so all good I think
[07:33:13] check connection, filters and position
[07:33:20] I didn't rest
[07:33:22] *reset
[07:33:22] connection is wrong
[07:33:26] db2043.eqiad.wmnet
[07:33:29] should be codfw
[07:33:31] arg
[07:33:34] thanks
[07:33:40] see why I need someone to check?
[07:33:49] filters are good!
[07:33:59] check again
[07:34:22] looks good now
[07:34:56] so we leave replication stopped now
[07:35:00] next step
[07:35:02] and now stop s3, s5 and x1 on labs
[07:35:04] and dbstore1002
[07:35:27] s3, s5 on labs
[07:35:33] s5, x1 on dbstore1002
[07:35:39] doing
[07:35:41] and s3 on dbstore1002?
[07:35:54] no need?
[07:36:04] or do we?
[07:36:12] I am trying to think why not
[07:36:19] should be the same case as labs no?
[07:36:24] dbstore1002 is replicating from s5 already
[07:36:32] it was imported there
[07:36:37] ah true!
[07:36:38] yes
[07:36:41] we only need to fix x1?
[07:36:43] yep
[07:36:45] correct
[07:36:49] forgot it was reimported there too
[07:37:56] I have stopped s3 and s5 on 3 labsdbs
[07:38:39] and s5 and x1 on dbstore1002
[07:39:09] confirmed stopped
[07:39:20] we can skip for now step 10
[07:39:26] yeah
[07:39:28] can be done at any time
[07:39:34] let's do the breaking things first
[07:39:35] yep, let's go for 12 and 13
[07:39:41] which are the hard ones
[07:40:22] so we need to remove the filters from s5
[07:40:27] and put them on s3, right?
[07:40:36] just that?
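Repointing db1070:s3 under the codfw master while leaving it stopped is a CHANGE MASTER on the named connection. A sketch using the coordinates quoted above; the host name suffix and the remaining connection options (credentials, SSL) are assumptions and omitted:

```sql
-- On db1070: repoint the stopped 's3' connection at the codfw master,
-- at the position both hosts were verified to have stopped on.
STOP SLAVE 's3';
CHANGE MASTER 's3' TO
  MASTER_HOST     = 'db2043.codfw.wmnet',
  MASTER_LOG_FILE = 'db2043-bin.004605',
  MASTER_LOG_POS  = 70533752;

-- Deliberately not started yet: connection, filters and position get a
-- second pair of eyes first.
SHOW SLAVE 's3' STATUS\G
```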
[07:40:51] I believe so
[07:41:22] we need to ignore the changes coming from s3
[07:41:27] and let the ones coming from s5 replicate
[07:41:48] so we have to move the current filters from s5 to s3
[07:41:58] just did that on labsdb1009
[07:42:07] checking
[07:42:27] looks good, s5 is clean and s3 has: Replicate_Wild_Ignore_Table: enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%
[07:42:38] which are the imported wikis
[07:42:46] if it looks good, I will do the same on the other 2
[07:42:51] it looks good
[07:43:39] done
[07:43:49] checking
[07:44:12] looks good
[07:44:24] dbstore1002 remains untouched?
[07:44:33] for now
[07:44:40] dbstore1002 has an ignore on s3
[07:44:42] and s5 clean
[07:44:48] so we are good
[07:44:52] (apart from x1)
[07:45:00] yes, but we need to stop an x1 host in sync
[07:45:06] yeah, we can do that later
[07:45:13] and we could import without everything stopped
[07:45:17] yep
[07:45:47] we can stop dbstore1001?
[07:45:54] for x1?
[07:45:56] and restart replication generally
[07:45:57] yes
[07:46:00] yeah
[07:46:01] dbstore1001:x1
[07:46:07] but we can do that later if you want
[07:46:07] so no production impact
[07:46:20] ah sure
[07:46:20] well, we have to do it before restarting replication
[07:46:24] Yeah, I know what you mean
[07:46:24] yep
[07:46:26] let's do that
[07:46:31] or syncing is harder
[07:46:31] so we can restart it yes
[07:46:32] good idea
[07:46:40] that was my initial suggestion
[07:46:47] not have everything stopped
[07:46:49] Yep, good idea
[07:46:51] Let's do that
[07:46:51] for a long time
[07:46:54] and testing
[07:46:56] all changes
[07:47:26] there are 2 things on dbstore1002
[07:47:34] importing and removing the x1 filters
[07:47:47] I will stop dbstore1001:x1
[07:47:53] good
[07:47:59] start labsdbs
[07:48:08] and finally start all the masters again
[07:48:12] correct
[07:48:21] leaving only dbstore1002 and dbstore1001 stopped partially
[07:48:25] yep
[07:48:34] starting labsdbs
[07:48:40] good
[07:48:41] (s3, s5)
[07:49:34] I see it started on 1009
[07:49:41] nothing broken so far?
[07:49:45] should we maybe start it only on 1009?
[07:49:50] yeah
[07:49:52] and then the masters
[07:49:57] and see what breaks
[07:49:57] we can do that
[07:50:02] let me stop x1 on dbstore1001
[07:50:04] so we can reimport if needed
[07:50:05] cool
[07:51:23] so now I will start all 3 replications @ masters, in any order
[07:51:28] cool
[07:52:02] so far so good on 1009
[07:52:26] we need to disable gtid
[07:52:40] yeah
[07:52:59] I can see cebwiki.revision getting new entries on 1009
[07:53:25] 1009 caught up
[07:53:39] cebwiki.revision table getting new entries
[07:54:42] 1009:s3 acting a bit weird, I guess heartbeat related, sometimes show slave status shows 1200 seconds delay, other times 0
[07:54:46] but so far, replication is good
[07:54:51] any other issue?
[07:54:54] like, it is not broken
[07:54:58] no, just that so far
[07:55:06] yeah, that is the multi-source
[07:55:17] I realized that depending on where it is replicating
[07:55:19] let's leave it a few more minutes I would say
[07:55:29] gtid says one or the other
[07:55:40] yeah, not worrying
[07:55:46] and in this case s3 may catch up faster
[07:55:51] as it is only a partial replication
[07:56:01] so that is why it flaps
[07:56:42] db1070 s5 broken repl
[07:56:57] not recovering the alert?
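Moving the filters from the s5 connection to the s3 connection on the labsdb hosts is a per-connection replication-filter change. A sketch assuming MariaDB's connection-prefixed filter variables (connection_name.replicate_wild_ignore_table), with the wiki list taken from the log above:

```sql
-- On each labsdb host, with both connections stopped:
STOP SLAVE 's3';
STOP SLAVE 's5';

-- Clear the filter on s5 so the moved wikis now replicate from there...
SET GLOBAL s5.replicate_wild_ignore_table = '';
-- ...and ignore them on s3 instead, where they would otherwise arrive twice.
SET GLOBAL s3.replicate_wild_ignore_table =
  'enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%';

-- Verify before anything is restarted.
SHOW SLAVE 's3' STATUS\G
SHOW SLAVE 's5' STATUS\G
```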
[07:57:30] I don't see it live
[07:57:37] ah could be
[07:57:55] let me re-force the recheck
[07:58:18] yeah, it is fine
[07:58:21] yeah, it was the old gtid complaint
[07:58:25] so it was the gtid multisource bug, no?
[07:58:32] not a "bug"
[07:58:41] according to them it is a feature
[07:58:49] but yes, that issue
[07:59:01] we had to disable gtid anyway
[07:59:06] for all masters
[07:59:15] so it's not that we lose anything
[07:59:42] the important question is, is labsdb broken?
[08:00:15] I will do data checks on both s3 and s5 to be 100% sure there was no issue
[08:00:49] oh, I didn't start x1
[08:01:22] nope labs is good
[08:01:28] I am checking revision tables across the imported wikis
[08:01:31] and they are advancing
[08:01:43] I will do a full compare.py with codfw
[08:01:47] anyway
[08:02:01] let's leave it for 30 minutes or something
[08:02:01] we have to be 100% sure, even if I am almost already sure
[08:02:04] of course
[08:02:07] before starting the others
[08:02:08] let me start x1
[08:02:11] good
[08:03:20] I am also checking filters
[08:03:27] triggers
[08:03:36] checking alerts everywhere
[08:03:50] banyek: can you login to cebwiki so I can check if your user gets filtered correctly?
[08:03:52] in case they were downtimed and we didn't notice
[08:04:03] banyek: you can help me with ^
[08:04:27] well, one of the 2 :-)
[08:04:29] sure
[08:05:02] I have to read back, because we were talking with volans, and I don't know what we are talking about
[08:05:28] no problem if you are busy
[08:05:35] just in case you weren't
[08:05:46] we can continue later :)
[08:05:50] volans does the actual work, so I can do this :)
[08:06:31] banyek: no worries, I just created a user and tested it.
[08:06:48] jynus: triggers working fine on cebwiki
[08:06:57] jynus: going to do a data check on s5 on db1124 just to be sure
[08:06:57] alerts are ok
[08:07:10] I am going to reimport x1 for these wikis into dbstore1002
[08:07:52] ok, running a private data check on db1124:s5
[08:08:04] while that runs, I will get a quick tea
[08:08:43] we are technically finished
[08:08:48] just some cleanup
[08:09:39] I checked the user table, and it seems ok
[08:09:45] (labsdb1009)
[08:10:56] cebwiki is not small in x1, 5GB, the others are very small
[08:25:08] The data check finished correctly on s5
[08:25:14] So triggers working well
[08:25:30] I am importing the tables
[08:25:33] will take some time
[08:25:36] cool
[08:25:37] also running compare.py
[08:32:44] I am reviewing db-eqiad.php
[08:32:52] There is nothing really to change I think
[08:32:54] apart from db1092
[08:33:09] Which I merged: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/464753/ earlier today
[08:33:17] what about https://gerrit.wikimedia.org/r/463935 ?
[08:33:28] Yeah, I meant the weights and all that
[08:33:35] I +1ed that one I think already, didn't I?
[08:33:49] let's merge that on monday?
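Disabling GTID on the affected connections (to avoid the multi-source GTID behaviour discussed above) is done per connection with MASTER_USE_GTID. A minimal sketch for one connection; the same would be repeated for the others:

```sql
-- Switch the named connection from GTID back to plain binlog coordinates.
STOP SLAVE 's5';
CHANGE MASTER 's5' TO MASTER_USE_GTID = no;
START SLAVE 's5';

-- Confirm: Using_Gtid should now read "No".
SHOW SLAVE 's5' STATUS\G
```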
[08:33:50] deploying then
[08:34:01] I wanted to leave it during the weekend
[08:34:04] ah sure
[08:34:05] assuming no errors
[08:34:07] to get those
[08:34:09] it should be fine yeah
[08:34:11] worst case scenario
[08:34:19] errors on the read only passive dc
[08:34:22] labsdb1009 replication still good
[08:36:00] volans had a good idea to refactor the wmf-pt-kill from "profile::labs::db::kill_long_running_queries" to "profile::mariadb::kill_long_running_queries"
[08:36:12] it feels right to me
[08:36:25] I suggested the shorter ::wmf_pt_kill, but up to you :)
[08:36:29] it should be in reality in a module
[08:36:32] a real one
[08:36:44] but I left it for now inside labsdb
[08:36:51] until tested it can be added somewhere else
[08:37:10] so it should be inside mariadb or on its own module
[08:37:24] as we shouldn't add profiles inside other profiles
[08:37:47] jynus: it will be its own profile, not inside another one
[08:38:00] no, its own, top-level module
[08:38:23] at the moment we are using ::profile inside a profile, which is not ok
[08:38:45] I didn't get your last line
[08:38:48] modules/wmf-pt-kill
[08:38:50] or
[08:39:03] modules/mariadb/manifests/wmf-pt-kill
[08:39:28] .pp
[08:40:01] modules/profile/foo is perfectly valid and part of the role/profile paradigm
[08:40:03] then a very simple profile
[08:40:10] it is not ok, see the contents
[08:40:18] (I know because I wrote it)
[08:40:25] :-P
[08:41:49] now I'm totally lost :)
[08:42:30] so profile::mariadb::kill_long_running_queries has to happen
[08:42:50] but not now, once we check the functionality works outside of labs (and it is useful)
[08:43:30] but once it happens, a top level module pt-wmf-kill or part of a module (mariadb) has to contain the main, generic functionality
[08:43:43] what's the problem of giving the proper name to a module based on what it does? where you use it is up to you
[08:43:49] that thing is already generic
[08:44:00] not yet
[08:44:00] (or will be with the next patch)
[08:44:09] reads variables from hiera and uses them for the config file
[08:44:13] has nothing labs-specific
[08:44:14] for example, the package
[08:44:22] has labs specific uses
[08:44:31] it cannot be used at the moment outside of labs
[08:44:34] and that is ok
[08:44:40] for the scope wanted
[08:44:58] what do you mean the package has labs-specific things?
[08:45:15] imagine I wanted to use that on production?
[08:45:25] that package would have to be rebuilt
[08:45:32] it is not helpful there
[08:45:35] at the moment
[08:45:49] and that is ok, it is not in scope to create a generic killer
[08:46:01] hence the profile::labs::db
[08:46:16] it is a labsdb specific solution
[08:47:26] in other words, at the moment it is a labsdb-killer
[08:49:34] banyek: will you do the labsdb1011 depool today too for brooke?
[08:49:43] yes
[08:49:48] cool thanks
[08:49:51] np
[08:50:52] jynus: we were talking about the pt-heartbeat package for the future, do we have a ticket for that (a yes/no is enough, I can create it or search for it)?
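For context on what wmf-pt-kill targets, a hedged sketch of how long-running queries can be spotted from the processlist; the threshold and user filter here are illustrative, not the real configuration:

```sql
-- Queries running longer than 300 seconds from non-system accounts.
SELECT id, user, host, db, time, LEFT(info, 120) AS query_snippet
FROM information_schema.PROCESSLIST
WHERE command = 'Query'
  AND time > 300
  AND user NOT IN ('root', 'repl', 'system user')
ORDER BY time DESC;

-- pt-kill then issues KILL <id> (or KILL QUERY <id>) for the matching threads.
```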
[08:51:07] not one for productionization
[08:51:18] but search for pt-heartbeat-wikimedia
[08:51:26] and see the context on how it was created
[08:51:26] ok
[08:51:31] jynus: personally don't agree, the puppet code is generic IMHO, but name it as you want
[08:51:33] and the gerrit patch
[08:51:38] 👍
[08:52:27] volans: I find it sad that you don't believe me when I am saying it is not generic :'-(
[08:52:50] when I am pointing to my own code failures
[08:52:58] I trust you that the package is not generic, I'm saying that the puppet side it is
[08:53:04] so it could be called with a generic name
[08:53:10] ok
[08:53:14] and you know that you have to fix other things before using it generically
[08:53:16] so this is the rationale
[08:53:22] but if you think this might be confusing
[08:53:31] the moment it goes to generic killer
[08:53:31] keep the labs-specific name for now
[08:53:46] analytics (this is an example) will say, hey, let's use it
[08:53:55] and other random people
[08:54:00] incl labs
[08:54:14] it will be confusing
[08:54:19] ok
[08:54:21] it will be made generic
[08:54:26] that is for sure
[08:54:33] but not until the package supports it
[08:55:03] 10DBA, 10Growth-Team, 10Operations, 10StructuredDiscussions, and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Banyek) I'd happily get involved in this
[08:55:20] ack
[08:55:53] it is just a name :-)
[08:56:00] but the current one was on purpose
[08:56:17] "this is for labsdb only"
[08:56:35] don't use outside of it
[08:57:12] I suffer from the context of the code thing you may not :'-(
[09:00:40] 10DBA, 10User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (10Banyek) the compression of the s2 tables finally finished, I'll check the others
[09:02:58] marostegui: restarting dbstores replication
[09:03:06] \o/
[09:03:09] I've removed the x1 filter
[09:03:13] yeah, I was going to ask
[09:03:15] great!
[09:03:29] and you would be ok asking
[09:03:39] I have updated the etherpad
[09:03:43] to mark those as done
[09:05:20] tendril tree view looks funny now
[09:05:45] hahah indeed
[09:06:00] s3 never had so many slaves!
[09:07:01] banyek: after maintenance finishes we can talk if you want about pt-heartbeat-wikimedia context
[09:07:16] but I would put that on a secondary level of priority
[09:07:22] yeah, agreed
[09:07:27] (nice to have, but nothing is on fire)
[09:07:35] but it relates to your work on pt-kill
[09:07:49] I wanted to '
[09:08:34] work on it when everything is quiet, I just wanted to put it on my board, etc. No rush, or something like that
[09:08:42] yes, that is cool
[09:08:52] just setting expectations :-)
[09:09:05] also note that if you wait "when everything is quiet"
[09:09:14] you may never work on it :-)
[09:11:57] we need to check the next steps for s3 after the deploy
[09:15:15] what do you mean?
[09:15:21] like rename tables or long term plans for s3?
[09:16:43] yeah, rename, filters, etc.
[09:16:57] the steps you didn't understand :-)
[09:17:03] hehe
[09:17:39] how is compare.py going?
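Removing the x1 filter on dbstore1002 and restarting its connections follows the same per-connection pattern as the labsdb filter move. A sketch, assuming the filter was a wild-ignore rule on the x1 connection (the exact filter value is not shown in the log):

```sql
-- On dbstore1002, after the x1 tables for the moved wikis have been reimported:
STOP SLAVE 'x1';
SET GLOBAL x1.replicate_wild_ignore_table = '';
START SLAVE 'x1';
START SLAVE 's5';

-- All connections should come back with both threads running.
SHOW ALL SLAVES STATUS\G
```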
[09:17:51] let me see
[09:18:19] I tested page on all dbs
[09:18:25] and finishing revision
[09:18:36] great
[09:18:37] all == the 5 dbs of s3->s5
[09:18:45] then there is the check for dewiki with codfw
[09:18:51] I haven't checked that yet
[09:19:09] cool, labsdb1009 still good
[09:19:16] the package is now completed, before I upload & install it I now create the user for it: https://github.com/wikimedia/puppet/blob/production/modules/role/templates/mariadb/grants/wiki-replicas.sql#L43
[09:19:21] let's wait for those checks to finish before going for 1010 and 1011
[09:19:41] revision also checked on 3 dbs
[09:19:46] all with no differences
[09:20:14] banyek: when creating the user make sure not to replicate that to the binlog, I am just scared of gtid and multisource (even if we don't use gtid)
[09:20:56] `SET SQL_LOG_BIN=OFF` aye
[09:21:16] set session
[09:22:09] it's session by default, but ok, no harm to write it too
[09:22:34] oh
[09:22:53] I tend to play very safe with labsdb hosts, I don't want to end up rebuilding 8T XDD
[09:23:15] true! ok, SET SESSION
[09:23:17] banyek: labsdbs are like dbstores, no hurry, better check and go slow
[09:24:10] see how we double checked every command before
[09:24:31] because humans make mistakes (and not everything can be or should be automated)
[09:24:55] Yeah, and dbstore or labs hosts are quite painful to reclone and/or fix
[09:25:25] and my motto is you break it you fix it :-D
[09:37:22] I disable puppet on the labsdb hosts
[09:37:40] banyek: don't start replication on 1010 and 1011
[09:38:42] I did not want to, just installing the package on one host first (as I have to test it, and if it works, stop the killers running in screen before that)
[09:38:56] sure, just in case
[09:39:14] ok, noted, will. not. touch. replication. :)
[10:24:16] the only errors I saw so far are "Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode"
[10:24:22] which is "normal"
[10:24:50] happens on an active dc
[10:25:04] and more so on the passive due to no cross-dc lag check
[10:26:14] Yeah, so far so good navigating with those wikis
[10:26:16] nothing weird
[10:26:36] I will run the warmup script later
[10:26:39] or on monday
[10:27:33] compare.py no differences, only user on some wikis missing
[10:28:39] \o/
[10:29:06] I am going to run an errand + lunch
[10:29:10] labsdb1009 still all good
[10:29:24] So if you are confident, I am fine with enabling 1010 and 1011 after lunch
[10:29:39] yes
[10:29:46] let me finish user checking
[10:29:53] so I can start checking dewiki
[10:30:11] we could even do it now
[10:30:24] I don't want to leave those hosts lagging too much
[10:30:35] although it is only s5?
[10:30:40] or both?
[10:30:52] both
[10:30:56] s3 and s5 are stopped
[10:30:56] s3 and s5
[10:30:59] lots of wikis
[10:31:11] on 1010 and 1011
[10:31:30] about to finish user
[10:31:34] Great
[10:31:57] I think we are looking good, feel free to start replication if nothing arises
[10:32:20] these wikis have a large text, but a very small user table
[10:32:30] and user is the same, too
[10:32:41] will start them
[10:32:43] great!
[10:32:44] yeah
[10:33:25] Errand+lunch time for me
[10:37:28] banyek: does starting replication on labsdb10/11 affect you in your work?
[10:37:53] no, not at all
[10:38:20] ok, so I will log and start it
[10:42:55] ok then, the changes are all right with wmf-pt-kill, I tested them, and they work.
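Creating the wmf-pt-kill user without writing it to the binlog, as discussed above, only needs the session-scoped sql_log_bin toggle. The user name, password and grants below are placeholders; the real definition lives in the linked wiki-replicas.sql grants template:

```sql
-- Session-scoped, so only this connection skips binary logging.
SET SESSION sql_log_bin = 0;

-- Placeholder user and grants, for illustration only.
CREATE USER 'wmf-pt-kill'@'localhost' IDENTIFIED BY '********';
GRANT PROCESS, SUPER ON *.* TO 'wmf-pt-kill'@'localhost';

SET SESSION sql_log_bin = 1;
```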
[10:43:11] Now I deploy them on all labsdb hosts, but I keep the service stopped
[10:43:48] and let the killers run in screen
[10:44:27] yeah, the bug on pt-kill was subtle
[10:44:30] I'll resume with the service after lunch
[10:44:51] (not killing prepared statements)
[10:44:58] so better be careful
[10:45:02] thanks, banyek!
[10:46:22] ok I will be!
[10:46:41] replication looking good on labsdb1010
[10:50:37] I go now
[10:50:42] will be back!
[11:12:21] db1072 and db1073 now have failed disks
[11:17:18] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) So the wikis have been loaded into s5, and they are the primary place to read them (and eventually, write them), the only think pending is, some...
[11:25:27] 10DBA, 10Operations, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) This has to be done https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replica_DNS **after **the dblists are updated (without an...
[12:26:47] jynus: Yeah, I was kinda expecting that after a reboot
[12:26:55] really?
[12:26:56] I am going to check if they have more disks about to fail
[12:26:58] it explains it
[12:27:05] I did for 73
[12:27:12] Aaaand?
[12:27:29] all others clean except 1 with 3 media errors
[12:27:36] the failed one has 27 media errors
[12:27:44] I will check db1072
[12:27:47] the one with three, not smart-errored
[12:27:52] sorry
[12:27:55] I meant 72
[12:27:57] ah
[12:27:57] I checked 72
[12:27:57] haha
[12:28:02] I will check 73
[12:28:13] they are not confusing
[12:28:44] 73 is clean
[12:28:59] ah no
[12:29:00] I did `megacli -PDList -aALL | grep rro`
[12:29:01] not that clean
[12:29:28] https://phabricator.wikimedia.org/P7641
[12:29:36] we changed 67 recently
[12:29:46] the one with 25 is the one that failed
[12:29:54] ok
[12:30:10] Right now we have 72, 73 and 64 with failed disks
[12:31:07] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui)
[12:31:41] I guess we'd need to buy more 600GB disks
[12:31:47] Let's ask chris
[12:31:59] I mention that on the task my passing
[12:32:02] *mentioned
[12:32:05] *by
[12:32:12] let's see what he says
[12:32:28] Ah sorry
[12:32:29] but maybe we should accelerate the replacement
[12:32:30] I missed that
[12:32:31] Sorry
[13:23:11] for the switchdc, where is tendril now? I see that role::mariadb::misc::tendril returns only db2093
[13:23:58] because db1115 I believe also got another role
[13:25:02] node 'db1115.eqiad.wmnet' {
[13:25:03] role(mariadb::misc::tendril_and_zarcillo)
[13:25:13] oh
[13:25:18] I broke that
[13:26:47] volans: can we do a dry run, we had some doubts about the mysql read only steps being correctly identified
[13:26:54] and that^
[13:26:59] no prob at all, just tell me the correct one for tendril and I'll update it
[13:27:08] I was running dry-runs now and found it
[13:27:09] well, it is db1115
[13:27:23] but it is based on a hiera key, I guess
[13:28:22] did you search for a particular role?
[13:28:31] maybe that should switch to search for a particular profile
[13:28:42] the previous one was 'P{O:mariadb::misc::tendril} and A:eqiad'
[13:28:48] so already not perfect
[13:28:58] because tendril was not to be migrated to codfw etc...
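A quick way to confirm a multi-source replica such as labsdb1010 is healthy, and to read per-section lag without the Seconds_Behind_Master flapping mentioned earlier, is to look at all connections plus the heartbeat table. The heartbeat query assumes the pt-heartbeat-wikimedia table has a shard column, which may differ from the actual schema:

```sql
-- Every connection should show Slave_IO_Running = Yes and Slave_SQL_Running = Yes.
SHOW ALL SLAVES STATUS\G

-- Last heartbeat received per section; comparing it with the current time
-- gives a steadier lag figure than Seconds_Behind_Master on multi-source hosts.
SELECT shard, MAX(ts) AS last_heartbeat
FROM heartbeat.heartbeat
GROUP BY shard;
```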
[13:29:12] profile::mariadb::misc::tendril
[13:29:34] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-update-tendril.py#19
[13:29:47] volans: the problem is tendril functionality was not built to be multi-dc
[13:29:52] zarcillo will be
[13:29:59] I know :)
[13:30:02] and actually it knows the right master on both dbs
[13:30:07] I'll now enable the wmf-pt-kill daemons
[13:30:13] (they're tested)
[13:30:16] can you search for profile::mariadb::misc::tendril and eqiad?
[13:30:31] sure
[13:30:48] or we can set up a CNAME
[13:30:54] whatever is easier
[13:31:19] profile and eqiad is ok for now, I'll send the CR in a minute
[13:31:28] thanks, volans
[13:32:55] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/464818
[13:33:35] and now (or when you can) we can check the read-only steps
[13:33:39] you mentioned above
[13:41:41] can I give constructive feedback?
[13:41:48] not actionable for now
[13:42:21] sure
[13:42:35] 'P{P:' and A: are not legible
[13:42:41] maybe nice to write
[13:43:09] but I would prefer some Puppet/profile, and you can still keep those as aliases
[13:43:24] but not something for now
[13:43:26] I +1
[13:43:45] maybe for you, who are working on that every day, it is easy
[13:43:48] we could add a longer version in cumin, I agree that P{P: is a bit of an unfortunate combination
[13:43:58] but I am just assuming those are right, but I really don't know
[13:44:12] and if it was 1 letter
[13:44:18] but I guess there are a lot of those
[13:44:40] I don't mind having those
[13:44:48] but in code it is not clear
[13:44:51] P{} is for the puppet backend, P: is for profile:: basically
[13:44:55] agree
[13:44:58] not confusing :-)
[13:45:22] e.g. how much we save with profile
[13:45:27] I opened a change for enabling the wmf-pt-kill, but even if you +1 it today, I'll only merge it on Monday, just to make sure it will be tracked
[13:45:31] vs writing the whole stuff
[13:46:01] banyek: we trust you with that, being responsible, the actual enabling is ok
[13:46:14] next week I mean
[13:46:15] banyek: I commented on the patch
[13:46:29] did you leave puppet on, or is it disabled?
[13:49:21] I think puppet is enabled, which was my only concern
[13:49:25] puppet is on now, but the service is in the `ensure => stopped` state now
[13:49:31] cool
[13:49:44] so not to leave puppet disabled over a long time
[13:49:48] thanks, banyek!
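The read-only steps the switchdc dry-run needs to identify roughly boil down to checking and flipping the global read_only flag on each core master; the cookbook automates this, the statements below are just the manual equivalent:

```sql
-- Current state on a master.
SELECT @@global.read_only;

-- Masters in the DC being switched away from are set read-only first...
SET GLOBAL read_only = 1;
-- ...and the masters in the newly active DC are made writable afterwards.
SET GLOBAL read_only = 0;
```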
[13:49:58] (^_^) :)
[13:50:25] I go now for a 1-on-1 with gehel, b/c next week I'll skip a puppet talk
[13:50:45] ah, cool
[14:05:09] lol volans CORE_SECTIONS = ('s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 'x1', 'es2', 'es3')
[14:05:38] consejos vendo y para mí no tengo :-D
[14:05:51] XDDDDDDDDDD
[14:07:08] to add the checks that we're getting the right number of hosts and such at least I need to know the number beforehand, not from the puppetdb query ;)
[14:07:32] also needed for checking the lag
[14:07:59] :-P
[14:08:04] I think it will work
[14:08:16] we made some small topology changes
[14:08:24] but as it works with heartbeat
[14:08:27] it will just work
[14:08:33] so it is a good thing we changed it
[14:08:39] less work now
[14:08:48] despite pressuring at the time
[14:08:51] :)
[14:08:59] it paid off
[14:09:09] tell what you want to check or if you need any output to verify from me
[14:09:13] *tell me
[14:09:21] no, the logic looks sane
[14:09:39] if you are going to do another test to migrate to codfw that is ok
[14:09:47] if not, that is ok too
[14:10:00] I wasn't sure what was the logic for sync
[14:10:01] yes I was planning on monday morning given it's friday afternoon :)
[14:10:08] yeah, that would be enough
[14:10:19] to do the confusing-name-inverted-live-test :D
[14:10:20] I would only have one request
[14:10:23] sure
[14:10:40] actually, forget it
[14:10:47] no need for anything from you
[14:11:03] * volans stack.pop() :)
[14:11:04] we just need to prepare for a potential switch back aside
[14:11:25] we will talk about that on monday, marostegui
[14:11:38] renames + filters
[14:11:42] ep
[14:11:44] yep
[14:11:49] but that is not switchdc related
[14:11:58] it will happen a day or some hours before
[14:12:24] and we promise to alex to have the switchback prepared too
[14:12:36] *promised
[14:12:39] well, I did
[14:13:50] jynus: I have updated the checklist on the etherpad, I mean all the DONE points
[14:13:53] Give it a look later
[14:14:36] thanks
[14:15:04] 10 and 11 done
[15:03:16] I leave now for about 1-1.5 hours, and then pool back the labsdb1011 host for bstorm
[15:03:28] (If she'll complete then)
[16:31:58] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Cmjohnson) Failed disk has been swapped out
[16:35:31] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Cmjohnson) Failed disk has been swapped out
[16:40:12] 10DBA, 10Operations, 10ops-eqiad: db1064 has disk smart error - https://phabricator.wikimedia.org/T206245 (10Cmjohnson) Swapped the failed disk
[16:46:55] it's weird
[16:47:13] we have depooled labsdb1011 from dbproxy1010
[16:47:26] but connections are showing up
[16:47:32] not a lot
[16:47:37] but rarely one
[16:49:31] https://www.irccloud.com/pastebin/x5ZngjRV/
[20:07:55] all the labsdb hosts are back to action, now I leave. bye
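The lingering connections on labsdb1011 after the dbproxy1010 depool can be inspected from the processlist; a depool only stops new connections through the proxy, while sessions opened before it stay until they disconnect. A sketch, grouping by client address (for proxied traffic that address would be the proxy's own IP):

```sql
-- On labsdb1011: who is still connected, and from where.
SELECT user,
       SUBSTRING_INDEX(host, ':', 1) AS client,
       COUNT(*)                      AS connections
FROM information_schema.PROCESSLIST
GROUP BY user, client
ORDER BY connections DESC;
```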