[00:10:48] PROBLEM - 5-minute average replication lag is over 2s on db2098 is CRITICAL: 482 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2098&var-port=13313&var-dc=codfw+prometheus/ops
[02:16:58] RECOVERY - 5-minute average replication lag is over 2s on db2098 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2098&var-port=13313&var-dc=codfw+prometheus/ops
[05:11:54] 10DBA, 10Phabricator: Upgrade m3 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T259589 (10Marostegui) >>! In T259589#6374543, @mmodell wrote: > @marostegui: That works for me. Thanks - I have updated the calendar event.
[06:15:04] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Marostegui) a:03Marostegui
[06:15:50] 10DBA: Compare a few tables per section before the switchover - https://phabricator.wikimedia.org/T260042 (10Marostegui) p:05Triage→03Medium
[06:16:28] 10DBA: Compare a few tables per section before the switchover - https://phabricator.wikimedia.org/T260042 (10Marostegui)
[06:16:31] 10DBA, 10Patch-For-Review, 10Sustainability (Incident Followup), 10User-Banyek: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (10Marostegui)
[06:43:56] backups are ongoing, should be finished in 1 hour or so
[06:56:50] great
[07:10:18] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) Schema change times on s8 (w...
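The alert text above encodes its thresholds inline: `(C)2 ge (W)1 ge 0` means CRITICAL when the 5-minute average replication lag is >= 2s and WARNING when it is >= 1s (the PROBLEM fired at 482s). A minimal sketch of that classification logic, with a hypothetical function name (this is not the actual Icinga/Prometheus check):

```python
# Sketch of the threshold logic in the alert above: CRITICAL at >= 2s of
# 5-minute average replication lag, WARNING at >= 1s, OK below that.
# classify_lag is a hypothetical helper, not the real check script.
def classify_lag(avg_lag_seconds: float, warn: float = 1.0, crit: float = 2.0) -> str:
    if avg_lag_seconds >= crit:
        return "CRITICAL"
    if avg_lag_seconds >= warn:
        return "WARNING"
    return "OK"

print(classify_lag(482))  # the PROBLEM above fired at 482s -> CRITICAL
print(classify_lag(0.3))  # -> OK, as in the RECOVERY message
```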
[07:10:21] Amir1: ^
[07:12:49] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[07:53:02] https://mariadb.com/kb/en/mariadb-10414-release-notes/
[07:54:16] are you suspecting MDEV-22497 to be relevant to us?
[07:54:50] oh, I just saw your comment
[07:55:30] haha
[07:56:00] it actually might be, there's lots of uncertainty about all those bugs
[07:56:10] do you think that would fix labsdb problems?
[07:56:23] Who knows... unfortunately we cannot really test it
[07:56:27] As in, we don't have time for it
[07:56:39] We could once we have the new hosts
[07:56:45] so if we "lose" one of the old ones, it is "ok"
[08:00:27] marostegui: Thanks!
[08:25:05] marostegui: random note: the script has finished 270 wikis out of 900, it's going to take two more days and it will be done
[08:34:26] Amir1: thanks for the heads up
[08:57:40] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Marostegui) 05Open→03Resolved This was successfully done. Big thanks to @Urbanecm for all the testing and driving this from MW side! Thanks also @Ladsgroup...
[08:57:42] 10DBA: Move more wikis from s3 to s5 - https://phabricator.wikimedia.org/T226950 (10Marostegui)
[08:57:53] 10DBA, 10Datasets-General-or-Unknown, 10Sustainability (Incident Followup), 10WorkType-NewFunctionality: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Ladsgroup)
[08:58:57] 10DBA, 10Datasets-General-or-Unknown, 10Sustainability (Incident Followup), 10WorkType-NewFunctionality: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Ladsgroup)
[08:59:17] 10DBA: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Ladsgroup)
[08:59:21] 10DBA: Remove muswiki and mhwiktionary from s3 - https://phabricator.wikimedia.org/T260112 (10Marostegui)
[08:59:36] 10DBA: Remove muswiki and mhwiktionary from s3 - https://phabricator.wikimedia.org/T260112 (10Marostegui) p:05Triage→03Medium
[08:59:43] 10DBA: Remove muswiki and mhwiktionary from s3 - https://phabricator.wikimedia.org/T260112 (10Marostegui)
[08:59:46] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Marostegui)
[09:08:33] 10DBA, 10User-Urbanecm: Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Marostegui) a:03Urbanecm Assigning to @Urbanecm for the documentation change. Thank you very much
[09:12:59] 10DBA, 10User-Urbanecm: Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Ladsgroup) The bot is updated: https://github.com/Ladsgroup/Phabricator-maintenance-bot/commit/e04aa3a936ab339ef87ab49849b1b9709750abe4
[09:13:44] 10DBA: Remove muswiki and mhwiktionary from s3 - https://phabricator.wikimedia.org/T260112 (10Marostegui) I have renamed the tables on the s3 master only (db1123) for `muswiki` and `mhwiktionary`. If any write attempts happen, they will fail. Let's leave this for a few days while we monitor logstash for any issues...
[09:24:37] 10DBA, 10User-Urbanecm: Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Urbanecm) Updated shard in https://meta.wikimedia.org/w/index.php?title=Template:New_wiki_request&diff=20355000&oldid=20353965 to be s5 by default. Add a wiki docs were [updated](https://...
[10:37:57] now that I remember, there is a meta table/database that tracks the sections of each wiki- not sure if that has to be manually changed
[10:38:50] for example, muswiki still appears on s3 at https://replag.toolforge.org/
[10:39:03] although not sure if that is in the meta db/table or on dns?
[10:39:17] jynus: interesting....
[10:39:23] jynus: I will ask cloud, thanks!
[10:39:26] I would check and document that
[10:39:34] yes, by "check" I don't necessarily mean yourself
[10:39:37] check with cloud
[10:40:06] I know it is automated on new creation
[10:40:16] but maybe a change was not anticipated
[10:41:13] 10DBA, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Marostegui) @bd808 @Bstorm we have moved two wikis from s3 to s5. While nothing needs to be done data/views-wise on labsdb hosts, Jaime has noticed that...
[10:41:15] asked ^
[10:42:20] 10DBA, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10jcrespo) Not sure if a change is needed on dns per-wiki endpoints too.
[10:42:26] I've added the dns stuff
[10:42:29] to the question
[10:44:27] cheers
[10:46:58] kormat: I am ok with the patch, but I suggested an even slower deployment strategy. Also, did you test the package on stretch? is it uploaded on the stretch repo, or only on buster?
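The write-blocking step described at 09:13:44 (renaming the moved wikis' tables on the s3 master so any stray write errors out) can be sketched as generated SQL. The table list and the `T260112_` prefix here are hypothetical; the log does not record the actual naming used:

```python
# Sketch of the "rename tables so stray writes fail" step mentioned above.
# The table names and the T260112_ prefix are illustrative only.
def rename_statements(dbname, tables, prefix="T260112_"):
    """Build RENAME TABLE statements that park a wiki's tables under a
    task-tagged name, so any write against the old name fails loudly."""
    return [
        f"RENAME TABLE `{dbname}`.`{t}` TO `{dbname}`.`{prefix}{t}`;"
        for t in tables
    ]

for stmt in rename_statements("muswiki", ["page", "revision"]):
    print(stmt)
```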
[10:47:29] jynus: i've been working with marostegui on a more refined deployment plan, which is here: https://phabricator.wikimedia.org/P12206
[10:47:39] but good point re: stretch. i'll update the plan with that in mind.
[10:47:41] oh, sorry, I hadn't seen that
[10:48:32] I see, it has a similar plan to what I suggested
[10:48:44] 10DBA, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Marostegui) @elukey not sure if there's something from your side when you sqoop stuff (even though those two wikis are closed), see: T259438#6375572
[10:48:53] I just suggested the additional run via cumin, sudo -u nagios, after line 10
[10:52:26] I am a bit worried about the amount of custom packages/software we will have to maintain at some point: wmf-mariadb, wmf-mariadb-client, wmfmariadbpy, transferpy, wmf-pt-kill, ...
[11:17:42] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Memory_issues
[11:18:14] thanks!
[11:29:57] db1077 is available for testing? or still taken by otrs tests?
[11:40:54] it is still being used for otrs
[11:41:24] the one on codfw is free- it has an enwiki replica that is not in use
[11:43:11] ah cool
[11:43:15] I will use that one
[11:43:17] thanks
[12:35:26] 10DBA, 10Operations, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Kormat)
[12:35:39] 10DBA, 10Operations, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Kormat) p:05Triage→03Medium a:03Kormat
[12:43:07] huh. we have this even in 10.4 /etc/my.cnf's:
[12:43:08] `plugin_load = rpl_semi_sync_slave=semisync_slave.so`
[12:44:16] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[12:44:26] and that should be removed, because it is probably failing on start
[12:44:45] i think it causes a spurious error in the logs
[12:44:54] yeah for sure
[12:45:08] i've added it to the puppet refactor task so it doesn't get lost
[12:45:29] we have some ifs already in the my.cnf for the version
[12:45:35] I believe
[12:46:16] yeah. though it's a bit ugly: `<% if @basedir == '/opt/wmf-mariadb104' -%>`
[12:46:39] yep
[12:57:31] hmm
[12:58:03] volans: wmfmariadbpy depends on cumin, which means that installing packages which depend on wmfmariadbpy pulls in the cumin package; is there an issue with cumin getting installed on non-cumin hosts?
[13:03:16] please don't
[13:04:52] welp, then we have a problem.
[13:04:55] or at least not on production hosts
[13:05:07] didn't you create a separate package?
[13:05:29] jynus: in version 0.2, python3-wmfmariadbpy contains CuminExecution, so it depends on `cumin`
[13:05:57] sure, but you were only going to install it on cumin hosts, right?
[13:05:58] wmfmariadbpy-common (which contains db-check-health, the replacement for check_mariadb.py) depends on python3-wmfmariadbpy
[13:06:25] that seems bad- wmfmariadbpy should depend on common, right?
[13:06:49] `wmfmariadbpy-common` is a binary package. `python3-wmfmariadbpy` is a library package
[13:07:04] then the split hasn't been done correctly
[13:07:20] the whole idea was to separate cumin packages from non-cumin ones
[13:07:34] that's news to me, as the person working on this
[13:08:13] ah. what i have done is made separate binary packages for cumin and non-cumin hosts
[13:08:13] that was basically my only suggestion- to not install cumin on regular mysql hosts
[13:08:21] but they both use python3-wmfmariadbpy
[13:08:23] which depends on cumin
[13:08:27] I'll let you both handle how
[13:08:45] maybe common should not depend on that?
[13:09:01] not sure what is in which package honestly at this time
[13:09:48] but if it is not too late, you can revert the remote execution stuff and put it back in the transfer package if that helps?
[13:09:56] (not sure if that would)
[13:10:01] that's not included in version 0.2
[13:10:11] version 0.3 (not yet sent for review) is doing that
[13:10:34] ok, so the issue is only in 0.2?
[13:10:42] can we just skip that?
[13:11:12] version 0.3 has the same problem, but i can fix it there
[13:11:29] the issue is that i've uploaded version 0.2, and am in the middle of deploying it
[13:11:47] there is always time to revert
[13:11:51] abort
[13:11:58] do not install cumin everywhere
[13:12:20] jynus: do you have a specific reason to think it will cause problems?
[13:12:36] (otherwise, e.g., i could go ahead now, and when version 0.3 is deployed remove it again)
[13:12:40] a remote execution library on every host?
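The packaging layout being described — one shared library package plus separate binary packages for cumin and non-cumin hosts — can be sketched in debian/control terms. The package names follow the conversation (the log itself is inconsistent about exact names), and all fields here are illustrative, not the real control file:

```
Source: wmfmariadbpy

# Library package: its hard Depends on cumin is the root of the problem,
# since both binary packages below pull it in transitively.
Package: python3-wmfmariadbpy
Architecture: all
Depends: ${python3:Depends}, cumin
Description: wmfmariadbpy Python library (contains CuminExecution)

# Intended for plain db hosts: ships db-check-health.
Package: wmfmariadbpy-common
Architecture: all
Depends: python3-wmfmariadbpy
Description: wmfmariadbpy tools for database hosts

# Intended for cumin hosts only.
Package: wmfmariadbpy-admin
Architecture: all
Depends: python3-wmfmariadbpy, cumin
Description: wmfmariadbpy admin tools for cumin hosts
```

The fix discussed below amounts to dropping `cumin` from the library package's Depends so only the admin package carries it.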
[12:52:52] that is a recipe for disaster, even if it doesn't work
[12:52:59] it also installs a bunch of dependencies
[12:53:03] we don't want
[12:53:21] `cumin python3-clustershell python3-colorama python3-pyparsing python3-tabulate python3-tqdm python3-wmfmariadbpy` is what would get installed
[13:15:35] I ask you to please not install cumin on all databases, but manuel has the last word
[13:17:08] I think we might need to involve more people here, like volans or moritz, to see which alternatives we have and what each approach (either install it or don't) brings us
[13:17:19] I honestly can see both sides
[13:17:49] I'm still on the move, so bear with me if I misunderstood the backlog, just skimmed
[13:17:59] is it possible to make cumin an optional dependency in the library package?
[13:18:14] so that some functionalities will not be available if not installed
[13:18:19] as kormat says, it is possible to make it not a dependency
[13:18:39] so I don't see any reason not to wait and fix the bug before full deploy
[13:18:59] jynus: i've been trying to do small contained changes
[13:19:28] volans: possibly, not 100% sure
[13:19:46] by optional I mean in setup.py and debian/control terms
[13:19:48] sure, but this was discovered as not ideal, so I prefer to revert and at least reevaluate- not saying to not do it, just to pause for a second
[13:19:53] it's possible to fix this with a bunch of package refactoring
[13:19:57] then if you use that feature it can crash for now, that's ok
[13:20:05] cumin will not work anyway from the db hosts
[13:20:30] * volans brb
[13:22:25] can't we quickly remove the dependency as volans suggests?
[13:22:36] and as you say, think and refactor later?
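The optional-dependency idea floated at 13:17:59 ("some functionalities will not be available if not installed", "if you use that feature it can crash for now, that's ok") is typically done with a guarded import: the library imports fine everywhere, and only the remote-execution feature fails where cumin is absent. A minimal sketch, with the class name taken from the log but the code purely illustrative:

```python
# Guarded import: the library stays importable on hosts without cumin;
# only the remote-execution feature breaks there, which matches
# "if you use that feature it can crash for now, that's ok".
try:
    import cumin  # only installed on cumin hosts
    HAS_CUMIN = True
except ImportError:
    HAS_CUMIN = False


class CuminExecution:
    """Remote-execution backend; usable only where cumin is installed.
    (Class name from the log; this implementation is a sketch.)"""

    def __init__(self):
        if not HAS_CUMIN:
            raise RuntimeError("remote execution requires the 'cumin' package")

print("cumin available:", HAS_CUMIN)
```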
[13:26:45] i think it would be messy
[13:27:45] (the only way i can think to do it would be to tell setup.py that cumin is optional, which can screw up python development)
[13:28:26] so I say revert and rethink rather than go ahead with a plan that doesn't solve the problem that motivated it (improve deployment and development), just my opinion
[13:28:36] there is always time to redeploy
[13:29:37] I don't think we should rush something unless it is an emergency
[13:30:10] I am also offering help
[13:30:22] i'm trying to figure out how to revert
[13:30:28] so you don't feel like I am just talking and not acting
[13:30:41] if I understand correctly, I think the issue is remote execution
[13:30:58] so I can make that its own package to solve the dependency issue
[13:31:14] kormat: what has been done so far?
[13:31:27] let me help
[13:31:47] i built version 0.2, uploaded it to apt, and installed it on cumin2001
[13:31:54] there's an on-going discussion in #-sre
[13:31:55] that is ok
[13:31:59] that's a non-issue
[13:32:13] is there puppet to install that on all hosts yet?
[13:32:20] or manual commands?
[13:32:38] because if that is all, we don't need to "revert" anything
[13:32:53] please see the discussion in #-sre
[13:33:10] I saw that, that is ok
[13:33:22] we do that all the time with new mysql versions
[13:33:24] we upload it
[13:33:36] you can go ahead and install it on cumin
[13:33:52] but only manually upgrade
[13:34:06] you can keep the 0.2 on the repo
[13:34:46] or if you really want to go back to 0.1, it is not the cleanest way, but you can upload and replace the 0.1 version
[13:35:03] that will work
[13:35:45] but if there is no puppet change, would that even be needed?
[13:36:06] i dislike having a version in apt that should not be used
[13:36:21] but what is the issue with having 0.2 on cumin?
[13:36:25] I didn't get that
[13:36:35] I only worried about other hosts?
[13:36:46] cuminX001 hosts, I mean
[13:37:32] would it be harder to fix?
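What "tell setup.py that cumin is optional" usually means in setuptools terms is an `extras_require` group: the base install carries no cumin, and cumin hosts install the extra (`pip install 'wmfmariadbpy[remote]'`). An illustrative fragment, shown as the kwargs that would be passed to `setuptools.setup()`; the version number and extra name are hypothetical, not the project's actual metadata:

```python
# Illustrative packaging metadata, not the real setup.py: cumin moves from
# install_requires (a hard dependency) to an extras_require group, so plain
# db hosts never pull it in.
SETUP_KWARGS = {
    "name": "wmfmariadbpy",
    "version": "0.3",  # hypothetical
    "install_requires": [],  # deliberately no cumin here
    "extras_require": {
        # cumin hosts only: pip install 'wmfmariadbpy[remote]'
        "remote": ["cumin"],
    },
}

print("cumin" in SETUP_KWARGS["extras_require"]["remote"])  # → True
```

The debian/control analogue is moving `cumin` from Depends to Suggests (or to a separate binary package), at the cost kormat notes: code paths that need cumin must now tolerate it being absent.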
[13:42:36] * volans back
[13:43:30] let me know when you are back and I will help you do either of the 2 options
[13:44:41] marostegui: you're going to have so much fun with T260111
[13:44:41] T260111: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111
[13:47:14] jynus: back. you have a good point. the current puppet repo only installs wmfmariadbpy on cumin hosts, where version 0.2 is perfectly fine
[13:47:25] ok, so my words were inexact
[13:47:40] I wasn't asking so much for a revert as for "let's pause for a second and think"
[13:47:56] let me help you with anything you want
[13:48:12] I am blocked on some of my goals and I really want to help you, not be a burden 0:-)
[13:48:36] we can do as many 0.X versions as needed to get it right
[13:48:42] don't worry, ok?
[13:48:47] i'm good now, thanks :)
[13:49:04] but it is much easier to fix it when installed on 2 hosts than when installed on 200
[13:49:07] i'll send a CR to you in a bit
[13:49:07] hope you get my point
[13:49:11] absolutely :)
[13:49:29] i didn't even install it on a db host. i stopped when i saw it wanted to pull in cumin
[13:49:38] ok, I understood the opposite
[13:49:46] because you said "you were in the middle of deployment"
[13:49:55] understanding that you were mass-installing it
[13:49:55] ah right
[13:50:06] nothing happened yet in my opinion
[13:50:19] fair :)
[13:50:19] let's maybe take the time to even test 0.2?
[13:50:33] that is why I said "revert before it is too late"
[13:51:04] in any case, you could also have force-uploaded 0.1 and it would have for the most part worked
[13:51:15] except it wouldn't be rolled back on already installed hosts
[13:51:34] * kormat nods
[13:51:47] the problem is this is getting more and more complex
[13:51:49] for that host i would have uninstalled it and then re-installed it, which would get back to 0.1
[13:51:53] yep
[13:52:11] I am regretting the package management for deployments
[13:52:24] not entirely, because there is no alternative
[13:52:31] but this is now a 5-package issue?
[13:54:32] mm. i think at this point i'll probably drop the `wmfmariadbpy` metapackage
[13:54:37] is this the solution https://phabricator.wikimedia.org/P12208 ?
[13:54:53] do you want me to take remote execution off your plate?
[13:54:57] so cumin hosts will end up having a dependency on `wmfmariadb-admin`, and db hosts will depend on `wmfmariadb-common`
[13:55:18] yep, pretty much re: that paste
[13:55:22] no, that's fine
[13:55:41] let me help you in any way; I don't want to give you overhead and put it all on your shoulders
[13:56:08] jynus: all i need is CR reviews when i get a chance to send them
[13:56:11] but i appreciate the offer
[13:56:17] ok, I'll do that, you know it
[13:56:27] yep :)
[13:57:06] and sorry again for the quick revert, I am pretty much a revert-first, think-later person 0:-)
[14:58:05] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) s8 eqiad progress [] labsdb...
[16:04:39] 10DBA, 10Operations, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Marostegui) p:05Medium→03High Setting it to high as we don't have many "old" masters with 10.4 but we already have some that would use this script: x1, es4, es5...
[20:37:12] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10mmodell) ugh. @jcrespo, I apologize, I let the ball drop on this one. It wouldn't take much effort on my part, we already have the puppet scaffolding to support separati...
[21:01:58] 10DBA, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10bd808) >>! In T259438#6375572, @Marostegui wrote: > @bd808 @Bstorm we have moved two wikis from s3 to s5. > While nothing needs to be done data/views-wi...
[22:33:07] 10DBA, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Urbanecm) Makes sense, thanks. @Marostegui Do we have any docs for changing the shard/slice/section/whatever the terminology is? If so, can we add the in...