[04:32:58] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui) p: Triage→Medium
[04:46:56] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) labsdb1011 seems to be working fine. Good news is that 10.4+Buster seems to confirm that the CPU usage is a lot better and the host isn't having almost 100% usage as i...
[06:44:31] https://mariadb.com/kb/en/mysqluser-table/ 🤦
[07:10:26] marostegui: "mysql.user is now a view" - i'm guessing that's.. problematic?
[07:10:51] kormat: I knew that but I never thought of the implications when doing a logical import from 10.1 (where it is not a view) to 10.4 (where it is)
[07:11:15] I was providing feedback to the mariadb bug about the grants, and I realised that that can be very problematic indeed when dealing with logical backups or imports
[07:11:20] does this explain the missing grants?
[07:11:24] I have asked to see what the workaround is
[07:11:32] kormat: most likely related yeah
[07:11:47] yay/crap
[07:11:50] although I want to see what the suggested path for a logical import is
[07:12:19] * kormat nods
[07:12:27] pipe the import through `sed`? ;)
[07:17:33] marostegui: logical import for s4+s5 completed on db2137. do i need to do something to fix grants there?
[07:19:49] grants are not backed up precisely because of things like that
[07:20:16] remove the import user and add the production ones on puppet
[07:23:02] i've removed the import user - where do i look in puppet?
[07:23:31] modules/role/templates/mariadb/grants/production.sql.erb ?
[07:30:12] those are the common ones
[07:30:21] core hosts also have the production-core ones
[07:30:34] how do i use this?
[07:31:55] I cannot really say
[07:32:16] there is no grant handling system
[07:36:57] jynus: i'll rephrase. what would _you_ do at this point?
[07:37:09] I'll ask marostegui :-D
[07:37:14] hah, ok :)
[07:37:47] or maybe copy the grants from a production host with pt-show-grants
[07:38:03] I don't think I've set up a host from 0 in a long time, or ever
[07:38:17] from 0 as in, without using binary backups
[07:38:40] logical backups are only used for partial recoveries
[07:39:11] sounds like we're in need of some documentation. i can write some when i am blessed with the knowledge
[07:39:55] so in the past, those grants were written to every host
[07:40:14] I removed that, as I thought that writing clear text passwords to every host wasn't a wise idea
[07:40:28] then we told m*rk we needed an account handling system
[07:40:59] jynus: btw, https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Enabling_replication_on_the_recovered_server mentions 'neodymium', which i think is decommissioned
[07:41:00] or service, and that it couldn't be on puppet as we needed to link client and server hosts
[07:42:11] If you see the below link, you will see I didn't write that
[07:42:21] but I guess it is cumin hosts now?
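A minimal sketch of the checks and the pt-show-grants approach discussed above, assuming a shell on the hosts involved; hostnames are placeholders, and grants are deliberately excluded from the dumps, so this is illustrative rather than the documented WMF procedure.

    # Confirm whether mysql.user is a base table (10.1) or a view (10.4+):
    sudo mysql -e "SELECT table_type FROM information_schema.tables
                   WHERE table_schema='mysql' AND table_name='user';"
    # Copy grants from an existing production replica to the freshly
    # imported host; OLD_HOST and NEW_HOST are placeholders, not real names.
    pt-show-grants --host=OLD_HOST --user=root --ask-pass | mysql -h NEW_HOST -u root -p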
[07:42:39] hey you told me to give feedback on issues with the backup documentation - this is backup documentation :)
[07:42:44] nope
[07:42:52] I got you the data recovered
[07:43:02] oh i see how it is :)
[07:43:03] this is mysql productionization :-D
[07:43:48] and I would never use a logical copy for this
[07:44:27] there is not yet a procedure for this
[07:46:35] no procedure and no infrastructure, to be fair
[07:55:06] this is also unclear with the whole mysql.user + upgrade issue
[08:35:00] DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (Marostegui)
[09:54:21] DBA, Cloud-Services: Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (Kormat) Sanitization is in place, waiting on a complete private data check before handing it over to Cloud for view creation.
[09:54:25] DBA, Cloud-Services: Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (Kormat) Sanitization is in place, waiting on a complete private data check before handing it over to Cloud for view creation.
[10:09:56] going to stop the event scheduler on tendril for some testing
[10:10:20] ok
[11:33:33] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) So labsdb1011 looks stable. CPU seems to be stable at around 30% usage (which is a big improvement compared to the previous values). Lag grows, but that is something...
[14:19:21] DBA, Cloud-Services: Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (Kormat) Private data check passed. I've created `awawiki_p`, and given the `labsdbuser` role a grant to it. This is now ready for #cloud-services-team to create the views.
[14:20:07] DBA, Data-Services: Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (Kormat) a: Kormat→None
[14:21:07] DBA, Cloud-Services: Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (Kormat) Private data check passed. I've created `gomwiktionary_p`, and given the `labsdbuser` role a grant to it. This is now ready for #cloud-services-team to create the views.
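The wiki-replica step described in the two task updates above boils down to roughly the following; only the database names and the `labsdbuser` role come from the comments, and the exact privilege list granted is an assumption for illustration.

    # Hedged sketch of creating a _p database and granting it to the role;
    # SELECT/SHOW VIEW is an assumed privilege set, not confirmed above.
    sudo mysql -e "CREATE DATABASE awawiki_p;"
    sudo mysql -e "GRANT SELECT, SHOW VIEW ON awawiki_p.* TO labsdbuser;"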
[14:21:37] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (Kormat) a: Kormat→None
[14:22:00] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (Kormat)
[14:24:39] jynus: here I am :-)
[14:24:51] so recap
[14:24:58] so you are aware of current status
[14:25:07] and m*rk actually asked you to be my backup
[14:25:15] well not necessarily you
[14:25:26] but you would probably be the best candidate
[14:25:39] in case something happens, so you know the latest updates
[14:25:43] and I am not around
[14:26:00] backup1001 and backup2001 are almost exactly the same as before
[14:26:10] helium and heze
[14:26:17] ok
[14:26:19] except they have a second array
[14:26:34] array2 with the storage and pool of Databases
[14:27:01] backup2002 has the offsite through "working" copy jobs
[14:27:13] and an empty Databases storage
[14:27:24] the idea is to start using it
[14:27:51] right now backup1001-Databases gets a job from dbprov2* hosts
[14:27:56] actually 4 jobs
[14:28:12] 2 for logical backups and 2 for binary backups, one per physical machine
[14:28:18] but they all go to the same pool Databases
[14:28:40] so the immediate change is to enable a DatabasesCodfw pool (I accept alternative names)
[14:28:57] to store dbprov100* stuff
[14:29:05] they should be equivalent between dcs
[14:29:12] just taken from local dc dbs
[14:29:24] up until now everything ok, akosiaris?
[14:29:42] (I think you should know that, or have understood it from the patch)
[14:30:02] now backup1002 and backup2002 and their "double appearance"
[14:30:07] yup, now it's clear
[14:30:18] 2 new machines were bought
[14:30:52] the roles were still unclear, but we had the goal of content backups, and we used them for that (30TB from the start)
[14:31:39] now, physically they are individual machines, but the idea is that they are just an extra dbprov and an extra backup director+storage combined
[14:31:50] it is just a lot of bytes
[14:31:57] just to generate the backups
[14:32:06] that wouldn't fit on dbprov hosts
[14:33:02] so they locally generate backups of the local dc (es hosts in this case)
[14:33:23] ah, I had missed that part
[14:33:25] but as the graphs show, they store the remote dc's backups into bacula
[14:33:42] so it is as if they were 2 machines (2 roles)
[14:34:02] the role will be something like "bacula storage + dbprov"
[14:34:29] it makes sense for them to be together as they are "content database backup hosts"
[14:34:38] which are very large, even larger than otrs!!!!!
[14:34:42] :-DDDDD
[14:34:50] ok ok, so the dbprov "part" of them stays dc-local and the "bacula-sd" part of them backs up their mirror machine in the other DC
[14:34:55] yes
[14:35:00] now that clears it up
[14:35:01] thanks!
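One way to picture the "2 roles" on backup1002/backup2002 described above is two otherwise unrelated services living on the same box. A rough sketch of checking both, where the bacula-sd unit name is the stock Debian one and the /srv/backups path is only an assumption for illustration, not something stated in the conversation.

    # The "bacula storage" role: the storage daemon that receives the
    # remote DC's content backups.
    systemctl status bacula-sd
    # The "dbprov" role: locally generated dumps/snapshots of the local
    # DC's es hosts (path is an assumption).
    du -sh /srv/backups/* 2>/dev/null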
[14:35:07] so in the end everything will be symmetrical
[14:35:10] except the director
[14:35:20] of which we will only have one, for now
[14:35:21] ok, so you indeed need 2 pool
[14:35:28] 2 pools*
[14:35:28] yep
[14:35:38] forget about content
[14:35:46] the path is first just Database
[14:35:49] but content will be the same
[14:36:06] I mean, I don't technically need 2 pools
[14:36:12] but I think for independence
[14:36:14] it makes sense
[14:36:19] DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (Marostegui)
[14:36:36] what I need is 2 storage daemons, which could be a single pool
[14:36:44] but I think everything separate is cleaner
[14:37:03] we could also separate production into more pools in the future
[14:37:08] if needs arise
[14:37:13] I don't think a pool can span 2 storage daemons anyway
[14:37:29] oh, I thought it could, but honestly I didn't check
[14:37:41] because I went with a separate pool design from the start
[14:37:46] I am not sure, I'd have to double check, but I've never done it
[14:37:55] it doesn't matter :-D
[14:38:01] I think separate pools is better
[14:38:08] yes, agreed
[14:38:13] it's like the 2 arrays: I could have virtualized them into 1 big storage
[14:38:24] but when we have hw issues
[14:38:32] we will be glad we separated by hw
[14:38:44] so, getting more concrete
[14:38:48] so my only complaint about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/598005/2/modules/profile/manifests/backup/director.pp would be that at some point FileStorageDatabases should become FileStorageDatabasesEqiad to match the newly added pool
[14:38:57] I would agree
[14:39:01] otherwise, the approach LGTM
[14:39:03] I would have to check if I can rename it
[14:39:08] without losing data
[14:39:24] yes, "just one database edit"
[14:39:36] but just let me go slowly 0:-D
[14:39:36] but you are saying in the comment "I don't think this method scales for further modifications later on."
[14:39:40] yes
[14:39:40] what further modifications?
[14:39:44] I wanted to get to this
[14:39:56] clearly this was started with almost only 1 pool at the time
[14:40:08] and then you probably added archive
[14:40:13] and I added Databases
[14:40:27] I think a refactoring is due
[14:40:39] still have a default Pool
[14:41:02] but make the others just optional things, as I said in the comment
[14:42:03] I cannot remember the exact class name, but profile::backup{ fileset => /srv, pool => ContentDatabasesCodfw}
[14:42:18] default pool production
[14:42:39] then configure the job with a pool on the job config
[14:42:46] and leave job defaults without the pool
[14:42:55] also fix some hiera key handling
[14:44:16] For example, in the future we may have a Swift pool or whatever
[14:44:18] yeah, overall I've been meaning to get to refactoring this after we move to puppet 4. It has been that way since puppet 2.7 because of the weird constructs of puppet to do loops
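For reference, the pool split being agreed on here can be inspected from the director once the change is in; a minimal sketch using stock bconsole commands, where "DatabasesCodfw" is simply the name proposed above and may end up being renamed.

    # List the pools the director knows about (should show Databases and,
    # after the change, the new codfw one):
    echo "list pools" | sudo bconsole
    # Show the configured Pool resources from the director configuration:
    echo "show pools" | sudo bconsole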
[14:44:28] don't worry
[14:44:34] I just want your blessing
[14:44:40] to go in that direction
[14:44:41] puppet 4 has come and gone, we are now at 5.5 IIRC, but I haven't gotten to it
[14:44:47] yeah sure, go ahead
[14:44:50] because the usage changed
[14:45:02] you can even ditch jobdefaults if they bug you too much
[14:45:15] 1 pool => 1 pool and other stuff => several pools, some of them much larger than production
[14:45:33] akosiaris: I think they are ok, although because we don't change anything
[14:45:41] they may be redundant with schedules
[14:45:58] they are meant as a shortcut for people not having very good config mgmt systems. If your puppetization proves adequate we don't need them really. They do tend however to make configuration shorter and easier to read after you grow accustomed to them
[14:46:08] oh, yes
[14:46:14] I want to maintain job defaults
[14:46:30] I just don't want to maintain job defaults for tuesday on the databases pool
[14:46:50] job defaults for monday on the production-codfw pool
[14:47:04] ah, yes, the per-day jobdefaults is there to accommodate puppet loops :-(
[14:47:12] the whole idea is to abstract
[14:47:12] well, the lack of proper puppet loops
[14:47:29] and I think we should continue abstracting
[14:47:43] but I am asking your ok to do it in a slightly different way
[14:47:59] pool being a "first order" configuration value
[14:48:01] sure, go ahead. +1 !
[14:48:08] and other things we don't change, merge them
[14:48:17] also wanted you to be aware of the changes
[14:48:27] "WTF is DatabasesCodfw!"
[14:49:05] and if it works well for databases, think if/how to apply it to production
[14:49:16] which doesn't have a dbprov equivalent
[14:49:26] maybe just plain duplicate on both dcs?
[14:49:45] could we handle double the number of backups?
[14:49:59] homework for later :-D
[14:50:41] for example, I believe because many hosts import the same profile
[14:50:50] we may have duplicated backups
[14:51:24] I would like to work with each service owner and see if we can do better
[14:51:28] but that is for later
[14:51:29] jynus: I +1ed the change on that premise. Feel free to capture this conversation there or anywhere else you feel most suitable (or even nowhere for that matter)
[14:51:45] thanks for taking the time to explain it :-)
[14:51:52] I think I will document the Databases model
[14:51:57] on the wiki
[14:52:00] thanks for your time
[14:52:22] jynus: I recall you made a picture of how the bacula hosts will integrate with dbprov, with arrows and all that, but I am not finding it
[14:52:39] sorry marostegui, I sent him the link on the other channel
[14:52:47] ah ok
[14:52:48] Checking
[14:52:50] thanks
[14:52:50] it was https://phabricator.wikimedia.org/T79922#6065295
[14:52:56] right that! thanks
[14:52:58] I went here to spam you
[14:53:05] :p
[14:53:07] because it was becoming too detailed for -security
[14:53:48] join me in thanking akosiaris for keeping up with the changes
[14:54:06] so he can save us if I am not around to recover db stuff :-D
[14:54:24] I will annoy kormat at another time
[14:54:28] about this
[14:54:39] * kormat hides
[14:55:01] but I think it is important to buy netops, serviceops, and others into my ideas
[14:55:11] as well as people trying to poke holes in them
[14:55:15] early in the process
[14:55:25] as you did too, marostegui
[14:55:36] I want everybody happy with the changes
[14:56:04] jynus: let's put that useful image somewhere within https://wikitech.wikimedia.org/wiki/Bacula ?
[14:56:10] so it is easier to find than a phab ticket?
[14:56:11] Yeah
[14:56:16] I will
[14:56:21] I was about to reflect this there
[14:56:22] <3
[14:56:24] with the plan
[14:56:34] although I may be sparse on words
[14:56:48] as, if it works well, I may extend the model to all backups
[14:59:17] I said btw, marostegui, and I hope that you agree, that db backups on dbprov should normally be enough both for provisioning and emergencies
[14:59:43] and that that would be enough to counteract the fact that long-term backups in a different dc would take longer to recover
[15:00:16] which was what we discussed about improved reliability but lower performance of cross-dc
[15:01:10] jynus: yep, we agreed on that one
[15:01:33] I will try to put all that on http://en.wikipedia.org/wiki/Special:Search?go=Go&search=Bacula
[15:01:47] arg, auto replace text
[15:01:54] DBA, Patch-For-Review: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (Kormat)
[15:02:03] http://en.wikipedia.org/wiki/Special:Search?go=Go&search=wt:Bacula?
[15:02:29] not sure what the interwiki prefix for wikitech is
[15:06:07] wikitech :)
[15:06:09] https://meta.wikimedia.org/wiki/Interwiki_map
[15:06:30] I would have thought it was labswiki
[15:07:07] Nope, that's just the db name... :P
[15:08:19] Going to stop the event scheduler again on db1115 for a minute
[15:08:39] ok
[15:09:24] back again
[15:22:31] marostegui: akosiaris: https://wikitech.wikimedia.org/wiki/Bacula#Architecture_update_%282020%29
[15:22:43] please feel free to copy edit
[15:24:37] akosiaris: I will not document the code changes, because I have not yet done them, they are implementation details, and it should not alter the "interface" of puppet setting up a backup
[15:24:55] I missed a yet there
[15:26:10] ok
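The event-scheduler toggles mentioned at 10:09 and 15:08 correspond to a single global variable; a quick sketch, assuming root access on the tendril host (db1115).

    # Pause the events, run the test, then re-enable them:
    sudo mysql -e "SET GLOBAL event_scheduler = OFF;"
    # ... testing ...
    sudo mysql -e "SET GLOBAL event_scheduler = ON;"
    # Confirm the current state:
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'event_scheduler';"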