[06:12:15] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Marostegui) >>! In T246970#5952464, @Mike_Peel wrote: > I'm now getting the normal 'killed' message for going over 30 minutes, rather than the MySQL error. So perhaps things a...
[06:12:59] DBA, Performance-Team, WMF-JobQueue, Wikimedia-Rdbms, and 2 others: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T218692 (Marostegui) Sure, we can try to coordinate and test it on codfw with mwdebug indeed somet...
[07:15:33] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2114.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003090715_...
[07:35:43] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2114.codfw.wmnet'] ` and were **ALL** successful.
[07:46:56] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui)
[07:54:16] https://usercontent.irccloud-cdn.com/file/mrI9N6PQ/image.png
[07:54:44] woot!!!!
[07:54:48] <3
[07:55:17] Once that pink line gets to the bottom, we just have 87 million + to do, which will be less than 1 million items and take just a few hours I hope :)
[07:55:34] and of course a double / triple check to make sure we didn't miss anything
[08:10:10] DBA, Cleanup, Data-Services: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (Marostegui) @Bstorm @JHedden @bd808 can you guys please remove the view for this database?
It has already been deleted, but the views are still there and hence tri...
[08:30:39] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui) db2126 is the sanitarium master for s2, so I am not going to re-image it. Will pick another one for s2
[08:45:50] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003090845_...
[09:13:18] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2125.codfw.wmnet'] ` and were **ALL** successful.
[09:14:39] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui)
[09:21:04] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui)
[09:27:14] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui)
[10:15:16] DBA, OTRS, Operations, Recommendation-API, Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (akosiaris) As far as OTRS goes I can be around and help with restarts/verifying behavior and all that jazz. Pick dates that suit you and le...
[10:19:14] DBA, OTRS, Operations, Recommendation-API, Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (Marostegui) >>! In T246098#5953169, @akosiaris wrote: > As far as OTRS goes I can be around and help with restarts/verifying behavior and a...
[14:51:18] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[14:52:03] jynus: Can I restart db1102 for upgrade? No backups running on it
[14:54:54] em
[14:55:03] sure
[14:56:54] thanks :*
[15:04:37] all done
[15:04:53] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[15:05:59] I am going to do db1116 too, given that nothing is running there either
[15:13:56] done!
[15:17:24] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[15:31:45] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[15:46:58] FYI there's a standard partman recipe for hosts using hardware raid now, it'll do lvm over /dev/sda
[15:47:18] let me know if that's interesting to you and/or how to help
[15:51:26] godog: we normally tend to allocate as much as possible anyways to our DBs
[15:51:37] so lvm won't help here in that case much
[15:53:40] fair enough! looks like the current recipe uses lvm even now though? e.g. db1078
[15:53:47] maybe histerical raisins
[15:54:27] anyways mostly FYI, if it doesn't work for your use case that's fine too!
[15:54:36] godog: oh yes, sorry, for some reason I assumed you meant lvm to create specific volumes or something
[15:54:43] like to have spare free space if needed
[15:55:18] ah yes you are correct marostegui that by default the standard recipe will leave a little bit of space free on the VG
[15:55:43] godog: yeah, and we don't use that, we just allocate as much as possible to the data volume
[15:55:48] on buster that is, on < buster that setting doesn't work
[15:56:16] what setting doesn't work?
[15:56:34] the instruction for partman to leave some vg space unallocated
[15:56:58] Ah, really? I have seen that working on centos
[15:57:02] Like years ago
[15:57:39] Maybe it wasn't partman...cannot remember
[15:57:39] yeah easy to believe on centos it works as expected, like with anaconda IIRC
[15:57:48] yeah, it was with anaconda
[15:58:46] hehe yeah, with partman in the past we had to use tricks to achieve the same thing, should be all gone now
[15:59:25] anyways when/if the time comes and the standard recipes are interesting to you LMK, we can override partman easily to not reserve space
[16:00:20] will do, thanks godog
[16:00:27] I updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/577462
[16:00:40] yeah, we could just add a variant to set partman-auto-lvm/guided_size to 100%, the only other difference from the current DB partman recipes is xfs vs ext4 I think
[16:00:57] I have always had the doubt on whether it is better to allocate everything or allocate X% and then have that reserve as an emergency
[16:01:07] jynus: checking
[16:01:13] ah yeah good point moritzm
[16:01:39] I personally like a little bit of space for emergency/panic situations as online resize is safe nowadays
[16:01:51] online grow that is
[16:02:51] yeah
[16:02:56] the eternal doubt
[16:12:52] we eliminated that once we ran out of space back in the 1.5TB disks
[16:15:32] if I may, beyond recipe standardization, ext4 or unallocated space, we have a more important partman issue ATM:
disable automatic reimage properly and a recipe to skip formatting of /srv
[16:19:16] what do you mean with skip formatting of /srv, do you want to use a raw device instead of a file system?
[16:20:33] I want to reimage the root partition without touching the data, that lives on /srv
[16:20:36] moritzm: no, to not destroy the partition on accidental PXE reboot
[16:20:52] well, both things for different reasons
[16:21:04] one for partman to fail (I achieved that, but by chance)
[16:21:19] I would like something at pxe level, but needs work
[16:21:33] and one for partman to work, but only for os upgrade
[16:22:15] just to be clear, I am not needing that now, but it is higher priority than other partman fixes ongoing
[16:22:35] problem is it takes a lot of time and tests
[16:22:46] and partman formatting changes every new release :-(
[16:23:32] that part (reimaging while retaining /srv) is actually one of the things which are made much more simple with the partman standardisation; we only need to ship a variant of standard.cfg which skips an existing /srv (as opposed to what we had before where this was cluttered across 60 different recipes)
[16:23:33] it also doesn't help that /srv is within an lvm group
[16:24:27] not complaining about standardization :-D
[16:25:18] but it seems like a regression if now we cannot use xfs or other things because of that
[16:27:02] anyway, we haven't found a way yet to solve it that works, in any format (but we didn't have much time to work on that either)
[16:28:20] yeah not formatting /srv is an interesting use case for sure, iirc it would apply to other use cases too once we solve it
[16:29:43] that is my main point, not having too much time to work on all of these, lot of technical debt!
[16:29:44] variants for XFS can be added, but first we need to find out why some roles chose XFS in the first place (be it either performance, reliability or specific features) and if those reasons still hold, any input welcome at https://phabricator.wikimedia.org/T156955
[16:30:48] ext4 may be reliable for many small files with kernels after buster (maybe)
[16:31:24] fs developers they have been promising all bugs fixed for next version since 2.6 :-D
[16:33:41] https://www.percona.com/blog/2019/11/12/watch-out-for-disk-i-o-performance-issues-when-running-ext4/
[16:33:52] please note that xfs is not great either or has no bugs
[16:34:24] but it has been consistently "ok" for the last years, probably due to lack of innovation
[16:35:16] this is not theoretical, s3 architecture, which we don't like either, is a challenge for mariadb, but a use case we have to support for now
[16:35:30] s3 as section 3, not Amazon service :-D
[16:37:14] I think ext4's version of "It's never RICO" would be "It's always O_DIRECT" ;-P
[16:39:05] ack, thanks for the context!
[16:39:49] sorry we are hard to work with
[16:40:06] not intended to be a pain, and I am aware we are
[16:40:36] but it ends up happening mostly for 2 reasons: 1) very special (but justified) needs around performance
[16:40:50] 2) lack of people to work on basic stuff
[16:41:19] I hope 2 may get better soon
[16:41:47] moritzm: https://jynus.com/gif/%3C3.gifv
[16:41:56] DBA, Cleanup, Data-Services, cloud-services-team (Kanban): Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (bd808)
[16:46:07] sure :-) the mid term goal for standardising recipes will probably/hopefully address both as well: 1. make the standard recipes flexible enough for more special use cases like DBs and 2.
centralise the maintenance a bit so that not every sub team has to dive into the partman crap :-)
[17:31:21] DBA, OTRS, Operations, Recommendation-API, Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (leila) @Marostegui please go ahead. We can handle a few sec potential down for recommendationapi. (@bmansurov FYI)