[00:06:45] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (bd808) I know that the backend query killer (pt-kill) was set to be more aggressive while the cluster was operating at a lower capacity due to an instance with corrupt data (h...
[06:25:57] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Marostegui) Query killer is indeed back to its normal running times. 300 seconds for web service (labsdb1009 and labsdb1010) and 14400 seconds for analytics (labsdb1011): ` 4...
[06:33:09] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (zhuyifei1999) Hmm I can see that quarry is indeed running on web rather than analytics. @Bstorm I see in the [[https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL|SA...
[06:59:56] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui)
[07:23:18] So I was thinking on generating more snapshots and/or storing them on bacula for more long term
[07:23:49] any suggestion what would be useful for provisioning? E.g. daily snapshots?
[07:25:47] daily would definitely be very useful
[07:25:54] for a week?
[07:26:32] Do we have space for that? that means 8 daily snapshots
[07:26:34] (retention-wise)
[07:26:51] well, probably we would have to skip 1 for do dumps
[07:26:56] *to
[07:27:06] I think dumps are not that useful on a daily basis
[07:27:20] i mean so dumps can run
[07:27:27] with no overlap
[07:27:41] But do we have space for all that if we include 8 daily snapshots for a week?
[07:27:46] if we do, then definitely
[07:27:47] when backup1002 arrives
[07:27:55] we should have more space
[07:28:05] I would have to make some calculations
[07:28:11] then +1 to go for daily snapshots for a week if we have space
[07:28:15] that'd be very useful I think
[07:28:22] there's one caveat
[07:28:36] More chances to run into race conditions if we run alter tabels
[07:28:37] tables
[07:28:38] current snapshots are 1.3T
[07:28:39] and xtrabackup
[07:29:31] well, yes, but also more chances it where there are no race conditions
[07:29:42] yeah, and less worry if one fails
[07:29:47] cause you have another recent one
[07:29:58] we can maybe do something around it, like a lock or something
[07:30:36] alerts are frequent in general, but not convinced they are for a single host on a single wiki
[07:30:46] to be a blocker
[07:31:45] so we have 2 snapshots
[07:31:52] and we have space for 4 more
[07:32:14] + ongoing
[07:32:42] so 6 stored and 1 ongoing?
[07:32:52] something like that
[07:33:00] there is also growth projections, etc.
[07:33:07] we have to keep in mind that some stuff will be reduced too, like wb_terms
[07:33:09] but yeah
[07:33:11] we are on the edge
[07:33:16] still, 4 more is also good
[07:33:26] as we can always skip for instance, sun
[07:33:45] maybe I can schedule them daily except on tuesdays (dump)
[07:34:01] and increase the retention just a bit for now
[07:34:07] and yes, skip some
[07:34:12] let's say
[07:34:55] M,W,T,F + retention 6 days
[07:35:21] yeah, that's good
[07:35:22] T dumps
[07:35:33] If we keep in mind that effectively, they'd be used a lot for provisioning
[07:35:39] and it is unlikely we'd need to provision during weekends
[07:36:11] although I may need space for binlogs too
[07:36:24] Yeah, I was going to ask if you were counting with that space in your calculations
[07:36:26] but I think that is only 2 more retained
[07:37:04] let me send a patch, see how it goes, we can change it later
[07:37:12] cool!
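The retention arithmetic discussed above can be sketched roughly as follows. This is an illustration only, not the actual backup tooling: it assumes "M,W,T,F" means Monday, Wednesday, Thursday and Friday (Tuesday being the dump day), treats every snapshot as the 1.3T figure quoted in the conversation, and uses hypothetical function names.

```python
from datetime import date, timedelta

# Assumption: "M,W,T,F" = Monday, Wednesday, Thursday, Friday (Tuesday is the dump day).
SNAPSHOT_WEEKDAYS = {0, 2, 3, 4}  # date.weekday(): Monday is 0
RETENTION_DAYS = 6
SNAPSHOT_SIZE_TB = 1.3  # size quoted in the conversation; assumed constant


def retained_snapshots(today, weekdays=SNAPSHOT_WEEKDAYS, retention_days=RETENTION_DAYS):
    """Return the dates of snapshots still kept on `today`.

    Hypothetical rule: a snapshot taken on day D is kept while
    (today - D) < retention_days.
    """
    kept = []
    for age in range(retention_days):
        day = today - timedelta(days=age)
        if day.weekday() in weekdays:
            kept.append(day)
    return kept


def peak_retained(weekdays=SNAPSHOT_WEEKDAYS, retention_days=RETENTION_DAYS):
    """Largest number of snapshots held at once over a weekly cycle."""
    start = date(2020, 3, 2)  # a Monday; any full week gives the same result
    return max(
        len(retained_snapshots(start + timedelta(days=i), weekdays, retention_days))
        for i in range(7)
    )


if __name__ == "__main__":
    peak = peak_retained()
    print(f"peak snapshots retained: {peak}")
    print(f"approximate space needed: {peak * SNAPSHOT_SIZE_TB:.1f}T")
```

Changing `SNAPSHOT_WEEKDAYS` or `RETENTION_DAYS` gives a quick way to compare alternative schedules against the available space; binlog storage and the in-progress snapshot are not modelled here.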
[07:37:36] before you "go" can you do a check on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/577460/ ?
[07:43:23] thanks <3
[07:43:31] https://i.imgflip.com/3rl9tn.jpg
[07:43:41] hahaha
[07:55:16] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2085.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003060755_...
[07:57:18] not a rush, maybe for next week: https://gerrit.wikimedia.org/r/c/operations/puppet/+/577462
[08:06:31] maybe I should ask for budget for next year for an extra dbprov host, thinking about commons growth and things we don't know yet about
[08:16:27] DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2085.codfw.wmnet'] ` and were **ALL** successful.
[08:22:27] jynus: not a bad idea indeed
[08:22:32] at least to allocate it just in case
[08:42:27] DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui)
[08:52:27] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:57:45] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:27:32] marostegui: I was looking at the smartd + buster task, is it ok for me to poke smartd on db1078 ?
[09:27:42] godog: yep, thanks
[09:27:56] ack
[09:28:40] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2084.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003060928_...
[09:48:51] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2084.codfw.wmnet'] ` and were **ALL** successful.
[10:14:53] DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (Marostegui)
[10:15:10] meh, a little bummed re: smartd but not a huge deal IMHO, I've updated the task
[10:15:22] brb
[10:16:21] thanks, I will take a look!
[10:22:36] on the good news, if we are the ones bumping into those issues, we may not be that far away from the curve on buster upgrades
[10:22:49] :)
[10:24:08] heheh for sure not for hp hosts
[10:24:21] only four so far with buster across the fleet
[10:24:24] centrallog2001.codfw.wmnet,cloudmetrics[1001-1002].eqiad.wmnet,db1078.eqiad.wmnet
[10:24:34] Almost in the podium!
[10:43:31] would you see ok to upload 10.1.44?
[10:44:00] sure yeah
[10:44:10] running for 2 weeks on some hosts already
[10:44:37] yeah, go for it
[10:46:59] https://phabricator.wikimedia.org/P10649
[10:47:52] \o÷
[10:47:53] Thanks!
[14:13:09] DBA, Operations, Goal, Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (jcrespo) mydumper-like tool to wrap mysqldump for ES incrementals: ` $ ./backup_es_incremental.py --help usage: backup_es_increment...
[14:16:16] DBA, Operations, Goal, Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (Marostegui) Nice one! :-) If I can suggest (doesn't have to be for this iteration, of course), maybe you can create a `--dry-run` or...
[14:27:50] DBA, Operations, Goal, Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (jcrespo) Could you elaborate on what `--dry-run` would do or skip? * Read the file/original backup * Connect to the database *...
[14:31:21] DBA, Operations, Goal, Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (Marostegui) Sure thing! The scenario I have in mind is: I need to recover a specific incremental because I have to examine several...
[14:51:50] DBA, Operations, Goal, Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (jcrespo) I see, so mostly interested around examining existing backups. I think I would solve your use case this way (not exactly th...
[15:05:44] DBA, Operations, Goal, Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (Marostegui) That sounds good to me. I don't have any strong opinion on whether `--examine` should go to `analyze_dump.py` or to `rec...
[16:58:38] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (bd808) >>! In T246970#5947203, @zhuyifei1999 wrote: > Can I revert to analytics? Yes. No matter why it was switched 8 months ago (?) it should be running against the analytic...
[18:00:17] DBA, Cloud-Services, CPT Initiatives (Developer Portal): Prepare and check storage layer for dev.wikimedia.org - https://phabricator.wikimedia.org/T246946 (Pchelolo)
[21:43:03] DBA, Product-Infrastructure-Team-Backlog: DBA review for the PushNotifications extension - https://phabricator.wikimedia.org/T246716 (Mholloway) @Marostegui Yes, target for release is end of Q4. I think this will involve all new tables.