[04:06:32] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257983 (10Marostegui) 05Open→03Invalid
[04:06:35] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui)
[04:31:37] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) @Jclark-ctr everything done from your side? I see the host is back up. What was done in the end?
[05:04:47] 10DBA, 10MediaWiki-General, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (10Marostegui)
[05:05:54] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[05:06:11] 10DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (10Marostegui)
[06:13:41] 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (10Marostegui) Switchover was done successfully. We had 69 read-only errors only. RO started at 06:01:26 RO stopped at 06:01:45 Total read-only...
[06:23:01] es4 backups broke
[06:23:53] kormat: did you upgrade es1022 recently?
[06:24:28] crap, yes
[06:24:42] heh, did you follow the procedure fully? ;-)
[06:25:06] i'm going to say no :)
[06:25:29] this looks like the last line of https://wikitech.wikimedia.org/wiki/MariaDB#Stretch_+_10.1_-%3E_Buster_+_10.4_known_issues
[06:25:33] but let me check
[06:25:35] yeah, looking at that now
[06:26:14] don't worry, I can take care of it, but if it ends up being that, a heads-up that it is documented there
[06:26:24] ack, thanks
[06:27:31] added to my notes so i don't keep forgetting about it
[06:28:13] set sql_log_bin=0; REVOKE DELETE HISTORY ON *.* from 'dump'@'10.64.32.107';
[06:28:24] ^I just ran that and will force a dump retry
[06:28:30] and that should fix it
[06:29:14] I think the backup technically doesn't break, but because it shows fatal errors in the log, the backup is considered a failure
[06:29:21] so it is ignored
[06:31:07] es4 backups running now, looking good so far
[07:05:58] cool :)
[07:12:02] marostegui: replication is stopped on db2098 - is that known?
[07:12:17] I think that's a backup source
[07:12:30] so yeah, expected
[07:12:37] (if it is one and there's a backup running)
[07:12:51] ahh. i see.
[07:12:54] it is one,
[07:13:15] and yeah, there's a backup running
[07:13:30] i'll ack the alert on icinga i guess
[07:17:24] 10DBA, 10Operations, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat)
[07:17:57] kormat: do you want to do that every day? :-P
[07:18:49] volans: i'm not sure i want to do _anything_ with icinga every day :P
[07:19:40] s/with icinga//
[07:19:52] well, that too :)
[07:44:49] 10DBA, 10Operations, 10SRE-tools, 10Patch-For-Review, 10User-Kormat: Add native mysql module to spicerack - https://phabricator.wikimedia.org/T255409 (10Kormat) 05Open→03Resolved Ok, as the basic module is now in place (with unit tests!), i'm going to close this task in favour of smaller-scoped ones...
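For reference, a minimal sketch of the grant cleanup run at 06:28 above, assuming (per the linked wikitech known-issues section) that the Stretch/10.1 to Buster/10.4 upgrade leaves the dump user with an extra DELETE HISTORY grant whose fatal errors in the log make the backup count as a failure; the user@host pair is the one quoted in the log:

    -- Inspect the grants first to confirm the extra privilege is there:
    SHOW GRANTS FOR 'dump'@'10.64.32.107';

    -- Keep the change out of the binlog, then drop the extra grant:
    SET SESSION sql_log_bin = 0;
    REVOKE DELETE HISTORY ON *.* FROM 'dump'@'10.64.32.107';

After this, retrying the dump should no longer produce the grant-related fatal errors that caused the run to be ignored.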
[08:25:50] 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1120.eqiad.wmnet'] ` The log...
[08:31:51] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10daniel)
[08:43:12] 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1120.eqiad.wmnet'] ` and were **ALL** successful.
[08:48:22] 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (10Marostegui)
[08:50:51] 10DBA, 10Operations, 10User-Kormat: Add monitoring to ensure consistency between puppet and zarcillo - https://phabricator.wikimedia.org/T257821 (10Kormat)
[08:51:02] 10DBA, 10Operations, 10User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (10Kormat)
[08:51:12] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Kormat)
[09:16:40] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) > we've not had any issues anymore I honestly don't trust tendril, we said many times "Issues seems now fixed/mitigated" and they end up coming back. I think for performance reasons we...
[09:25:08] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) Alternatively, we can request virtual machines on production for both dc instances and that way we can easily separate both services so they don't interact, until tendril goes away.
[09:33:46] Interesting stuff: https://mysqlserverteam.com/the-mysql-8-0-21-maintenance-release-is-generally-available/
[09:41:21] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Marostegui) >>! In T257816#6307674, @jcrespo wrote: >> we've not had any issues anymore > > I honestly don't trust tendril, we said many times "Issues seems now fixed/mitigated" and they end up...
[09:42:31] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) > We cannot be 100% sure that wherever we host zarcillo will always be up, especially if shared with more stuff. Hence see my last comment.
[09:45:01] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Marostegui) That also means introducing even more infra - which would also be different from the rest (VMs) - why not trying to make the retrying process a bit easier or auto-healing (maybe even...
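As background to the T257816 replication discussion above, a minimal sketch of what pointing a replica at a zarcillo primary could look like with standard MariaDB GTID replication; the host name and credentials are placeholders, not the actual production endpoints:

    -- Hypothetical: run on the replica. MASTER_HOST, MASTER_USER and
    -- MASTER_PASSWORD are illustrative placeholders.
    CHANGE MASTER TO
      MASTER_HOST = 'zarcillo-primary.example.wmnet',
      MASTER_USER = 'repl',
      MASTER_PASSWORD = '********',
      MASTER_USE_GTID = slave_pos,
      MASTER_SSL = 1;
    START SLAVE;

    -- Then verify in the mysql client that both threads are running:
    SHOW SLAVE STATUS\G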
[10:20:04] 10DBA, 10Patch-For-Review: Create more tests for transferpy package - https://phabricator.wikimedia.org/T257600 (10Aklapper)
[10:27:05] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui)
[10:27:08] 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (10Marostegui) 05Open→03Resolved All done, x1 fully running Buster and MariaDB 10.4
[10:39:59] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) > let's tackle that I would say I am sorry that hosting the backup logs database was such overhead, I honestly thought it was much less resource intensive for DBAs. I will ask for reso...
[10:46:41] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Marostegui) I have never said it is an overhead and you know very well it is not resource intensive - my point is: let's try not have more special cases and let's try to have things as consisten...
[11:04:28] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) I've created T258045 for the backups database. You can freely decide about zarcillo now.
[11:04:43] seriously jaime?
[11:09:52] what's the issue?
[11:11:09] the issue is that you understood what you wanted to understand on that task
[11:11:11] but nevermind
[11:11:24] Looks like you already have a plan
[11:14:29] I am following what we decided in the meeting - move zarcillo to misc
[11:14:38] no, that wasn't decided
[11:14:51] I cannot decide for zarcillo, but I think it is good to decouple zarcillo inventory
[11:15:11] from backups, for which being on zarcillo was kinda tacked on
[11:15:32] I think it makes sense to have zarcillo alongside tendril
[11:15:38] I am agreeing with you
[11:16:00] but let me move backups to m1, which is more stable?
[11:16:01] I don't think you are agreeing with me
[11:16:27] and it agrees with everything you said: standardized resources for it
[11:16:47] no more special "backup host"
[11:16:57] it is just m1, I am trying to be constructive here really
[11:17:01] The conversation on the task has moved from let's discuss to: sorry for having a database which is a pain (I have never said that) and then me replying that it is not and you finishing the discussion with "I will move this to m1"
[11:17:45] no, I can see it being "special", and it was
[11:17:50] In the meeting we talked about m1, but we never decided, we said that we could enable replication for now and we can discuss later
[11:17:58] I am not married to m1
[11:18:29] but I think it should be decoupled from zarcillo
[11:18:45] so I am not saying we decided on m1
[11:18:58] I said we flirted with the idea of it being a "proper" misc
[11:19:05] I think it was you who proposed that?
[11:19:37] It was me, but we agreed not to do it for now
[11:19:38] splitting the easy part (backups) out will make me less worried
[11:19:44] ok, I missed that
[11:20:00] but maybe if it is decoupled, it makes more sense?
[11:20:17] Is there a hard reason to keep backup logs AND zarcillo together?
[11:20:43] I don't know, that subject just came up by surprise
[11:20:56] because I may sound hard, but once that is out of the equation, I don't care that much about how zarcillo ends up
[11:21:04] I don't mean it in a disrespectful way
[11:21:29] I don't know how to express: hey, let me do this, which will make me less worried
[11:21:51] https://phabricator.wikimedia.org/T257816#6307873 this comment did sound very passive aggressive, especially because you know it is neither resource intensive nor an overhead
[11:21:53] because I don't worry about the rest of zarcillo being there
[11:22:28] And my point wasn't the short term, but the long term, which is making your life easier, without having to worry as much about the retries
[11:22:36] sorry, I got disconnected
[11:22:39] let me check the log
[11:23:05] let me try to be constructive
[11:23:13] by elaborating on my train of thought
[11:23:37] there is a change needed that we all generally agree with
[11:23:42] the ticket itself, right?
[11:24:01] I am worried about the impact on backups... because I am the backups guy :-D
[11:24:26] I don't answer the "let's fix X instead" because that is an offtopic conversation
[11:24:32] IMHO
[11:24:37] you started the offtopic conversation
[11:24:48] for me it is ontopic
[11:24:53] because it impacts backups
[11:24:53] My point is that moving it to m1 won't address the problem of costly retries
[11:24:59] you'll have the same issues there if that happens
[11:25:19] My point across the ticket is replying to what you started: tendril crashing and causing an overhead on your work
[11:25:33] ok, I (respectfully) disagree
[11:25:43] and your solution is to move it to m1 which doesn't really fix it - e.g. the m1 master crashed last week
[11:25:47] but because I think backups are indeed not that important
[11:25:49] for the ticket
[11:25:57] I offer to move them independently
[11:25:58] my point is: let's try to allocate time to fix that so your life is a bit easier
[11:26:02] so I am no longer a blocker
[11:26:15] that's all
[11:26:59] there is a reason to not put tendril on m1
[11:27:17] no one talked about moving tendril to m1, that's impossible and you know that
[11:27:18] but unless you have a good reason not to put backup metadata on m1, it is a win-win!
[11:27:32] unless I am not understanding?
[11:27:57] s/m1/whatever other place you prefer or advise/
[11:28:56] I am trying to find a way that works around both your issues and mine?
[11:29:07] but I may have misunderstood
[11:29:33] backup metadata is a "regular misc db", it is not a special db like tendril or zarcillo
[11:29:52] it would in fact live together with the bacula db
[11:30:11] sure, but you'll still have the same problems
[11:30:15] but ok, move them to m1
[11:30:32] I don't really want to keep discussing this, looks like we are on different pages on how the conversation went on the task really
[11:32:38] but is there a technical reason against my proposal?
[11:33:24] again: you'll have the same issue as you'd have in any other place, which was your main concern on the task
[11:33:38] and which started this whole discussion
[11:34:01] so, from my point of view, that solves my only objection to the original ticket
[11:34:30] that's not how I read it, but if that's what solves your problem, then sure, go ahead
[11:35:27] but maybe we had a misunderstanding in the middle of the discussion
[12:20:50] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Jclark-ctr) @marostegui Yes all items finished sorry for not commenting. Dell did not come till very late yesterday
[12:21:44] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) Thanks @Jclark-ctr - just for the record in case this host has future issues, was the mainboard and DIMM modules replaced as well as the hard disk?
[12:38:31] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) The RAID looks good ` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-...
[13:50:37] 10DBA, 10OTRS, 10Operations, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10jcrespo) a:05jcrespo→03akosiaris A clone of the otrs database has been setup on db1077. The question now, @akosiaris, is what needs acce...
[14:28:55] 10DBA, 10Operations, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) So it turns out that work on T257816 unveiled that there were a lot of hardcoded endpoints that made that task, not only an option, but a requirement to achieve this one. More work will...
[14:48:09] 10DBA, 10OTRS, 10Operations, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10akosiaris) >>! In T257928#6308424, @jcrespo wrote: > A clone of the otrs database has been setup on db1077. The question now, @akosiaris, is...
[14:51:41] 10DBA, 10OTRS, 10Operations, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10jcrespo) Np. More questions: do we setup it with a separate user/password (to avoid mistakes with the production db) or the same (for conven...
[14:55:23] 10DBA, 10OTRS, 10Operations, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10akosiaris) >>! In T257928#6308662, @jcrespo wrote: > Np. More questions: do we setup it with a separate user/password (to avoid mistakes wit...
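On the separate user/password question for the OTRS clone raised at 14:51:41, a hedged sketch of what a dedicated account scoped to the cloned schema might look like; the account name, host mask, password, and schema name are illustrative, not what was actually deployed:

    -- Hypothetical: a dedicated account that can only read the frozen clone,
    -- so production OTRS credentials cannot touch it by accident.
    CREATE USER 'otrs_clone'@'10.64.%' IDENTIFIED BY '********';
    GRANT SELECT ON otrs_clone.* TO 'otrs_clone'@'10.64.%';

Since the snapshot is meant to stay frozen, a SELECT-only grant would also guard against accidental writes, though the task leaves open whether reusing the production credentials is more convenient.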
[15:02:43] 10DBA, 10OTRS, 10Operations, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10jcrespo) a:05akosiaris→03jcrespo
[15:51:19] 10DBA, 10Cloud-Services: Prepare and check storage layer for avkwiki - https://phabricator.wikimedia.org/T258077 (10Urbanecm)
[19:20:56] 10DBA, 10Core Platform Team: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal)
[19:27:44] 10DBA, 10Core Platform Team: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal)
[19:28:43] 10DBA, 10Core Platform Team: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal)
[20:11:48] 10DBA, 10Core Platform Team: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal)
[20:43:09] 10DBA, 10Core Platform Team: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal)
[20:47:00] 10DBA, 10Core Platform Team: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal)
[20:51:57] 10DBA, 10Core Platform Team, 10Growth-Team, 10MediaWiki-Watchlist: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal)
[21:14:02] 10DBA, 10Core Platform Team, 10Growth-Team, 10MediaWiki-Watchlist: Purge unused watchlist rows - https://phabricator.wikimedia.org/T258098 (10MusikAnimal) Other ideas... make use of #expiring-watchlist-items automatically. Say, when a page is deleted, MediaWiki can set the watchlist expiry to 5 years. We c...
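A sketch of the idea in that last T258098 comment, assuming it would build on MediaWiki's watchlist_expiry table from the expiring-watchlist-items work (we_item referencing watchlist.wl_id, we_expiry a YYYYMMDDHHMMSS timestamp); the namespace/title values are placeholders, and a real implementation would live in MediaWiki code rather than raw SQL:

    -- Hypothetical: when a page is deleted, cap its watchlist rows at a
    -- 5-year expiry instead of keeping them around forever.
    INSERT INTO watchlist_expiry (we_item, we_expiry)
    SELECT wl_id, DATE_FORMAT(NOW() + INTERVAL 5 YEAR, '%Y%m%d%H%i%s')
    FROM watchlist
    WHERE wl_namespace = 0
      AND wl_title = 'Example_deleted_page'  -- placeholder
    ON DUPLICATE KEY UPDATE we_expiry = VALUES(we_expiry);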