[01:44:16] DBA, Data-Persistence: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Huji) @Marostegui quick ping that an update as of Oct 20th would be in order.
[03:54:41] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) They emailed me and required I upload the AHS log via a https drop box utility, so I did so along with the IML log file. Awaiting reply from HP support.
[05:11:56] DBA, Operations, User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (Marostegui) p:Triage→Medium
[05:21:21] DBA, Data-Persistence: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui) @Huji thanks for the ping. I have a calendar alert for this, but yesterday I was super busy and I couldn't do it, but it is on my radar.
[05:26:24] DBA, Data-Persistence: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui)
[06:04:29] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for smnwiki - https://phabricator.wikimedia.org/T264900 (Marostegui) a:Marostegui→None
[08:49:06] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) Sorry for the late response, it was very late on our TZ. Apologies also for not using the template, I was not aware of its existence, at least I've never seen it used before. I kn...
[09:05:56] DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[09:06:02] DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat) p:Triage→Medium
[09:17:45] DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat) Adding `profile::idp::client::httpd`, and configuring orchestrator appropriately should work.
[09:23:28] DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat) 11:21:49 kormat: if that's the case i would use the header X-CAS-CN (environment variable HTTP_X_CAS_CN) as the default CAS-User header suffers from the case insensitive issue that i...
[09:46:50] so after removing snapshots from long term bacula backups, we are back to an almost 90 day retention
[09:46:59] which is where we want to be
[09:48:03] 11 weeks
[10:21:45] DBA, Operations, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (MoritzMuehlenhoff)
[10:27:21] https://speakerdeck.com/shlominoach/vitess-online-schema-migration-automation
[10:28:11] ^slide 43 is particularly interesting
[10:46:26] DBA, Operations, User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (Marostegui)
[11:09:47] DBA, Data-Persistence, User-Kormat: orchestrator: Select backend database solution - https://phabricator.wikimedia.org/T266003 (Marostegui) Upgraded db2093 from 10.4.12 to 10.4.15. Rebooted it to pick up the new kernels too.
[11:56:55] jynus: do we backup tendril events when we do tendril logical backups?
[11:57:37] we don't backup tendril at all, we just backup zarcillo
[11:57:44] not even the schemas?
[11:57:50] I thought we backed up the schemas
[11:58:15] we have a copy of the schemas somewhere, but we couldn't run mydumper on that db without bringing it down
[11:59:04] Nothing has happened btw, I was talking to Stevie about tendril and I thought: do we backup the events creation syntax?
[11:59:35] so formally we have no backups of tendril at all
[11:59:45] we made some offline copies sometimes
[12:00:16] maybe we should try to backup the non-host tables
[12:00:19] those can be regenerated
[12:00:41] last time I tried I could do nothing because of metadata locking
[12:00:55] but if you find a way, I will be happy to set it up
[12:01:10] We can start tendril listening on localhost maybe, and try it
[12:01:15] it doesn't have to be now or this week
[12:01:22] no
[12:01:29] the problem is the data dictionary
[12:01:42] I mean for the non-host tables
[12:01:42] plus the remote tables it uses
[12:01:46] it gets all wonky
[12:01:53] yeah, but even if those are not backed up
[12:02:00] the tool checks the metadata automatically
[12:02:07] Which tool?
[12:02:09] and bad things started happening
[12:02:12] mydumper
[12:02:24] And if we pass the list of tables manually?
[12:03:09] as I said, if you find a way, I can help, but I was unable to make it work
[12:03:13] I tried a few things
[12:03:28] ok, nevermind
[12:03:53] I think the backup strategy we settled on was to "copy it from the codfw node"
[12:04:19] I am checking the codfw node, and it is way out of sync, plus it has errors (10.4 and no tokudb, so the definition of the tables is broken etc)
[12:04:46] what is the node name?
[12:05:07] db2093
[12:05:08] db2093
[12:07:51] jynus: The reason I am asking to see if we can backup the non host tables is because those are the only ones that cannot be regenerated, as the per-host ones can be regenerated using the tendril-add scripts and all that
[12:07:55] Which would enable the events too
[12:08:03] But the other ones, those I don't think we have anywhere
[12:08:17] Especially global_status_log and global_status_log_5m or something like that, which are key for tendril
[12:08:39] Even creating a .sql file with the table definition could work
[12:08:50] And placing it on the tendril's repo
[12:09:43] we cannot with current tooling
[12:10:17] maybe you can try doing it manually or finding a way it can work?
[12:10:30] ok, thanks
[12:10:32] it is very difficult
[12:10:37] with tokudb
[12:10:43] plus lots of write activity
[12:10:49] plus the large metadata issue
[12:10:58] I tried but I was unable to do it
[12:11:37] I believe there was an old structure one-time backup somewhere
[12:13:13] "Even creating a .sql file with the table definition could work"
[12:13:24] -> but that should exist already on the tendril repo
[12:14:02] DBA, Operations, User-Kormat, User-jbond: mariadb::config: parameterize event_scheduler - https://phabricator.wikimedia.org/T266119 (Kormat)
[12:14:19] DBA, Operations, User-Kormat: mariadb::config: parameterize event_scheduler - https://phabricator.wikimedia.org/T266119 (Kormat) p:Triage→Medium
[12:16:24] I am checking the repo and apparently that doesn't exist
[12:16:41] so it will have to be reverse-engineered
[12:17:10] I don't think Sean thought of supporting "installing tendril" :-D
[12:19:17] Re: "do we backup the events creation syntax" that is on the repo
[12:19:36] but what is not on the repo is the table structure for shared tables
[12:19:40] i think he probably did, but it was part of his shortcuts at the time, as we've all had to make them ;)
[12:19:49] yeah
[12:20:34] not blaming him at all
[12:21:51] I am trying to find where we put the one time backup
[12:22:34] Don't worry, I am sending a CR with the empty table schemas
[12:22:38] Which is good enough for me for now
[12:23:11] how did you get it?
[12:23:23] With mysqldump
[12:23:30] Sending the list of tables
[12:23:39] and it didn't break the live site?
[12:23:55] No, it didn't
[12:23:57] because I was afraid of that
[12:25:47] Side question, does it make sense to bring the codfw node up to date?
[12:25:59] I am going to merge https://gerrit.wikimedia.org/r/635535
[12:27:05] sobanski: It doesn't support tokudb, so we'd need to convert them to innodb first, but it shouldn't take long if we really need to. Having the global tables somewhere is good I think
[12:27:43] sobanski: We can create a task for that if needed, but I think it is low priority
[12:27:45] yeah, I think the initial idea was to copy from them in the event of an issue, but then we had to fight with all the blockers you know
[12:28:12] No point then, thanks for the explanation.
[12:29:05] there is in fact, already a ticket: https://phabricator.wikimedia.org/T249085
[12:29:36] the parent one was the detail of the status until we got blocked: T224589
[12:29:37] T224589: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589
[12:32:22] but all that work stopped
[12:51:02] there are a few hosts that have a "profiling" memory table, I will ask on T265323 if anyone knows about it, but asking here in case someone (probably non-dbas) knows about it?
[12:51:03] T265323: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323
[12:51:09] not that ticket
[12:51:23] this one T54921
[12:51:24] T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921
[12:53:23] I found "@tstarling made this as a temporary copy of the profiling table"
[12:54:23] but I cannot find that on tables.sql?
[12:54:51] or tables.json
[12:55:05] I will ask performance team
[12:56:07] jynus: it was removed
[12:56:11] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/545308/
[12:56:17] Reedy: old table?
[12:56:40] https://phabricator.wikimedia.org/T231366
[12:56:40] thanks, very helpful
[12:56:45] >As far as I'm aware, this feature has not been in use by either WMF, nor any MW developers, for a long time.
[12:56:47] :D
[12:57:05] I will file a ticket for that
[12:57:23] will remove it from source backup hosts so it doesn't keep "contaminating" other hosts
[12:57:32] and will see how many other hosts have it
[12:57:59] the issue is the table is memory type, which is strange
[13:02:16] DBA, Epic, Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (jcrespo)
[13:07:23] DBA, Data-Persistence-Backup: Drop table profiling from WMF wiki mariadb servers - https://phabricator.wikimedia.org/T266125 (jcrespo)
[13:08:48] DBA, Data-Persistence-Backup: Drop table profiling from WMF wiki mariadb servers - https://phabricator.wikimedia.org/T266125 (jcrespo) p:Triage→Medium I will take care of dropping it first on the source backups so those don't contaminate other hosts, other hosts will have to wait until dc switchbac...
[13:09:29] I think I may wait to do any drop until the switch dc
[13:11:01] DBA, Epic, Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (jcrespo)
[13:11:25] ^ Reedy thanks again for the help, let me know if I reflected what you told me accurately
[13:30:57] In some instances, doing a full check tables takes almost 24 hours
[16:21:41] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) a:jcrespo→RobH Jaime: I didn't realize the DB systems hardware repair cadence was different than the other systems (with DBA team only taking it offline immediately before wo...
[16:22:17] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) Oh, if it is a mainboard replacement, the host will need reimage. I assume if that is the case, it can come offline well in advance as it's basically re-entering service as a new hos...
[16:24:19] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) > the host will need reimage A reimage is not a problem, even with data loss - the problem is being down for an extended amount of time (e.g. ~1 week).
[16:30:54] DBA, MediaWiki-Parser, Parsoid, serviceops, Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (WDoranWMF) a:WDoranWMF
[16:40:33] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH)
[16:41:02] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH)
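
Note on the 12:22-12:23 exchange (empty table schemas taken "With mysqldump", "Sending the list of tables"): the log does not record the exact command, so the following is only a minimal sketch of what a schema-only dump restricted to a table list might look like. The database name `tendril`, the helper name `schema_only_dump_cmd`, and the choice of `--skip-lock-tables` are assumptions for illustration; the table names are the ones quoted in the chat ("or something like that"). The helper only builds the argument list and does not connect to any server.

```python
import shlex


def schema_only_dump_cmd(database, tables, host="localhost"):
    """Return a mysqldump argv that dumps only the definitions of `tables`."""
    if not tables:
        # An empty list would make mysqldump dump every table in the database.
        raise ValueError("pass an explicit list of tables")
    return [
        "mysqldump",
        f"--host={host}",
        "--no-data",           # CREATE TABLE statements only, no rows
        "--skip-lock-tables",  # assumption: avoid LOCK TABLES on a busy server
        database,
        *tables,
    ]


if __name__ == "__main__":
    # "tendril" as the database name is an assumption; the table names are
    # the ones mentioned in the chat.
    cmd = schema_only_dump_cmd(
        "tendril", ["global_status_log", "global_status_log_5m"]
    )
    print(shlex.join(cmd))
```

Passing the table names explicitly keeps the dump away from the per-host tables, which according to the chat can be regenerated with the tendril-add scripts anyway.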