[01:44:16] <wikibugs>	 DBA, Data-Persistence: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Huji) @Marostegui quick ping that an update as of Oct 20th would be in order.
[03:54:41] <wikibugs>	 DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) They emailed me and required I upload the AHS log via an HTTPS drop box utility, so I did so along with the IML log file. Awaiting reply from HP support.
[05:11:56] <wikibugs>	 DBA, Operations, User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (Marostegui) p:Triage→Medium
[05:21:21] <wikibugs>	 DBA, Data-Persistence: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui) @Huji thanks for the ping. I have a calendar alert for this, but yesterday I was super busy and I couldn't do it, but it is on my radar.
[05:26:24] <wikibugs>	 DBA, Data-Persistence: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui)
[06:04:29] <wikibugs>	 DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for smnwiki - https://phabricator.wikimedia.org/T264900 (Marostegui) a:Marostegui→None
[08:49:06] <wikibugs>	 DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) Sorry for the late response, it was very late in our TZ. Apologies also for not using the template; I was not aware of its existence, at least I've never seen it used before. I kn...
[09:05:56] <wikibugs>	 DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[09:06:02] <wikibugs>	 DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat) p:Triage→Medium
[09:17:45] <wikibugs>	 DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat) Adding `profile::idp::client::httpd`, and configuring orchestrator appropriately should work.
[09:23:28] <wikibugs>	 DBA, Operations, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat) 11:21:49 <jbond42> kormat: if that's the case i would use the header X-CAS-CN (environment variable HTTP_X_CAS_CN) as the default CAS-User header suffers from the case insensitive issue that i...
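Some background on the header and environment-variable pairing jbond mentions. Under the CGI convention, which WSGI also uses, a request header becomes an environment key in three steps: its name is upper-cased, `-` is replaced with `_`, and `HTTP_` is prefixed. That is why `X-CAS-CN` shows up as `HTTP_X_CAS_CN`. The Python sketch below shows only that mapping. The function names are hypothetical, and it does not describe how orchestrator itself reads the header.

```python
from typing import Mapping, Optional


def header_to_environ_key(header_name: str) -> str:
    """Map an HTTP request header name to its CGI/WSGI environ key.

    Header names are case-insensitive on the wire, so the CGI convention
    upper-cases them, replaces '-' with '_' and adds the 'HTTP_' prefix.
    """
    return "HTTP_" + header_name.upper().replace("-", "_")


def remote_sso_user(environ: Mapping[str, str],
                    header_name: str = "X-CAS-CN") -> Optional[str]:
    """Return the user name passed in by the SSO front end, or None.

    Hypothetical helper: assumes the front end sets the header on every
    authenticated request.
    """
    return environ.get(header_to_environ_key(header_name))
```

For example, `header_to_environ_key("x-cas-cn")` and `header_to_environ_key("X-CAS-CN")` both give `"HTTP_X_CAS_CN"`.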
[09:46:50] <jynus>	 so after removing snapshots from long term bacula backups, we are back to an almost 90 day retention
[09:46:59] <jynus>	 which is where we want to be
[09:48:03] <jynus>	 11 weeks
[10:21:45] <wikibugs>	 DBA, Operations, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (MoritzMuehlenhoff)
[10:27:21] <jynus>	 https://speakerdeck.com/shlominoach/vitess-online-schema-migration-automation
[10:28:11] <jynus>	 ^slide 43 is particularly interesting
[10:46:26] <wikibugs>	 DBA, Operations, User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (Marostegui)
[11:09:47] <wikibugs>	 DBA, Data-Persistence, User-Kormat: orchestrator: Select backend database solution - https://phabricator.wikimedia.org/T266003 (Marostegui) Upgraded db2093 from 10.4.12 to 10.4.15. Rebooted it to pick up the new kernels too.
[11:56:55] <marostegui>	 jynus: do we backup tendril events when we do tendril logical backups?
[11:57:37] <jynus>	 we don't backup tendril at all, we just backup zarcillo
[11:57:44] <marostegui>	 not even the schemas?
[11:57:50] <marostegui>	 I thought we backed up the schemas
[11:58:15] <jynus>	 we have a copy of the schemas somewhere, but we couldn't run mydumper on that db without bringing it down
[11:59:04] <marostegui>	 Nothing has happened btw, I was talking to Stevie about tendril and I thought: do we backup the events creation syntax?
[11:59:35] <jynus>	 so formally we have no backups of tendril at all
[11:59:45] <jynus>	 we made some offline copies sometimes
[12:00:16] <marostegui>	 maybe we should try to backup the non-host tables
[12:00:19] <marostegui>	 those can be regenerated
[12:00:41] <jynus>	 last time I tried I could do nothing because of metadata locking
[12:00:55] <jynus>	 but if you find a way, I will be happy to set it up
[12:01:10] <marostegui>	 We can start tendril listening on localhost maybe, and try it
[12:01:15] <marostegui>	 it doesn't have to be now or this week
[12:01:22] <jynus>	 no
[12:01:29] <jynus>	 the problem is the data dictionary
[12:01:42] <marostegui>	 I mean for the non-host tables
[12:01:42] <jynus>	 plus the remote tables it uses
[12:01:46] <jynus>	 it gets all wonky
[12:01:53] <jynus>	 yeah, but even if those are not backed up
[12:02:00] <jynus>	 the tool checks the metadata automatically
[12:02:07] <marostegui>	 Which tool?
[12:02:09] <jynus>	 and bad things started happening
[12:02:12] <jynus>	 mydumper
[12:02:24] <marostegui>	 And if we pass the list of tables manually?
[12:03:09] <jynus>	 as I said, if you find a way, I can help, but I was unable to make it work
[12:03:13] <jynus>	 I tried a few things
[12:03:28] <marostegui>	 ok, nevermind
[12:03:53] <jynus>	 I think the backup strategy we settled was to "copy it from the codfw node"
[12:04:19] <marostegui>	 I am checking the codfw node, and it is way out of sync, plus it has errors (10.4 and no tokudb, so the definition of the tables is broken etc)
[12:04:46] <jynus>	 what is the node name?
[12:05:07] <kormat>	 db2093
[12:05:08] <marostegui>	 db2093
[12:07:51] <marostegui>	 jynus: The reason I am asking to see if we can backup the non host tables is because those are the only ones that cannot be regenerated, as the per-host ones can be regenerated using the tendril-add scripts and all that
[12:07:55] <marostegui>	 Which would enable the events too
[12:08:03] <marostegui>	 But the other ones, those I don't think we have anywhere
[12:08:17] <marostegui>	 Especially global_status_log and global_status_log_5m or something like that, which are key for tendril
[12:08:39] <marostegui>	 Even creating a .sql file with the table definition could work
[12:08:50] <marostegui>	 And placing it on the tendril's repo
[12:09:43] <jynus>	 we cannot with current tooling
[12:10:17] <jynus>	 maybe you can try doing it manually or finding a way it can work?
[12:10:30] <marostegui>	 ok, thanks
[12:10:32] <jynus>	 it is very difficult
[12:10:37] <jynus>	 with tokudb
[12:10:43] <jynus>	 plus lots of write activity
[12:10:49] <jynus>	 plus the large metadata issue
[12:10:58] <jynus>	 I tried but I was unable to do it
[12:11:37] <jynus>	 I believe there was an old structure one-time backup somewhere
[12:13:13] <jynus>	 "Even creating a .sql file with the table definition could work"
[12:13:24] <jynus>	 -> but that should exist already on the tendril repo
[12:14:02] <wikibugs>	 DBA, Operations, User-Kormat, User-jbond: mariadb::config: parameterize event_scheduler - https://phabricator.wikimedia.org/T266119 (Kormat)
[12:14:19] <wikibugs>	 DBA, Operations, User-Kormat: mariadb::config: parameterize event_scheduler - https://phabricator.wikimedia.org/T266119 (Kormat) p:Triage→Medium
[12:16:24] <jynus>	 I am checking the repo and apparently that doesn't exist
[12:16:41] <jynus>	 so it will have to be reverse-engineered
[12:17:10] <jynus>	 I don't think Sean thought of supporting "installing tendril" :-D
[12:19:17] <jynus>	 Re: "do we backup the events creation syntax" that is on the repo
[12:19:36] <jynus>	 but what is not on the repo is the table structure for shared tables
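One quick way to confirm which shared tables have no definition in the repo is to compare the live table list against the `CREATE TABLE` statements in the repo's SQL files. Below is a minimal Python sketch of that comparison. The helper names are hypothetical, the regex does not handle database-qualified names such as `db`.`t`, and the live table list is assumed to come from running `SHOW TABLES` on the tendril host.

```python
import re
from typing import Iterable, Set

# Matches "CREATE TABLE foo", "CREATE TABLE `foo`",
# "CREATE TABLE IF NOT EXISTS foo" (case-insensitive).
CREATE_TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?`?([A-Za-z0-9_]+)`?",
    re.IGNORECASE,
)


def tables_defined_in(sql_text: str) -> Set[str]:
    """Return the table names that have a CREATE TABLE statement in sql_text."""
    return set(CREATE_TABLE_RE.findall(sql_text))


def missing_definitions(live_tables: Iterable[str], sql_text: str) -> Set[str]:
    """Tables that exist on the server but have no CREATE TABLE in the file."""
    return set(live_tables) - tables_defined_in(sql_text)
```

Suppose the repo file only defines `global_status_log`. Then passing `["global_status_log", "global_status_log_5m"]` as the live list returns `{"global_status_log_5m"}`.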
[12:19:40] <mark>	 i think he probably did, but it was part of his shortcuts at the time, as we've all had to make them ;)
[12:19:49] <jynus>	 yeah
[12:20:34] <jynus>	 not blaming him at all
[12:21:51] <jynus>	 I am trying to find where we put the one time backup
[12:22:34] <marostegui>	 Don't worry, I am sending a CR with the empty table schemas
[12:22:38] <marostegui>	 Which is good enough for me for now
[12:23:11] <jynus>	 how did you get it?
[12:23:23] <marostegui>	 With mysqldump
[12:23:30] <marostegui>	 Sending the list of tables
[12:23:39] <jynus>	 and it didn't break the live site?
[12:23:55] <marostegui>	 No, it didn't
[12:23:57] <jynus>	 because I was afraid of that
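The log does not record the exact mysqldump invocation used. Below is a minimal Python sketch, assuming a schema-only dump of an explicitly listed set of tables. `--no-data` omits row data, so only `CREATE TABLE` statements are emitted. `--skip-lock-tables` turns off the `LOCK TABLES` that mysqldump takes by default. Neither flag by itself rules out the metadata-locking problems jynus describes. The host, database name, and function name are hypothetical.

```python
from typing import List, Sequence


def schema_only_dump_cmd(host: str, database: str,
                         tables: Sequence[str]) -> List[str]:
    """Build a mysqldump argv that dumps only the listed tables' definitions.

    --no-data           emit CREATE TABLE statements, no INSERTs
    --skip-lock-tables  do not issue LOCK TABLES before dumping
    Listing tables after the database name restricts the dump to them.
    """
    if not tables:
        raise ValueError("pass at least one table name")
    return [
        "mysqldump",
        f"--host={host}",
        "--no-data",
        "--skip-lock-tables",
        database,
        *tables,
    ]
```

The resulting list can then be run with e.g. `subprocess.run(cmd, stdout=out_file, check=True)` to write a `.sql` file of empty table schemas.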
[12:25:47] <sobanski>	 Side question, does it make sense to bring the codfw node up to date?
[12:25:59] <marostegui>	 I am going to merge https://gerrit.wikimedia.org/r/635535
[12:27:05] <marostegui>	 sobanski: It doesn't support tokudb, so we'd need to convert them to innodb first, but it shouldn't take long if we really need to. Having the global tables somewhere is good I think
[12:27:43] <marostegui>	 sobanski: We can create a task for that if needed, but I think it is low priority
[12:27:45] <jynus>	 yeah, I think the initial idea was to copy from them on event of an issue, but then we had to fight with all the blockers you know
[12:28:12] <sobanski>	 No point then, thanks for the explanation.
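Per marostegui, bringing the codfw node up to date would first mean converting the TokuDB tables to InnoDB. Below is a hedged Python sketch that generates the `ALTER TABLE ... ENGINE=InnoDB` statements from `information_schema.tables` rows. The schema name and helper are hypothetical. The statements would presumably have to run on a host where the TokuDB engine is still loaded, and each `ALTER` rebuilds the table.

```python
from typing import Iterable, List, Tuple

# Standard information_schema query; run it with your own client and
# feed the (table_name, engine) rows to the function below.
TABLE_ENGINES_SQL = (
    "SELECT table_name, engine FROM information_schema.tables "
    "WHERE table_schema = %s"
)


def innodb_conversion_statements(schema: str,
                                 rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Return one ALTER TABLE ... ENGINE=InnoDB per TokuDB table in `schema`."""
    return [
        f"ALTER TABLE `{schema}`.`{name}` ENGINE=InnoDB;"
        for name, engine in rows
        if engine is not None and engine.lower() == "tokudb"
    ]
```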
[12:29:05] <jynus>	 there is in fact, already a ticket: https://phabricator.wikimedia.org/T249085
[12:29:36] <jynus>	 the parent one was the detail of the status until we got blocked: T224589
[12:29:37] <stashbot>	 T224589: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589
[12:32:22] <jynus>	 but all that work stopped
[12:51:02] <jynus>	 there are a few hosts that have a "profiling" memory table, I will ask on T265323 if anyone knows about it, but asking here in case someone (probably non-dbas) knows about it?
[12:51:03] <stashbot>	 T265323: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323
[12:51:09] <jynus>	 not that ticket
[12:51:23] <jynus>	 this one T54921
[12:51:24] <stashbot>	 T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921
[12:53:23] <jynus>	 I found "@tstarling made this as a temporary copy of the profiling table"
[12:54:23] <jynus>	 but I cannot find that on tables.sql?
[12:54:51] <jynus>	 or tables.json
[12:55:05] <jynus>	 I will ask performance team
[12:56:07] <Reedy>	 jynus: it was removed
[12:56:11] <Reedy>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/545308/
[12:56:17] <jynus>	 Reedy: old table?
[12:56:40] <Reedy>	 https://phabricator.wikimedia.org/T231366
[12:56:40] <jynus>	 thanks, very helpful
[12:56:45] <Reedy>	 >As far as I'm aware, this feature has not been in use by either WMF, nor any MW developers, for a long time.
[12:56:47] <Reedy>	 :D
[12:57:05] <jynus>	 I will file a ticket for that
[12:57:23] <jynus>	 will remove it from source backup hosts so it doesn't keep "contaminating" other hosts
[12:57:32] <jynus>	 and will see how many other hosts have it
[12:57:59] <jynus>	 the issue is the table is memory type, which is strange
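One option for finding out how many hosts still carry the `profiling` table is to run an `information_schema.tables` query on each host and collect the results. The Python sketch below assumes the rows have already been fetched per host; the host names and the helper are hypothetical. It excludes `information_schema`, because that schema has its own `PROFILING` table.

```python
from typing import Dict, Iterable, List, Tuple

# Standard information_schema query; run it on each host with your own client.
FIND_PROFILING_SQL = (
    "SELECT table_schema, table_name, engine FROM information_schema.tables "
    "WHERE table_name = 'profiling' AND table_schema <> 'information_schema'"
)


def hosts_with_profiling(
    results: Dict[str, Iterable[Tuple[str, str, str]]]
) -> Dict[str, List[str]]:
    """Map host -> sorted schemas that still have a MEMORY `profiling` table."""
    found: Dict[str, List[str]] = {}
    for host, rows in results.items():
        schemas = [
            schema
            for schema, table, engine in rows
            if schema != "information_schema"
            and table == "profiling"
            and (engine or "").upper() == "MEMORY"
        ]
        if schemas:
            found[host] = sorted(schemas)
    return found
```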
[13:02:16] <wikibugs>	 DBA, Epic, Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (jcrespo)
[13:07:23] <wikibugs>	 DBA, Data-Persistence-Backup: Drop table profiling from WMF wiki mariadb servers - https://phabricator.wikimedia.org/T266125 (jcrespo)
[13:08:48] <wikibugs>	 DBA, Data-Persistence-Backup: Drop table profiling from WMF wiki mariadb servers - https://phabricator.wikimedia.org/T266125 (jcrespo) p:Triage→Medium I will take care of dropping it first on the source backups so those don't contaminate other hosts; other hosts will have to wait until dc switchbac...
[13:09:29] <jynus>	 I think I may wait to do any drop until the switch dc
[13:11:01] <wikibugs>	 DBA, Epic, Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (jcrespo)
[13:11:25] <jynus>	 ^ Reedy thanks again for the help, let me know if I reflected what you told me accurately
[13:30:57] <jynus>	 In some instances, doing a full check tables takes almost 24 hours
[16:21:41] <wikibugs>	 DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) a:jcrespo→RobH Jaime: I didn't realize the DB systems hardware repair cadence was different than the other systems (with DBA team only taking it offline immediately before wo...
[16:22:17] <wikibugs>	 DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) Oh, if it is a mainboard replacement, the host will need a reimage. I assume if that is the case, it can come offline well in advance as it's basically re-entering service as a new hos...
[16:24:19] <wikibugs>	 DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) > the host will need reimage  A reimage is not a problem, even with data loss; the problem is being down for an extended amount of time (e.g. ~1 week).
[16:30:54] <wikibugs>	 DBA, MediaWiki-Parser, Parsoid, serviceops, Platform Team Workboards (Green): CAPEX for ParserCache for Parsoid - https://phabricator.wikimedia.org/T263587 (WDoranWMF) a:WDoranWMF
[16:40:33] <wikibugs>	 DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH)
[16:41:02] <wikibugs>	 DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH)