[04:35:50] sobanski: +1 to create a task for CPU saturation dashboard but not for the s8 response time one. there's not much to dig there, it was kinda expected to have issues with S1 (enwiki) and s8 (wikidatawiki) right after the switch for a few minutes as the load shift is huge for those two
[04:37:07] jynus: yeah, we have many of those non actionable tasks, I would love to get rid of them really. I proposed that years ago and it was rejected
[04:37:46] sobanski: +1 to the cumin aliases task too!
[05:01:47] DBA, SRE-tools, conftool, serviceops, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Marostegui) p:Triage→High
[06:02:11] DBA, SRE-tools, conftool, serviceops, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Joe) restarting all confds before switching DC seems overkill and frankly useless. We should...
[07:06:55] DBA, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): Drop openstack databases from m5-master - https://phabricator.wikimedia.org/T261152 (Marostegui) This has been executed: ` drop user if exists 'glance'@'%'; drop user if exists 'keystone'@'%'; drop user if exists 'neutron'@'%'; drop...
[07:07:10] arturo: I have executed this on m5: https://phabricator.wikimedia.org/T261152#6429284 it is supposed to drop unused grants as the databases were removed, but if you see something broken on m5, please let me know.
I have a copy of the grants in case we need to revert
[07:38:40] ack, thanks marostegui
[07:40:38] DBA, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): Drop openstack databases from m5-master - https://phabricator.wikimedia.org/T261152 (Marostegui) Open→Resolved
[07:55:21] marostegui, kormat : sorry I might have missed some context from yesterday, what aliases are missing? I felt that with the existing ones you could select all that's needed
[07:55:43] volans: i have to run now, i'll get back to you on that
[07:55:49] sure, no hurry
[08:55:38] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui)
[09:00:04] I did x1 maintenance early in the morning
[09:00:53] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui) @Andrew @CDanis when doing the initial step of pooling the future wikitech master with weight 0 (https://phabricator.wikimedia.org/P1243...
[09:01:53] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui)
[09:25:43] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui)
[10:06:02] DBA, SRE-tools, conftool, serviceops, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (jcrespo)
[11:10:13] volans, sobanski: re: cumin aliases.
when all the db masters started paging, i wasn't confident that i could craft cumin patterns to match only the relevant servers
[11:10:26] so i ended up just sshing to all of the servers to run puppet manually
[11:10:52] i need to audit the existing cumin aliases to see exactly what's in them
[11:11:42] e.g. doing `A:db-role-master and (A:db-section-s1 or A:db-section-s2 or A:db-section-s3...)` just isn't manageable
[11:16:54] kormat: and A:db-core-eqiad
[11:16:57] I was thinking
[11:17:18] or A:db-all-eqiad
[11:17:20] based on necessity
[11:19:10] `A:db-role-master and A:db-all-eqiad and not A:db-misc-servers` might have worked
[11:19:31] A:db-core-eqiad was not enough?
[11:19:40] missing misc?
[11:19:46] volans: that wouldn't have touched parsercache, for example
[11:20:22] that have db-parsercache
[11:20:31] volans: sure.
[11:20:38] but yeah it's harder on the spot
[11:20:51] i'm not claiming it wasn't possible to craft the pattern. i just didn't have the confidence i could do it correctly under pressure.
[11:21:01] and also i think a bit of cleanup of the aliases wouldn't hurt
[11:21:19] sure, the other side of the coin is that more aliases are harder to remember
[11:21:25] there's no way to select all DBs, for example, without doing `A:db-all-codfw or A:db-all-eqiad`
[11:21:32] I find myself often doing grep on /etc/cumin/aliases.yaml
[11:21:59] +1 for a cleanup/refactor
[11:22:07] `db-misc` vs `db-misc-servers` confuses me
[11:22:55] indeed
[11:23:05] s/misc/other/ ?
[11:23:16] we use the misc word in that acception elsewhere
[11:23:21] but for dbs we have also misc dbs
[11:23:24] so that's confusing for sure
[11:23:37] like dbmonitor vs debmonitor and others :-P
[11:29:41] volans: how strongly do you feel about keeping cumin aliases sorted?
[11:31:14] kormat: what are you proposing?
I think the manual ones should be sorted for ease of use, the ones generated in loops by erb doesn't matter I guess
[11:31:35] also, if you want to be nerdsniped, I have somewhere 80% of a prototype of a bash completion for aliases
[11:31:48] volans: i was going to sort the types by importance, but it's a minor thing
[11:33:05] maybe we should have db-parsercache, db-core (s1-s8, and es), db-misc (anything that belongs to mX)
[11:33:33] marostegui: where does x1 fit?
[11:33:39] kormat: on db-core
[11:33:43] ok
[11:33:44] kormat: also, within the db- prefix I don't really care honestly, whatever works for you better
[11:33:56] alright. i'll send a CR in a bit
[11:44:27] one goal is that you should never need to know the exact role or profile names for common operations.
[11:52:55] DBA, User-Kormat: Enable DB replication codfw -> eqiad before the switchover and some other checks - https://phabricator.wikimedia.org/T243373 (Marostegui) I have checked that replication works fine from codfw to eqiad by checking a few tables, sections checked: - pc1,pc2, pc3 - es4, es5 - s1 enwiki - s...
[12:27:29] kormat: A:DTRT :-P
[12:28:37] i'll file a wishlist issue against cumin for that :)
[12:31:53] lol
[12:38:40] marostegui: there's an entry in gcal for "maintenance can start on codfw" for tomorrow morning. should that be "on eqiad" by any chance? :)
[12:38:52] oh yes :)
[12:38:54] let me fix it
[12:39:14] fixed, thanks :*
[12:42:07] volans: is there a way to tell cumin to use a custom aliases file?
[12:42:40] ah, same dir, ok.
https://cumin.readthedocs.io/en/stable/configuration.html#aliases-yaml
[12:42:58] kormat: yes and no, you can use a custom config file and that's defining the path
[12:43:01] of the aliases
[12:43:04] yeah
[12:52:22] volans: actual feature request for cumin: add a flag so output the nodelist on stdout
[12:52:32] so that another tool can iterate over them
[12:52:42] s/flag so/flag to/
[12:53:18] kormat: btw we have nodeset (the binary from clustershell) installed on the cumin hosts
[12:53:21] to play with those
[12:54:06] yes - the issue is that cumin prints the nodeset + 2 other lines to stderr, which makes it painful to play with
[12:56:39] for example - right now it would be neat if i could write a script to call cumin and then iterate over the nodeset to check that zarcillo agrees that all of the nodes are in section X
[12:57:36] but nodes in section doesn't make much sense... instances are in sections, not servers
[12:58:05] cumin is a good node orchestrator of nodes, not so much of instances
[12:59:20] jynus: with the new `db-multiinstance` alias in my CR we can work around the instance vs server distinction
[13:01:14] it is hard, there is no simple "cumin A:all 'SELECT 1'"
[13:02:58] I am guessing it will become something like cumin A:-... and db:multiinstance 'while /run/mysqld/*; do ... mysql -S ...;done' ?
[13:04:32] that will at least be possible now.
ideally we'll have cookbooks to automate the common tasks, but sometimes you just want an adhoc way to do things
[13:05:58] kormat: https://etherpad.wikimedia.org/p/volans-tmp starting line 23
[13:06:36] volans: hah :)
[13:06:54] yeah i can probably make a small wrapper python script to do what i want right now
[13:10:56] kormat: but please do it, I think we could also just print to stderr the resulting hosts if no command is passed
[13:11:10] as right now they are sent to stderr because only the command output is sent to stdout
[13:13:01] one task for you, as requested :)
[13:14:03] I assumed you also wanted to do the onors to make the patch :-P
[13:14:10] *honors
[13:14:22] and be part of the very restricted elite list of cumin's committers
[13:15:14] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui) Reserved failover window on the deployment's calendar
[13:15:41] volans: ah haha. such an honor :)
[13:16:34] jokes apart, as you want, I can make it pretty quickly (says the man that didn't look at the code)
[13:17:05] oh now I really feel tempted to tweak the @retry error wording for that hall of fame
[13:17:34] * volans hit 2 stones with one bird, achievement unlocked
[13:17:57] yeah I would almost miss that volans just turned my enhancement proposal around
[13:18:11] eheheh
[13:28:24] wat, no f-strings
[13:28:34] OLD CODE
[13:29:19] not even type hints (yet) ;)
[13:49:10] Hello, dear DBAs, I have a question. So I need to create a new table in the database for a merged patch. I've been told that I just go and do 'sql.php --wiki=metawiki extensions/blabla/schema.sql' - is that really it? just like that?
[13:49:35] Pchelolo: Probably devs can help you with that, as we don't really create tables, it is done as part of the train
[13:49:41] maybe Amir1 or Reedy can help here
[13:49:46] lol I just sent him here
[13:49:51] haha
[13:50:05] hehe :)
[13:50:15] this is like a quest
[13:50:16] We never create tables, unless it is an emergency and we do it directly on the affected wiki
[13:50:29] Pchelolo: You can do `sql metawiki --write` and paste the SQL and then have the DBAs tell you off too
[13:50:42] or is it `sql --write metawiki` these days
[13:50:44] Pchelolo: you seek the knights of the round Table. ;)
[13:50:45] whichever
[13:51:07] Reedy: which host should I be doing the sql.php --wiki=metawiki extensions/blabla/schema.sql from in prod?
[13:51:27] mwmaint?
[13:51:33] Pchelolo: but it is usually created by whoever is doing the release or that is what I have always thought
[13:51:46] Yeah, mwmaint (codfw) should be fine
[13:52:02] For a run like that, you could probably use the deploy host. But opinions vary on that one
[13:52:55] oki. thank you. It's a bit scary, never done that before and it feels too easy
[13:54:39] Pchelolo: once you are done, I can double check for you if the table was created where you expected it :)
[13:55:04] thank you. I'll ping you in a bit, having a meeting now
[13:55:28] sure
[13:55:30] no worries
[14:00:50] DBA: Create a database CPU saturation dashboard for codfw - https://phabricator.wikimedia.org/T261868 (LSobanski)
[14:04:59] there are issues on db2120?
[14:06:09] it is probably going to page
[14:06:24] I am checking it
[14:06:47] it crashed?
[14:06:55] Uptime | 481
[14:06:58] I don't know, I am on it
[14:07:04] let's depool
[14:09:18] did you start replication, icinga said it was stopped, but it is now running
[14:09:25] yes
[14:10:18] ugly error log
[14:10:31] I am on it
[14:14:33] Sep 02 13:58:20 db2120 mysqld[1095]: 2020-09-02 13:58:20 107197705 [ERROR] InnoDB: Database page corruption on disk or a failed file read of tablespace metawiki/content page [page id: space=4936, page number=49350]. You may have to recover from a backup.
[14:14:47] DBA: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui)
[14:15:13] marostegui: i can take that task
[14:15:19] kormat: thank you
[14:15:48] DBA: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Kormat) a:Kormat
[14:16:05] DBA, User-Kormat: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Kormat)
[14:16:16] mw stopped complaining at 07
[14:17:20] kormat: let's full-upgrade it too, so we can have the new kernel installed to advance on the kernel upgrade task too
[14:17:21] DBA, User-Kormat: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui) HW logs look ok, controller logs and ilo ones. This host history is {T236453}
[14:17:36] marostegui: 👍
[14:17:45] for reference: https://logstash.wikimedia.org/goto/59a7fa6aeea32b131a52134d3a71f62c
[14:18:02] I am going to send a bug to mariadb
[14:18:30] I guess crash happened at 13:58, it came back at 14:01 and then it got lag check errors
[14:18:35] kormat: it will get the new 10.4.14 so, reminder: mysql_upgrade
[14:18:42] marostegui: ack
[14:18:59] marostegui: suspect a similar trend? I haven't seen the other ticket
[14:19:33] oh, a regular raid degraded
[14:20:36] jynus: no, different error from the labsdb crashes
[14:20:52] ah, I was about to ask if it be useful to start a meta-task with "suspicious crashes"
[14:20:57] if it is different, no need
[14:21:43] kormat: give me 1 second to check where db2120 data came (backup?)
[14:21:50] *from
[14:21:55] ok
[14:23:04] it was cloned from db2040 5 years ago
[14:23:16] so not something to worry about, please continue
[14:23:22] *continue, kormat
[14:23:47] it helps me to discard potential corruption on a backup host
[14:24:08] ok cool :)
[14:28:31] DBA, User-Kormat: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Kormat) Host is downtimed for 24h, and recovery from backups has started.
[14:28:34] you should have relatively fresh s7 snapshots
[14:28:55] yeah, looks like 5am this morning
[14:33:40] DBA, User-Kormat: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui) It could be this https://jira.mariadb.org/browse/MDEV-21165 and/or https://jira.mariadb.org/browse/MDEV-19978 although it says it is fixed. I have created: https://jira.mariadb.org/browse/MDEV-23653
[14:34:02] DBA, User-Kormat: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui) p:Triage→Medium
[14:35:53] DBA: Create a database CPU saturation dashboard for codfw - https://phabricator.wikimedia.org/T261868 (Marostegui) p:Triage→Medium To give some context to why we have a specific CPU one for s8 (wikidatawiki): a few months ago wikidatawiki was having serious performance issues, when we compressed Inn...
[14:43:33] LOL @ "MariaDB sustained replica lag" NaN for db2120 (but not complaining about it, it is the best option for such alert)
[14:44:51] I wonder if we should make the mariadb specific alerts depend on the process alert or it would make things more complicated
[14:45:11] things will get Much better when we can use alertmanager for alerting
[14:45:17] ah, we cannot-
[14:45:31] because if one instance of 2 was down, it would not report alerts for the other
[14:45:45] always cursed duality instance - server
[14:49:07] just multiply the lags, if one is 0 or NaN you get a failure :-P
[14:49:10] * volans hides
[14:49:35] #evilalerting
[14:50:22] volans: if lag is NaN * NULL, does a blackhole gets created?
[14:50:30] *get
[14:50:32] most likely
[15:34:23] marostegui: sorry, I'm on vacation, is there anything I can help with right now?
[15:35:54] DBA, User-Kormat: db2120 crashed - https://phabricator.wikimedia.org/T261869 (Kormat) Restore completed, packages upgraded, and machine currently rebooting. Will then do mysql_upgrade, and start replication.
[15:39:07] Amir1: No, all sorted, sorry for the ping. Enjoy your holidays!
[16:44:43] DBA, MediaWiki-extensions-OAuthRateLimiter, Platform Team Initiatives (API Gateway), Platform Team Sprints Board (Sprint 2), and 2 others: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 (Pchelolo) >>! In T258711#6332635, @Kormat wrote: > @...
[16:50:37] marostegui: I think I created the oauth_ratelimit_client_tier in all the correct places :)
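Note on the 12:52-13:06 exchange about iterating over the node list that cumin prints: the "small wrapper python script" idea could start from something like the sketch below. It only illustrates expanding a folded nodeset string into individual hostnames so another tool can loop over them. It is a simplified stand-in, not cumin's or ClusterShell's own code: in practice the `nodeset` binary mentioned in the log (or ClusterShell's `NodeSet` class) would be the more robust choice. The function names and the example hostnames are hypothetical, and the sketch assumes at most one `[...]` group per name.

```python
import re


def split_top_level(folded):
    """Split a folded nodeset on commas that are not inside [...] brackets."""
    parts, depth, current = [], 0, []
    for ch in folded:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


# Hypothetical simplification: exactly one bracket group per name.
_GROUP = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<ranges>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_nodeset(folded):
    """Expand e.g. 'db[2120-2122,2130].codfw.wmnet' into a list of hostnames."""
    hosts = []
    for part in split_top_level(folded.strip()):
        match = _GROUP.match(part)
        if match is None:
            # Plain hostname without a bracket group.
            hosts.append(part)
            continue
        for item in match["ranges"].split(","):
            start, _, end = item.partition("-")
            end = end or start
            width = len(start)  # keep zero padding such as 01, 02, ...
            for number in range(int(start), int(end) + 1):
                hosts.append(
                    f"{match['prefix']}{number:0{width}d}{match['suffix']}"
                )
    return hosts


if __name__ == "__main__":
    # Example input is made up for illustration.
    for host in expand_nodeset("db[2120-2122,2130].codfw.wmnet,pc2007.codfw.wmnet"):
        print(host)
```

A wrapper along these lines would still need to pick the nodeset line out of cumin's stderr output, which is the painful part raised at 12:54; that parsing is deliberately left out here because the exact output format is not shown in the log.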