[00:59:09] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Patch-For-Review, User-Marostegui: DBA review for Echo push notifications tables - https://phabricator.wikimedia.org/T246716 (Mholloway)
[01:00:03] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Patch-For-Review, User-Marostegui: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Mholloway)
[07:33:00] DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (jcrespo) @Kormat do you want to do the finishing cleanup in order to close the ticket? * Confirm the host is healthy, caught up and no error on log. Making sure all monitoring systems wo...
[07:47:59] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (jcrespo) > If we can still keep the backups for a bit (say ... 1 month?), We keep database backups...
[08:23:00] jynus: hi. how do i delete the extra grants? e.g. i've tried `revoke all privileges on *.* from dump@10.64.16.31;`, but it doesn't seem to have removed all of the grants for that user
[08:30:01] DBA, Operations, User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (Kormat) Open→Resolved
[08:42:43] if you want to remove a user completely, DROP USER X@Y; should work I think
[08:49:08] let me know specific cases and I can have a look
[09:02:56] DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used forgerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (QChris) >>! In T255715#6276372, @jcrespo wrote: > We keep database backups of m2 currently for a bit...
[09:21:42] jynus: that works, i'm down to 2 lines now: `diff --color -u ~kormat/s8.grants ~/kormats8.grants.cleanup`
[09:21:55] bizarrely the first one seems to be an exact duplicate
[09:22:21] let me see
[09:23:43] you can remove both
[09:23:53] that is not a grant that we really use on localhost
[09:24:02] that should make it
[09:24:06] any idea why the GRANT PROXY is repeated?
[09:24:35] it is baffling, but I am going to guess there is something internal on the table
[09:24:52] that shows duplicate but it really isn't
[09:25:03] or maybe the role + local privileges
[09:25:11] no idea, but it is not needed, you can get rid of it
[09:26:36] i got rid of the role grant, but i still have the duplicate grant proxy. i've no idea how to get rid of that without also getting rid of the previous one
[09:27:36] you could also drop the user if connected from cumin
[09:27:40] and recreate it
[09:27:56] hopefully that will get rid of the apparent corruption
[09:28:17] make sure the user you log in as
[09:28:27] has with grant option privileges
[09:28:46] WITH GRANT OPTION I mean
[09:29:23] the cumin1001 has it, so that should be enough to drop and recreate the user
[09:29:57] this is not normal, but manuel reported some weirdness with grants + roles on upgrade
[09:30:36] can confirm that if i log in from cumin `show grants for current_user()` does show all privileges with grant option
[09:31:37] show grants should be enough
[09:31:47] but yes
[09:31:48] right, it is
[09:32:25] mentioning because it wouldn't be the first time I have locked myself out of a server :-)
[09:32:42] root@dbstore1005.eqiad.wmnet[(none)]> REVOKE PROXY ON ''@'%' FROM 'root'@'localhost';
[09:32:42] ERROR 1698 (28000): Access denied for user 'root'@'10.64.32.25'
[09:33:01] just drop the user entirely
[09:33:35] that duplication doesn't look good
[09:33:59] I think that made it?
[09:34:04] dropped user, re-created it,
[09:34:12] but grant proxy fails from cumin
[09:34:20] yeah, don't add those
[09:34:24] they are not really needed
[09:34:34] it is ok as it is
[09:34:41] check that you can log in locally
[09:34:44] and we are done
[09:34:48] i can, yeah
[09:37:59] jynus: there's one more thing re: grants,
[09:38:27] the logs show this twice every hour: `Access denied for user 'research'@'10.%' to database 'wikidatawiki'`
[09:39:17] i've compared that user to the one on dbstore1003, and they're identical _except_ for one thing: on dbstore1003 it has a default_role value of `research_role`. on dbstore1005 it has no entry in default_role
[09:41:41] yeah, that was the issue with upgrade
[09:41:44] it removed default roles
[09:41:46] add it
[09:42:30] https://mariadb.com/kb/en/set-default-role/
[09:42:42] done
[09:42:55] I cannot remember why we use roles here for only 1 account
[09:43:08] I think the intention was to move from a shared account to per-individual/application account
[09:43:11] but never materialized
[09:43:39] can you check the errors are gone or check with analytics they can log in?
[09:44:28] they happen at the start of every hour, i'll poke elukey
[09:44:53] we should document the role issue on the upgrade notes
[09:45:03] I will ask manuel for details next week
[09:45:08] +1
[09:45:21] as he encountered it and knew when and why exactly it happened
[09:45:30] we don't use roles on mw hosts
[09:45:51] but they are for certain hosts with multiple users
[09:46:16] maybe we should drop the idea of using them for read only accounts if they are unreliable
[09:46:27] elukey: hi there! can you confirm that your 'research' user can access dbstore1005's s8 db now?
[09:46:54] oh, I think in this case they are not lost, as much as pt-show-grants not saving it
[09:47:03] we should document that anyway
[09:47:09] elukey: (specifically it was failing with `Access denied for user 'research'@'10.%' to database 'wikidatawiki'`)
[09:49:07] DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Kormat) >>! In T256966#6276329, @jcrespo wrote: > * Confirm the host is healthy, caught up and no error on log. Making sure all monitoring systems work as intended (prometheus, tendril, ....
[09:55:18] kormat: I can check yes
[09:56:56] mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]>
[09:56:58] yep!
[09:57:10] \o/
[09:57:36] thanks, everyone involved in solving the issue
[09:58:56] also congratulations for the mysql team, as we are getting closer to migrate to 8.0 instead
[09:59:04] DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Kormat) Open→Resolved @elukey confirms that analytics can query again, and i've removed the temporary grants files from my home dir.
[10:15:00] thanks a lot for all the dbstore1005 work folks!
[10:25:01] elukey: yw :)
[10:27:46] DBA, Operations, User-Kormat, User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (Kormat)
[12:15:27] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[12:16:01] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat) Ported wikidata-database-cpu-saturation, just needed to change the data source for each graph.
[13:18:06] * kormat facepalms
[13:18:18] i spent the last 45mins trying to debug some weird grafana issue
[13:18:41] only to realise that the metric we're using for label_values() in this dashboard doesn't exist in the debian buster exporter
[13:19:46] it happens
[13:20:06] that is why I try sanity checking with someone else when possible
[14:37:33] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[14:39:07] there is a new group called "databases"
[14:39:17] maybe from analytics classification?
[14:39:47] jynus: what are you referring to?
[14:40:15] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=databases&var-shard=All&var-role=All
[14:40:46] yeah, I think it is analytics dbs
[14:40:55] we need some coordination there
[14:41:04] yeaah, it is.
[14:41:13] with analytics I mean
[14:41:16] https://grafana.wikimedia.org/explore?orgId=1&left=%5B%22now-1h%22,%22now%22,%22thanos%22,%7B%22expr%22:%22mysql_up%7Bjob%3D%5C%22mysql-databases%5C%22%7D%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
[14:41:29] maybe creating a new analytics group or putting them on misc
[14:41:51] they have an extra `cluster` label. we could filter that out, for now
[14:42:07] well, I think it is ok for them to be on the monitoring
[14:42:21] but let's talk to analytics to have a proper classification
[14:42:36] "databases" group I think you would agree is too generic :-D
[14:42:42] indeed :)
[14:42:56] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat) mysql-aggregated ported. This was more involved. The steps were: 1. convert from `$dc` source var to `$site` query parameter 1. change the metric used for label_values to one that is prese...
[14:43:05] we can talk to them to rename the job to something more specific
[14:45:37] elukey: i see you haven't had the good sense to leave this channel yet ;)
[14:46:22] I wonder if they were worried to integrate their dbs on our monitoring
[14:46:48] I think we could add them to zarcillo and classify them with all others
[14:46:51] kormat: the data persistence folks are too interesting, you got me :D
[14:46:58] haha
[14:47:21] it wouldn't take away from them owning them
[14:47:30] but we would have all inventoried in the same place
[14:47:35] elukey: i fixed one of our dashboards, which has revealed that you fine folks are running things with job=mysql-databases
[14:48:24] never trust analytics I keep saying that
[14:48:48] the job name is a bit too generic. if you have a look at https://w.wiki/Vyb, you can see we have a number of other mysql-* jobs
[14:48:52] elukey: :D
[14:49:13] elukey: so, maybe you could rename your jobs to mysql-analytics or something like that?
[14:49:29] or mysql-ishouldhavelefthashdatabases
[14:49:30] kormat: sorry for the dumb question but what is "job" in this context?
[14:49:44] prometheus definition identifier
[14:49:45] elukey: prometheus label that's attached to the scrape
[14:50:02] ah there you go, never really tuned it
[14:50:11] we could put them as misc + matomo section
[14:50:15] etc.
[14:50:17] for sure we can change it, I probably added the exporter blindly at the time
[14:50:18] up to you
[14:50:20] elukey: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/analytics.pp#L101
[14:50:53] ah in the prometheus analytics master config, interesting
[14:51:12] what would work best for you? mysql-analytics?
[14:51:39] elukey: the main issue is that with the thanos change
[14:51:39] I can quickly change it now
[14:51:43] everything is on the same source
[14:51:44] jynus had some suggestions above, i honestly don't know enough about our groups to answer myself
[14:52:00] so now your dbs show as group "databases"
[14:52:07] kormat: that is a good excuse to avoid a naming battle! :P
[14:52:18] kormat: please speak up
[14:52:26] groups are arbitrary
[14:52:31] my vote would be for mysql-analytics
[14:52:35] they are only for classification
[14:52:36] +1
[14:52:40] that way it's clear who runs it
[14:53:04] the only thing you should know is that sections and groups
[14:53:13] are perpendicular
[14:53:16] i could also live with mysql-kormatwasmeantome
[14:53:31] e.g. there could be a section s2 on core and a section s2 on labs
[14:53:46] and you know I would like to rename core to mw
[14:53:52] plus other changes more clear
[14:53:55] kormat: that is another great name, now I am on the fence about what's best
[14:53:58] yeah, big +1 from me there re: core=>mw
[14:54:02] elukey: :D
[14:54:27] kormat: the reason I am not implementing it yet is because legacy- I am waiting for the new puppet refactor first
[14:54:52] jynus: that reminds me - i should add this to the puppet refactor task
[14:55:05] kormat: there is also some agreement to be had
[14:55:11] for example, mw may be a large group
[14:55:17] jynus: sure, no doubt
[14:55:18] maybe there should be mw-medata
[14:55:22] *mw-metadata
[14:55:26] mw-content
[14:55:31] mw-parsercache
[14:55:33] etc.
[14:55:44] so not touching the classification yet
[14:55:54] DBA, Operations, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[14:56:05] jynus: i'm not sure if you've seen ^ yet
[14:56:06] +1
[14:56:30] one question
[14:56:38] could you change the visible name of the label
[14:56:42] from shard to section?
[14:56:57] in the dashboard? sure, no problem.
[14:56:58] I think shard is being used internally still
[14:57:06] but we can start changing the cosmetic ones
[14:57:31] done
[14:57:47] https://gerrit.wikimedia.org/r/c/operations/puppet/+/609440
[14:58:04] will add Filippo to it as well to know if it is ok or not
[15:01:43] kormat: to give you context, my team has often lived in a very "custom" set up during the past years, especially when dealing with dbs, and we are now trying to standardize our process/hosts/puppet-config as much as possible, so feel free to reach out anytime you see a horrible thing in puppet
[15:03:37] ok, that's really good to know, thanks!
[15:03:54] kormat: unrelated, but now checking the dashboard
[15:04:03] I am seeing a lot of collection failures
[15:04:48] elukey: currently i'm mostly being horrified by our own puppet code, but i have a cleanup planned. when i do, i'll cover your stuff wherever possible, or at a minimum say "hey, if you switch to this new profile, you can drop most of the boilerplate" etc
[15:05:01] es2025:9104, dbstore1005:13318, ...
[15:05:07] jynus: hmm, yeah
[15:05:23] dbstore1005 could be the issue, but the others don't know
[15:06:25] kormat: +1 great!
[15:07:31] kormat: I am going to restart prometheus exporter on dbstore1005
[15:08:19] jynus: i'm doing the same on es2025
[15:08:30] is es2025 10.4?
[15:08:44] maybe restarting prometheus was forgotten
[15:08:45] it's weird - the metrics say `mysql_up 1`, but last_scrape_error is always 1 whenever i try
[15:08:53] it is, yes
[15:09:04] and there are errors about connecting to the socket in journalctl
[15:09:14] if it works, we can do a safer restart on all hosts affected
[15:09:29] the exporter is a stateless service to safely restart anytime
[15:09:49] yep, that fixed it
[15:09:54] dbstore1005:13318 29, so I think that worked there
[15:10:41] ok, take half of them :-D
[15:10:52] we should have alerting for this
[15:11:29] jynus: i'll take them all
[15:11:36] sure
[15:11:47] we can divide by dc
[15:11:53] or use cumin
[15:13:33] sure was supposed to be "sure?"
[15:13:48] ah :) yes, i'm sure, it's no hassle
[15:14:00] this is not really part of the ticket work
[15:15:20] ah
[15:15:38] elukey: the exporter on analytics1030 is unhappy because it doesn't have access to mysql
[15:15:52] `Jul 03 15:09:40 analytics1030 prometheus-mysqld-exporter[15532]: time="2020-07-03T15:09:40Z" level=error msg="Error pinging mysqld: Error 1045: Access denied for user 'prometheus'@'localhost' (using password: YES)" source="mysqld_exporter.go:252"`
[15:16:32] that is a test node, will fix it thanks
[15:21:59] DBA, Operations, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Kormat)
[15:22:09] DBA, Operations, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Kormat) p:Triage→Medium
[15:22:19] jynus: all should be fixed now
[15:22:43] hmm. except maybe db2088. checking.
[15:23:27] ah. that has a different issue. it's getting access denied from mysqld
[15:24:55] can I have a look?
[15:25:12] please
[15:26:41] grants seem bad
[15:26:47] it is not using unix socket
[15:30:49] but only db2088:13312 is failing, not s1?
[15:31:15] yep
[15:31:23] ah, I see it now
[15:31:30] it is confusing because if you request metrics
[15:31:38] you get a lot of them with mysql_up 0
[15:32:59] yeah, it is the grants
[15:33:15] this may be interesting, in case it happens to you for some reason
[15:34:47] I think it should work now
[15:36:28] yep, looks good
[15:36:38] so the only missing are analytics1030:13306 and db1108:9104
[15:37:11] which is WIP AFAIK
[15:38:02] lot of load on db1118 for a friday
[15:38:15] it could be due to db1089 being down
[15:38:55] jynus: from what i can see, everything except the analytics host is recovering
[15:39:33] yeah, good work
[15:39:53] I am going to tune down db1118 load
[15:40:02] so it is spread more evenly
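
Editor's sketch, not part of the log: the exporter symptom discussed around 15:08 (`mysql_up 1` while last_scrape_error stays at 1) could be spot-checked by parsing the text an exporter serves on its metrics endpoint. This is a rough illustration only. The full metric name `mysql_exporter_last_scrape_error` is an assumption (the log only says "last_scrape_error"), the function names are hypothetical, and the parser assumes sample lines carry no trailing timestamp.

```python
def parse_samples(text):
    """Collect sample values per metric name from Prometheus text output.

    Assumes each sample line looks like `name{labels} value`, with no
    trailing timestamp; comment lines (# HELP / # TYPE) are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        samples.setdefault(name, []).append(float(value))
    return samples


def exporter_is_healthy(text):
    """True only if mysql_up is 1 and the last scrape reported no error.

    A missing metric is treated as unhealthy. The metric name
    mysql_exporter_last_scrape_error is an assumption, see above.
    """
    samples = parse_samples(text)
    up = samples.get("mysql_up", [0.0])
    scrape_err = samples.get("mysql_exporter_last_scrape_error", [1.0])
    return all(v == 1.0 for v in up) and all(v == 0.0 for v in scrape_err)


if __name__ == "__main__":
    # Made-up sample mirroring the state described in the log.
    stuck = "mysql_up 1\nmysql_exporter_last_scrape_error 1\n"
    print(exporter_is_healthy(stuck))
```

Checking both values together matters here because, as noted in the log, `mysql_up` alone looked fine while scrapes were failing.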