[00:59:09] <wikibugs>	 DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Patch-For-Review, User-Marostegui: DBA review for Echo push notifications tables - https://phabricator.wikimedia.org/T246716 (Mholloway)
[01:00:03] <wikibugs>	 DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Patch-For-Review, User-Marostegui: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Mholloway)
[07:33:00] <wikibugs>	 DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (jcrespo) @Kormat do you want to do the finishing cleanup in order to close the ticket?  * Confirm the host is healthy, caught up and no error on log. Making sure all monitoring systems wo...
[07:47:59] <wikibugs>	 DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used for gerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (jcrespo) > If we can still keep the backups for a bit (say ... 1 month?),  We keep database backups...
[08:23:00] <kormat>	 jynus: hi. how do i delete the extra grants? e.g. i've tried `revoke all privileges on *.* from dump@10.64.16.31;`, but it doesn't seem to have removed all of the grants for that user
[08:30:01] <wikibugs>	 DBA, Operations, User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (Kormat) Open→Resolved
[08:42:43] <jynus>	 if you want to remove a user completely, DROP USER X@Y; should work I think
[08:49:08] <jynus>	 let me know specific cases and I can have a look
[09:02:56] <wikibugs>	 DBA, Gerrit, Patch-For-Review: Make sure both `reviewdb-test` (used for gerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715 (QChris) >>! In T255715#6276372, @jcrespo wrote: > We keep database backups of m2 currently for a bit...
[09:21:42] <kormat>	 jynus: that works, i'm down to 2 lines now: `diff --color -u ~kormat/s8.grants ~/kormats8.grants.cleanup`
[09:21:55] <kormat>	 bizarrely the first one seems to be an exact duplicate
[09:22:21] <jynus>	 let me see
[09:23:43] <jynus>	 you can remove both
[09:23:53] <jynus>	 that is not a grant that we really use on localhost
[09:24:02] <jynus>	 that should make it
[09:24:06] <kormat>	 any idea why the GRANT PROXY is repeated?
[09:24:35] <jynus>	 it is baffling, but I am going to guess there is something internal on the table
[09:24:52] <jynus>	 that shows duplicate but it really isn't
[09:25:03] <jynus>	 or maybe the role + local privileges
[09:25:11] <jynus>	 no idea, but it is not needed, you can get rid of it
[09:26:36] <kormat>	 i got rid of the role grant, but i still have the duplicate grant proxy. i've no idea how to get rid of that without also getting rid of the previous one
[09:27:36] <jynus>	 you could also drop the user if connected from cumin
[09:27:40] <jynus>	 and recreate it
[09:27:56] <jynus>	 hopefully that will get rid of the apparent corruption
[09:28:17] <jynus>	 make sure the user you log in as
[09:28:27] <jynus>	 has with grant option privileges
[09:28:46] <jynus>	 WITH GRANT OPTION I mean
[09:29:23] <jynus>	 the cumin1001 has it, so that should be enough to drop and recreate the user
[09:29:57] <jynus>	 this is not normal, but manuel reported some weirdness with grants + roles on upgrade
[09:30:36] <kormat>	 can confirm that if i log in from cumin `show grants for current_user()` does show all privileges with grant option
[09:31:37] <jynus>	 show grants should be enough
[09:31:47] <jynus>	 but yes
[09:31:48] <kormat>	 right, it is
[09:32:25] <jynus>	 mentioning because it wouldn't be the first time I have locked myself out of a server :-)
[09:32:42] <kormat>	 root@dbstore1005.eqiad.wmnet[(none)]> REVOKE PROXY ON ''@'%' FROM 'root'@'localhost';
[09:32:42] <kormat>	 ERROR 1698 (28000): Access denied for user 'root'@'10.64.32.25'
[09:33:01] <jynus>	 just drop the user entirely
[09:33:35] <jynus>	 that duplication doesn't look good
[09:33:59] <jynus>	 I think that made it?
[09:34:04] <kormat>	 dropped user, re-created it,
[09:34:12] <kormat>	 but grant proxy fails from cumin
[09:34:20] <jynus>	 yeah, don't add those
[09:34:24] <jynus>	 they are not really needed
[09:34:34] <jynus>	 it is ok as it is 
[09:34:41] <jynus>	 check that you can log in locally
[09:34:44] <jynus>	 and we are done
[09:34:48] <kormat>	 i can, yeah
[09:37:59] <kormat>	 jynus: there's one more thing re: grants,
[09:38:27] <kormat>	 the logs show this twice every hour: `Access denied for user 'research'@'10.%' to database 'wikidatawiki'`
[09:39:17] <kormat>	 i've compared that user to the one on dbstore1003, and they're identical _except_  for one thing: on dbstore1003 it has a default_role value of `research_role`. on dbstore1005 it has no entry in default_role
[09:41:41] <jynus>	 yeah, that was the issue with upgrade
[09:41:44] <jynus>	 it removed default roles
[09:41:46] <jynus>	 add it
[09:42:30] <jynus>	 https://mariadb.com/kb/en/set-default-role/
[09:42:42] <kormat>	 done
[09:42:55] <jynus>	 I cannot remember why we use roles here for only 1 account
[09:43:08] <jynus>	 I think the intention was to move from a shared account to per-individual/application account
[09:43:11] <jynus>	 but never materialized
[09:43:39] <jynus>	 can you check the errors are gone or check with analytics they can log in?
[09:44:28] <kormat>	 they happen at the start of every hour, i'll poke elukey
[09:44:53] <jynus>	 we should document the role issue on the upgrade notes
[09:45:03] <jynus>	 I will ask manuel for details next week
[09:45:08] <kormat>	 +1
[09:45:21] <jynus>	 as he encountered it and knew when and why exactly it happened
[09:45:30] <jynus>	 we don't use roles on mw hosts
[09:45:51] <jynus>	 but they are for certain hosts with multiple users
[09:46:16] <jynus>	 maybe we should drop the idea of using them for read only accounts if they are unreliable
[09:46:27] <kormat>	 elukey: hi there! can you confirm that your 'research' user can access dbstore1005's s8 db now?
[09:46:54] <jynus>	 oh, I think in this case they are not lost, as much as pt-show-grants not saving it
[09:47:03] <jynus>	 we should document that anyway
[09:47:09] <kormat>	 elukey: (specifically it was failing with `Access denied for user 'research'@'10.%' to database 'wikidatawiki'`)
[09:49:07] <wikibugs>	 DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Kormat) >>! In T256966#6276329, @jcrespo wrote: > * Confirm the host is healthy, caught up and no error on log. Making sure all monitoring systems work as intended (prometheus, tendril, ....
[09:55:18] <elukey>	 kormat: I can check yes
[09:56:56] <elukey>	 mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]>
[09:56:58] <elukey>	 yep!
[09:57:10] <kormat>	 \o/
[09:57:36] <jynus>	 thanks, everyone involved in solving the issue
[09:58:56] <jynus>	 also congratulations for the mysql team, as we are getting closer to migrate to 8.0 instead
[09:59:04] <wikibugs>	 DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Kormat) Open→Resolved @elukey confirms that analytics can query again, and i've removed the temporary grants files from my home dir.
[10:15:00] <elukey>	 thanks a lot for all the dbstore1005 work folks!
[10:25:01] <kormat>	 elukey: yw :)
[10:27:46] <wikibugs>	 DBA, Operations, User-Kormat, User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (Kormat)
[12:15:27] <wikibugs>	 DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[12:16:01] <wikibugs>	 DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat) Ported wikidata-database-cpu-saturation, just needed to change the data source for each graph.
[13:18:06] * kormat facepalms
[13:18:18] <kormat>	 i spent the last 45mins trying to debug some weird grafana issue
[13:18:41] <kormat>	 only to realise that the metric we're using for label_values() in this dashboard doesn't exist in the debian buster exporter
[13:19:46] <jynus>	 it happens
[13:20:06] <jynus>	 that is why I try sanity checking with someone else when possible
[14:37:33] <wikibugs>	 DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[14:39:07] <jynus>	 there is a new group called "databases"
[14:39:17] <jynus>	 maybe from analytics classification?
[14:39:47] <kormat>	 jynus: what are you referring to?
[14:40:15] <jynus>	 https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=databases&var-shard=All&var-role=All
[14:40:46] <jynus>	 yeah, I think it is analytics dbs
[14:40:55] <jynus>	 we need some coordination there
[14:41:04] <kormat>	 yeaah, it is.
[14:41:13] <jynus>	 with analytics I mean
[14:41:16] <kormat>	 https://grafana.wikimedia.org/explore?orgId=1&left=%5B%22now-1h%22,%22now%22,%22thanos%22,%7B%22expr%22:%22mysql_up%7Bjob%3D%5C%22mysql-databases%5C%22%7D%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
[14:41:29] <jynus>	 maybe creating a new analytics group or putting them on misc
[14:41:51] <kormat>	 they have an extra `cluster` label. we could filter that out, for now
[14:42:07] <jynus>	 well, I think it is ok for them to be on the monitoring
[14:42:21] <jynus>	 but let's talk to analytics to have a proper classification
[14:42:36] <jynus>	 "databases" group I think you would agree is too generic :-D
[14:42:42] <kormat>	 indeed :)
[14:42:56] <wikibugs>	 DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat) mysql-aggregated ported. This was more involved. The steps were: 1. convert from `$dc` source var to `$site` query parameter 1. change the metric used for label_values to one that is prese...
[14:43:05] <jynus>	 we can talk to them to rename the job to something more specific
[14:45:37] <kormat>	 elukey: i see you haven't had the good sense to leave this channel yet ;)
[14:46:22] <jynus>	 I wonder if they were worried to integrate their dbs on our monitoring
[14:46:48] <jynus>	 I think we could add them to zarcillo and classify them with all others
[14:46:51] <elukey>	 kormat: the data persistence folks are too interesting, you got me :D
[14:46:58] <kormat>	 haha
[14:47:21] <jynus>	 it wouldn't take away from them owning them
[14:47:30] <jynus>	 but we would have all inventoried in the same place
[14:47:35] <kormat>	 elukey: i fixed one of our dashboards, which has revealed that you fine folks are running things with job=mysql-databases
[14:48:24] <elukey>	 never trust analytics I keep saying that
[14:48:48] <kormat>	 the job name is a bit too generic. if you have a look at https://w.wiki/Vyb, you can see we have a number of other mysql-* jobs
[14:48:52] <kormat>	 elukey: :D
[14:49:13] <kormat>	 elukey: so, maybe you could rename your jobs to mysql-analytics or something like that?
[14:49:29] <kormat>	 or mysql-ishouldhavelefthashdatabases
[14:49:30] <elukey>	 kormat: sorry for the dumb question but what is "job" in this context?
[14:49:44] <jynus>	 prometheus definition identifier
[14:49:45] <kormat>	 elukey: prometheus label that's attached to the scrape
[14:50:02] <elukey>	 ah there you go, never really tuned it 
[14:50:11] <jynus>	 we could put them as misc + matomo section
[14:50:15] <jynus>	 etc.
[14:50:17] <elukey>	 for sure we can change it, I probably added the exporter blindly at the time
[14:50:18] <jynus>	 up to you
[14:50:20] <kormat>	 elukey: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/analytics.pp#L101
[14:50:53] <elukey>	 ah in the prometheus analytics master config, interesting
[14:51:12] <elukey>	 what would work best for you? mysql-analytics?
[14:51:39] <jynus>	 elukey: the main issue is that with the thanos change
[14:51:39] <elukey>	 I can quickly change it now
[14:51:43] <jynus>	 everything is on the same source
[14:51:44] <kormat>	 jynus had some suggestions above, i honestly don't know enough about our groups to answer myself
[14:52:00] <jynus>	 so now your dbs show as group "databases"
[14:52:07] <elukey>	 kormat: that is a good excuse to avoid a naming battle! :P
[14:52:18] <jynus>	 kormat: please speak up
[14:52:26] <jynus>	 groups are arbitrary
[14:52:31] <kormat>	 my vote would be for mysql-analytics
[14:52:35] <jynus>	 they are only for classification
[14:52:36] <elukey>	 +1
[14:52:40] <kormat>	 that way it's clear who runs it
[14:53:04] <jynus>	 the only thing you should know is that sections and groups
[14:53:13] <jynus>	 are perpendicular
[14:53:16] <kormat>	 i could also live with mysql-kormatwasmeantome
[14:53:31] <jynus>	 e.g. there could be a section s2 on core and a section s2 on labs
[14:53:46] <jynus>	 and you know I would like to rename core to mw
[14:53:52] <jynus>	 plus other changes more clear
[14:53:55] <elukey>	 kormat: that is another great name, now I am on the fence about what's best
[14:53:58] <kormat>	 yeah, big +1 from me there re: core=>mw
[14:54:02] <kormat>	 elukey: :D
[14:54:27] <jynus>	 kormat: the reason I am not implementing it yet is because of legacy - I am waiting for the new puppet refactor first
[14:54:52] <kormat>	 jynus: that reminds me - i should add this to the puppet refactor task
[14:55:05] <jynus>	 kormat: there is also some agreement to be had
[14:55:11] <jynus>	 for example, mw may be a large group
[14:55:17] <kormat>	 jynus: sure, no doubt
[14:55:18] <jynus>	 maybe there should be mw-medata
[14:55:22] <jynus>	 *mw-metadata
[14:55:26] <jynus>	 mw-content
[14:55:31] <jynus>	 mw-parsercache
[14:55:33] <jynus>	 etc.
[14:55:44] <jynus>	 so not touching the classification yet
[14:55:54] <wikibugs>	 DBA, Operations, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[14:56:05] <kormat>	 jynus: i'm not sure if you've seen ^ yet
[14:56:06] <jynus>	 +1
[14:56:30] <jynus>	 one question
[14:56:38] <jynus>	 could you change the visible name of the label
[14:56:42] <jynus>	 from shard to section?
[14:56:57] <kormat>	 in the dashboard? sure, no problem.
[14:56:58] <jynus>	 I think shard is being used internally still
[14:57:06] <jynus>	 but we can start changing the cosmetic ones
[14:57:31] <kormat>	 done
[14:57:47] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/609440
[14:58:04] <elukey>	 will add Filippo to it as well to know if it is ok or not
[15:01:43] <elukey>	 kormat: to give you context, my team has often lived in a very "custom" set up during the past years, especially when dealing with dbs, and we are now trying to standardize our process/hosts/puppet-config as much as possible, so feel free to reach out anytime you see a horrible thing in puppet
[15:03:37] <kormat>	 ok, that's really good to know, thanks!
[15:03:54] <jynus>	 kormat: unrelated, but now checking the dashboard
[15:04:03] <jynus>	 I am seeing a lot of collection failures
[15:04:48] <kormat>	 elukey: currently i'm mostly being horrified by our own puppet code, but i have a cleanup planned. when i do, i'll cover your stuff wherever possible, or at a minimum say "hey, if you switch to this new profile, you can drop most of the boilerplate" etc
[15:05:01] <jynus>	 es2025:9104, dbstore1005:13318, ...
[15:05:07] <kormat>	 jynus: hmm, yeah
[15:05:23] <jynus>	 dbstore1005 could be the issue, but the others I don't know
[15:06:25] <elukey>	 kormat: +1 great!
[15:07:31] <jynus>	 kormat: I am going to restart prometheus exporter on dbstore1005
[15:08:19] <kormat>	 jynus: i'm doing the same on es2025
[15:08:30] <jynus>	 is es2025 10.4?
[15:08:44] <jynus>	 maybe restarting prometheus was forgotten
[15:08:45] <kormat>	 it's weird - the metrics say `mysql_up 1`, but last_scrape_error is always 1 whenever i try
[15:08:53] <kormat>	 it is, yes
[15:09:04] <kormat>	 and there are errors about connecting to the socket in journalctl
[15:09:14] <jynus>	 if it works, we can do a safer restart on all hosts affected
[15:09:29] <jynus>	 the exporter is a stateless service to safely restart anytime
[15:09:49] <kormat>	 yep, that fixed it
[15:09:54] <jynus>	 dbstore1005:13318	29, so I think that worked there
[15:10:41] <jynus>	 ok, take half of them :-D
[15:10:52] <kormat>	 we should have alerting for this
[15:11:29] <kormat>	 jynus: i'll take them all
[15:11:36] <jynus>	 sure
[15:11:47] <jynus>	 we can divide by dc
[15:11:53] <jynus>	 or use cumin
[15:13:33] <jynus>	 sure was supposed to be "sure?"
[15:13:48] <kormat>	 ah :) yes, i'm sure, it's no hassle
[15:14:00] <jynus>	 this is not really part of the ticket work
[15:15:20] <kormat>	 ah
[15:15:38] <kormat>	 elukey: the exporter on analytics1030 is unhappy because it doesn't have access to mysql
[15:15:52] <kormat>	 `Jul 03 15:09:40 analytics1030 prometheus-mysqld-exporter[15532]: time="2020-07-03T15:09:40Z" level=error msg="Error pinging mysqld: Error 1045: Access denied for user 'prometheus'@'localhost' (using password: YES)" source="mysqld_exporter.go:252"`
[15:16:32] <elukey>	 that is a test node, will fix it thanks
[15:21:59] <wikibugs>	 DBA, Operations, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Kormat)
[15:22:09] <wikibugs>	 DBA, Operations, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Kormat) p:Triage→Medium
[15:22:19] <kormat>	 jynus: all should be fixed now
[15:22:43] <kormat>	 hmm. except maybe db2088. checking.
[15:23:27] <kormat>	 ah. that has a different issue. it's getting access denied from mysqld
[15:24:55] <jynus>	 can I have a look?
[15:25:12] <kormat>	 please
[15:26:41] <jynus>	 grants seem bad
[15:26:47] <jynus>	 it is not using unix socket
[15:30:49] <jynus>	 but only db2088:13312 is failing, not s1?
[15:31:15] <kormat>	 yep
[15:31:23] <jynus>	 ah, I see it now
[15:31:30] <jynus>	 it is confusing because if you request metrics
[15:31:38] <jynus>	 you get a lot of them with mysql_up 0
[15:32:59] <jynus>	 yeah, it is the grants
[15:33:15] <jynus>	 this may be interesting, in case it happens to you for some reason
[15:34:47] <jynus>	 I think it should work now
[15:36:28] <kormat>	 yep, looks good
[15:36:38] <jynus>	 so the only missing are analytics1030:13306 and db1108:9104
[15:37:11] <jynus>	 which is WIP AFAIK
[15:38:02] <jynus>	 lot of load on db1118 for a friday
[15:38:15] <jynus>	 it could be due to db1089 being down
[15:38:55] <kormat>	 jynus: from what i can see, everything except the analytics host is recovering
[15:39:33] <jynus>	 yeah, good work
[15:39:53] <jynus>	 I am going to tune down db1118 load
[15:40:02] <jynus>	 so it is spread more evenly