[09:26:37] <elukey>	 Emperor: o/ we may have found a way to make sretest2010's reimage work reliably, so hopefully next week we should be able to allow testing on it etc. I know it is probably late for the purchase decision, but it may inform future ones
[09:27:07] <elukey>	 tried to be as quick as possible but that host doesn't like me
[09:29:28] <Emperor>	 elukey: cool, thanks for all the work on this (we have indeed ordered for the Q4 stuff, but there is more Config-J stuff coming early next FY)
[09:49:13] <cezmunsta>	 Should the "MariaDB weights and pooling" dashboard show updates in real time?
[09:50:42] <jynus>	 if you mean: https://noc.wikimedia.org/db.php yes, as this is a dashboard the app creates
[09:50:57] <jynus>	 if it is not there, the app cannot see it
[09:51:25] <jynus>	 You also have: https://noc.wikimedia.org/dbconfig/eqiad.json
[09:51:58] <cezmunsta>	 This one https://grafana.wikimedia.org/goto/bfmtjiofyo2dcd?orgId=1
[09:52:12] <federico3>	 cezmunsta: for some values of "real time", consider a few minutes of delay for prometheus scraping
[09:52:41] <jynus>	 that would depend on the scraper, but prometheus is pull only, so it has at least a 1 minute delay between scrapes
[09:53:58] <cezmunsta>	 09:35 Cookbook sre.mysql.depool (exit_code=0) depool db1172: Upgrading db1172.eqiad.wmnet
[09:54:29] <cezmunsta>	 It doesn't show as depooled on that chart
[09:57:30] <jynus>	 I think that must be a problem with the logic of the graph, it is not sending updates, so it assumes it does not change
[09:57:56] <jynus>	 in reality the host is down, so it should show unknown or print dbctl state
[09:58:42] <federico3>	 looking
[09:58:46] <jynus>	 compare with: https://grafana.wikimedia.org/goto/ffmtk4jxh9zb4c?orgId=1
[09:58:58] <jynus>	 which correctly identifies the host as down
[09:59:11] <jynus>	 or "unpollable"
[09:59:12] <cezmunsta>	 yep
[10:00:47] <jynus>	 but I confirm there is no weight for db1172 on dbctl 
[10:01:59] <jynus>	 plus we would get a lot of mw db errors at https://logstash.wikimedia.org/goto/83874c6c6b848b8236a12c8f470be6f8
[10:02:11] <federico3>	 I see it depooled on https://zarcillo.wikimedia.org/ui/instances 
[10:02:26] <jynus>	 (I am only commenting on this to show different monitoring tools)
[10:04:29] <cezmunsta>	 ty
[10:05:46] <jynus>	 in any case, depooling is the nice way of doing maintenance, the app itself stops sending traffic if a host is detected as down/unresponsive/full of connections
[10:06:07] <jynus>	 but ofc on the fly queries fail when a host crashes
[10:06:23] <federico3>	 the metric is mysql_instance_pooled{hostname="db1172"}, I'll investigate why it does not show depooled
[10:07:24] <federico3>	 ah, there seems to be a stale metric in prometheus
[10:08:41] <jynus>	 luckily, prometheus is "just" a monitoring tool, and not the canonical location, which lives in etcd
[11:06:40] <federico3>	 cezmunsta: are you reimaging hosts during the day?
[11:13:27] <cezmunsta>	 Just the 2 that are both almost done repooling
[12:51:43] <cezmunsta>	 What are the "normal" reasons for seeing a wikidata user with (HY000/2002): Connection timed out ... given that the DB has connections from elsewhere?
[12:57:22] <jynus>	 what do you mean wikidata user?
[12:57:45] <jynus>	 there is a timeout for certain subset of queries
[12:58:07] <jynus>	 if they take too long they are terminated to make sure they don't clog the max_connections
[12:58:39] <cezmunsta>	 "Error connecting to db1214 as user wikiuser2023: :real_connect(): (HY000/2002): Connection timed out" an example
[12:58:44] <jynus>	 there are also other reasons: the session being open for too long without activity
[12:58:53] <jynus>	 because of bad app design
[12:59:57] <jynus>	 last time we calculated there are ~4 million possible combinations of query types (not counting changes of literals) so not all api calls may be optimized to run in under X seconds
[13:00:12] <jynus>	 that's just for mw application
[13:00:57] <jynus>	 but the query limits will be the biggest cause, followed by cold cache due to recent restart or uncommon data queried
[13:02:07] <jynus>	 most queries would be run with max_statement_time=60 (not sure of the current limit tbh)
[13:02:43] <jynus>	 for example, watchlist queries could query millions of rows with complex joins
[13:02:57] <jynus>	 check the performance_schema digest tables for some summary
[13:03:34] <jynus>	 we don't have hard SLOs but as long as it is below 0.1% more or less we consider it normal
[13:05:06] <jynus>	 We serve almost 1 million queries per second, small blips do happen: https://grafana.wikimedia.org/goto/ffmu0pt9c6o74a?orgId=1
[13:06:15] <jynus>	 see for example this spike on the 17th of May: https://grafana.wikimedia.org/goto/afmu0tjhb9nuoe?orgId=1
[13:07:23] <jynus>	 it may look worrying, but then you see that it was 7 errors per second, or an error rate of 0.0007%
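The arithmetic behind that figure, as a minimal sketch. The 1,000,000 queries per second is a rounding of the "almost 1 million" quoted above, and the 0.1% threshold is the informal one mentioned earlier; the function and constant names are made up for illustration.

```python
def error_rate_percent(errors_per_second: float, queries_per_second: float) -> float:
    """Return errors as a percentage of total queries."""
    if queries_per_second <= 0:
        raise ValueError("queries_per_second must be positive")
    return errors_per_second / queries_per_second * 100.0


# Assumed, rounded figures taken from the conversation, not measured values.
APPROX_QPS = 1_000_000
INFORMAL_THRESHOLD_PERCENT = 0.1

spike = error_rate_percent(7, APPROX_QPS)
print(f"{spike:.4f}%")                     # error rate of the spike
print(spike < INFORMAL_THRESHOLD_PERCENT)  # is it within the informal threshold?
```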
[13:10:11] <cezmunsta>	 I was just having a look through errors to see what showed up and reading the stacktrace of one as if it had just connected and failed, as it showed real_connect. The only "gone away" was hours before for that same DB host, but plenty of the real_connect issues
[13:11:24] <jynus>	 yeah, connections are costly, normally they are reused, but if there was an overload on a single host, those would show up
[13:11:48] <jynus>	 please note the app is quite "fat" in terms of db logic, so a db error doesn't necessarily mean a user error
[13:12:10] <cezmunsta>	 ack
[13:13:44] <jynus>	 but it is nice to have all the layers' responses when debugging issues
[13:14:28] <jynus>	 but I have to say, normally they are quite loud, e.g. in 10 seconds, the db gets saturated if it gets 10% slower and starts screaming on monitoring :-D
[13:15:01] <jynus>	 it is nice to get trends, for example, when upgrading os or package, for regressions
[13:17:04] <jynus>	 one thing we need better is an aggregated processlist in a private database, we used to have one, but it got unmaintained, and we need a replacement
[13:18:53] <jynus>	 here is an example of an interesting investigation on db performance after an upgrade: https://phabricator.wikimedia.org/T192551
[13:20:12] <jynus>	 that was the solution, this is the investigation for that: https://phabricator.wikimedia.org/T191996
[13:20:49] <cezmunsta>	 re aggregated processlist, there is infoSchemaProcesslistQuery in the info_schema_processlist collector, but it doesn't look like that is used judging by the metrics explorer
[13:21:03] <jynus>	 no, we cannot enable it for prometheus
[13:21:08] <jynus>	 because we cannot make it public
[13:21:24] <jynus>	 it has to be a private store (we cannot leak user activity)
[13:22:07] <jynus>	 it should be the prometheus collector on a private prometheus or some other process collecting I_S schema and storing somewhere private
[13:23:33] <jynus>	 this is the ticket I filed which I think should have high prio because of how blind it makes us when debugging: https://phabricator.wikimedia.org/T388813
[15:58:21] <federico3>	 cezmunsta: I found the issue with the pooling metric, deploying a fix
[15:59:21] <cezmunsta>	 cool, what was the issue?
[16:01:21] <federico3>	 cezmunsta: should be this https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/commit/4e7bfc0ba89bc3ff358d7b686d916f3be94045b3 - but the proper fix is removing the label 
[16:06:31] <cezmunsta>	 Removing the label?
[16:10:31] <federico3>	 removing the "role" label from the existing gauges and introducing a new one
[16:11:16] <cezmunsta>	 That changes the time series though
[16:13:35] <federico3>	 yep, making the queries in grafana a bit cleaner
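A library-free sketch of how an extra label on a gauge can leave a stale series behind. This illustrates a general failure mode under assumptions; the log does not say it is what the linked zarcillo commit fixed. The `GaugeStore` class and the role values are hypothetical.

```python
class GaugeStore:
    """Toy stand-in for a labelled gauge: one value per distinct label set."""

    def __init__(self):
        self.series = {}

    def set(self, value, **labels):
        # Each distinct label combination is its own time series.
        key = tuple(sorted(labels.items()))
        self.series[key] = value

    def select(self, **matchers):
        # Return every series whose labels include all the matchers.
        wanted = set(matchers.items())
        return {k: v for k, v in self.series.items() if wanted <= set(k)}


# With a "role" label: an update under a different role creates a second
# series instead of overwriting the first one.
with_role = GaugeStore()
with_role.set(1, hostname="db1172", role="replica")
with_role.set(0, hostname="db1172", role="unknown")  # hypothetical role value
stale = with_role.select(hostname="db1172")

# Without the label: every update lands on the same series.
without_role = GaugeStore()
without_role.set(1, hostname="db1172")
without_role.set(0, hostname="db1172")
fresh = without_role.select(hostname="db1172")

print(sorted(stale.values()))  # two series, one still reporting pooled=1
print(list(fresh.values()))    # one series with the latest value
```

Dropping a label also changes the identity of the series, which is the trade-off raised just above: existing dashboard queries that match on the old label set need updating.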
[16:37:30] <cezmunsta>	 federico3: that issue with has_replicas seems a somewhat pressing issue, as unless I misread it causes SET SESSION sql_log_bin=0 along with SQL to be executed. Is there a ticket for the FIXME?
[16:38:50] <cezmunsta>	 ah nvm it doesn't :)
[16:39:20] <cezmunsta>	 Time to /sbin/poweroff :D Enjoy the w/e o/
[16:40:01] <federico3>	 you mean https://gitlab.wikimedia.org/repos/data_persistence/dbtools/auto_schema/-/merge_requests/13#note_209498 ?
[16:40:58] <federico3>	 yes, nothing urgent, it's been running ok for years :)
[16:44:35] <federico3>	 (and yes, enjoy the w/e o/)