[09:26:37] Emperor: o/ we may have found a way to make sretest2010's reimage work reliably, so hopefully next week we should be able to allow testing on it etc. I know it is probably late for the purchase decision, but it may inform future ones
[09:27:07] tried to be as quick as possible but that host doesn't like me
[09:29:28] elukey: cool, thanks for all the work on this (we have indeed ordered for the Q4 stuff, but there is more Config-J stuff coming early next FY)
[09:49:13] Should the "MariaDB weights and pooling" dashboard show updates in real time?
[09:50:42] if you mean: https://noc.wikimedia.org/db.php yes, as this is a dashboard the app creates
[09:50:57] if it is not there, the app cannot see it
[09:51:25] You also have: https://noc.wikimedia.org/dbconfig/eqiad.json
[09:51:58] This one https://grafana.wikimedia.org/goto/bfmtjiofyo2dcd?orgId=1
[09:52:12] cezmunsta: for some values of "real time", consider a few minutes of delay for prometheus scraping
[09:52:41] that would depend on the scraper, but prometheus is pull-based, so it has at least a 1 minute delay between scrapes
[09:53:58] 09:35 Cookbook sre.mysql.depool (exit_code=0) depool db1172: Upgrading db1172.eqiad.wmnet
[09:54:29] It doesn't show as depooled on that chart
[09:57:30] I think that must be a problem with the logic of the graph: the host is not sending updates, so the graph assumes the value has not changed
[09:57:56] in reality the host is down, so it should show unknown or print dbctl state
[09:58:42] looking
[09:58:46] compare with: https://grafana.wikimedia.org/goto/ffmtk4jxh9zb4c?orgId=1
[09:58:58] which correctly identifies the host as down
[09:59:11] or "unpollable"
[09:59:12] yep
[10:00:47] but I confirm there is no weight for db1172 on dbctl
[10:01:59] plus we would get a lot of mw db errors at https://logstash.wikimedia.org/goto/83874c6c6b848b8236a12c8f470be6f8
[10:02:11] I see it depooled on https://zarcillo.wikimedia.org/ui/instances
[10:02:26] (I am only commenting on this to show different monitoring tools)
[10:04:29] ty
[10:05:46] in any case, depooling is the nice way of doing maintenance, the app itself stops sending traffic if a host is detected as down/unresponsive/full of connections
[10:06:07] but ofc in-flight queries fail when a host crashes
[10:06:23] the metric is mysql_instance_pooled{hostname="db1172"}, I'll investigate why it does not show depooled
[10:07:24] ah, there seems to be a stale metric in prometheus
[10:08:41] luckily, prometheus is "just" a monitoring tool, and not the canonical source, which lives in etcd
[11:06:40] cezmunsta: are you reimaging hosts during the day?
[11:13:27] Just the 2 that are both almost done repooling
[12:51:43] What are the "normal" reasons for seeing a wikidata user with (HY000/2002): Connection timed out ... given that the DB has connections from elsewhere?
[12:57:22] what do you mean wikidata user?
[12:57:45] there is a timeout for a certain subset of queries
[12:58:07] if they take too long they are terminated to make sure they don't clog the max_connections
[12:58:39] "Error connecting to db1214 as user wikiuser2023: :real_connect(): (HY000/2002): Connection timed out" an example
[12:58:44] there are also other reasons - the session being open for too long without activity
[12:58:53] because of bad app design
[12:59:57] last time we calculated there are ~4 million possible combinations of query types (not counting changes of literals) so not all api calls may be optimized to run in under X seconds
[13:00:12] that's just for the mw application
[13:00:57] but the query limits will be the biggest cause, followed by cold cache due to recent restart or uncommon data queried
[13:02:07] most queries would be run with max_statement_time=60 (not sure of the current limit tbh)
[13:02:43] for example, watchlist queries could query millions of rows with complex joins
[13:02:57] check the performance_schema digest tables for some summary
[13:03:34] we don't have hard SLOs but as long as it is below 0.1% more or less we consider it normal
[13:05:06] We serve almost 1 million queries per second, small blips do happen: https://grafana.wikimedia.org/goto/ffmu0pt9c6o74a?orgId=1
[13:06:15] see for example this spike on the 17th of May: https://grafana.wikimedia.org/goto/afmu0tjhb9nuoe?orgId=1
[13:07:23] it may look worrying, but then you see that it was 7 errors per second, or a 0.0007% error rate
[13:10:11] I was just having a look through errors to see what showed up, and the stacktrace of one read as if it had just connected and failed, as it showed real_connect. The only "gone away" was hours before for that same DB host, but there were plenty of the real_connect issues
[13:11:24] yeah, connections are costly, normally they are reused, but if there was an overload on a single host, those would show up
[13:11:48] please note the app is quite "fat" in terms of db logic, so a db error doesn't necessarily mean a user error
[13:12:10] ack
[13:13:44] but it is nice to have all the layers' responses when debugging issues
[13:14:28] but I have to say, normally they are quite loud, e.g. in 10 seconds, the db gets saturated if it gets 10% slower and starts screaming on monitoring :-D
[13:15:01] it is nice to get trends, for example, when upgrading os or package, for regressions
[13:17:04] one thing we need a better solution for is an aggregated processlist in a private database, we used to have one, but it got unmaintained, and we need a replacement
[13:18:53] here is an example of an interesting investigation on db performance after an upgrade: https://phabricator.wikimedia.org/T192551
[13:20:12] that was the solution, this is the investigation for that: https://phabricator.wikimedia.org/T191996
[13:20:49] re aggregated processlist, there is infoSchemaProcesslistQuery in the info_schema_processlist collector, but it doesn't look like that is used judging by the metrics explorer
[13:21:03] no, we cannot enable it for prometheus
[13:21:08] because we cannot make it public
[13:21:24] it has to be a private store (we cannot leak user activity)
[13:22:07] it should be the prometheus collector on a private prometheus or some other process collecting I_S data and storing it somewhere private
[13:23:33] this is the ticket I filed which I think should have high prio because of how blind it makes us when debugging: https://phabricator.wikimedia.org/T388813
[15:58:21] cezmunsta: I found the issue with the pooling metric, deploying a fix
[15:59:21] cool, what was the issue?
[16:01:21] cezmunsta: should be this https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/commit/4e7bfc0ba89bc3ff358d7b686d916f3be94045b3 - but the proper fix is removing the label
[16:06:31] Removing the label?
[16:10:31] removing the "role" label from the existing gauges and introducing a new one
[16:11:16] That changes the time series though
[16:13:35] yep, making the queries in grafana a bit cleaner
[16:37:30] federico3: that issue with has_replicas seems somewhat pressing, as unless I misread, it causes SET SESSION sql_log_bin=0 along with SQL to be executed. Is there a ticket for the FIXME?
[16:38:50] ah nvm it doesn't :)
[16:39:20] Time to /sbin/poweroff :D Enjoy the w/e o/
[16:40:01] you mean https://gitlab.wikimedia.org/repos/data_persistence/dbtools/auto_schema/-/merge_requests/13#note_209498 ?
[16:40:58] yes, nothing urgent, it's been running ok for years :)
[16:44:35] (and yes, enjoy the w/e o/)
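
Note on the error-rate figures quoted between 13:03 and 13:07: the arithmetic can be checked with a minimal sketch like the one below. It assumes the round numbers from the chat (about 1 million queries per second, a spike of 7 errors per second, and an informal "normal" ceiling of roughly 0.1%). The function and constant names are hypothetical and are not taken from any Wikimedia tooling.

```python
# Hypothetical helper names; the figures are the round numbers quoted in the chat.
NORMAL_ERROR_RATE_PERCENT = 0.1  # informal ceiling ("below 0.1% more or less"), not a hard SLO


def error_rate_percent(errors_per_second: float, queries_per_second: float) -> float:
    """Return the share of failing queries as a percentage of all queries."""
    if queries_per_second <= 0:
        raise ValueError("queries_per_second must be positive")
    return 100.0 * errors_per_second / queries_per_second


def is_within_normal(errors_per_second: float, queries_per_second: float) -> bool:
    """True when the error rate stays below the informal ceiling."""
    return error_rate_percent(errors_per_second, queries_per_second) < NORMAL_ERROR_RATE_PERCENT


if __name__ == "__main__":
    # Spike discussed in the log: 7 errors/s against roughly 1,000,000 queries/s.
    rate = error_rate_percent(7, 1_000_000)
    print(f"{rate:.4f}% error rate, within normal: {is_within_normal(7, 1_000_000)}")
```

With the same query volume, the informal 0.1% ceiling would only be reached at about 1,000 errors per second, which is why a 7 errors/s spike is treated as a blip.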