[08:16:18] I was looking at wfLogDBError and saw a spike in connection errors between 20:27:00 and 20:27:05 yesterday: https://logstash.wikimedia.org/#dashboard/temp/AVRb73y_CsPTNesW7mAG
[08:16:32] that reflects a spike in Aborted_connects on tendril
[08:16:43] from db1022
[08:17:23] just FYI, as a possible thread-pool effect
[09:03:27] $shard == 'es1' ?
[09:03:41] ah, for off
[09:03:44] ok
[09:04:55] may I ask for a comment there? it would simplify understanding it (again, thinking of non-DBAs)
[09:05:01] ok
[09:05:43] I'm improving the actual output in the my.cnf file, there were too many newlines <% %> vs <% -%>
[09:07:02] we should also try the effects of semi-sync on topology changes (I think it should work transparently), but that is not part of puppet
[09:08:33] if we have B and C as slaves of A with semi-sync and we move C to be a slave of B (A->B->C), then the B->C replication will not have semi-sync because B doesn't have the master plugin
[09:08:36] loaded
[09:08:45] yes
[09:09:10] in a perfect world, we would also disable it automatically
[09:09:20] +1
[09:11:13] also, once pooling is on conftool, certain checks can be disabled if a server is depooled
[09:46:08] three new hosts in tendril: https://tendril.wikimedia.org/host/view/holmium.wikimedia.org/3306
[09:50:08] nice
[11:13:24] about to reimage db1038, backups still ongoing
[11:14:02] ack, saw it in -ops
[11:33:24] s3 clonings are very slow due to the large number of files involved
[12:02:58] * volans lunch
[12:39:42] FYI, for when you come back: sanitarium instances are "getting stuck" on updates. Restart seems to solve it. Upgrade needed?
[12:42:48] restart of the mysql instance for the shard that is "stuck"?
[12:43:16] yes
[12:43:28] ok
[12:43:31] it happened to me for s1, s2, and s3, all on different days
[12:43:49] I have preventively done it for all shards now, before anybody notices it
[12:44:15] (ssssh, it is a secret, do not tell anyone)
[12:44:34] or they will make me announce it and wait and all that stuff
[12:44:36] lol :D
[12:44:38] sure
[12:45:11] it is weird, I had never seen it without a hardware or OS issue
[12:45:53] the query is in the "update" stage, but it doesn't fail or execute anything
[12:46:03] just gets blocked there
[12:46:32] you just do a stop or kill the replica query before?
[12:46:42] there's memcached running on db1011, but that doesn't seem to get installed via role::mariadb::tendril, what is that used for?
[12:47:00] I have to kill it because shutdown gets blocked by stop slave
[12:47:08] right, makes sense
[12:47:11] and of course, being stuck, it does not finish
[12:47:46] I try mysqldump shutdown first, maybe that helps a bit starting the stop process
[12:47:57] *mysqladmin shutdown
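For context, a rough sketch of the unstick sequence described above, assuming a MariaDB replica whose SQL thread is wedged in the "update" stage; the thread id is a placeholder that would come from the processlist on the affected sanitarium instance, and the exact order (shutdown attempt first vs. kill first) is whatever gets the instance down cleanly:

```sql
-- Locate the stuck replication SQL thread: it shows up as "system user"
-- sitting in the "update" stage without making any progress.
SHOW PROCESSLIST;

-- STOP SLAVE tends to hang on the stuck thread, so kill the stuck
-- replication query first; 12345 is a placeholder for the real thread id.
KILL QUERY 12345;
STOP SLAVE;

-- Then shut the instance down cleanly from the shell:
--   mysqladmin shutdown
```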
[12:48:41] moritzm: interesting, STAT curr_connections 10, but no items, all counters are zero, for example: STAT curr_items 0
[12:49:29] moritzm, I remember sean mentioning some kind of cache for tendril, but I am not sure if he referred to just varnish or some application-level one
[12:49:51] maybe it's used as a queue from tendril? because as a cache... it's empty :D
[12:50:31] if it is used, it may be used by neon, tendril itself (via a federated table) or maintenance on terbium
[12:50:36] but I would have to check
[12:51:43] yeah, looks barely used at all: STAT bytes_written 1033 with STAT uptime 23708849 :)
[12:52:08] maybe he had plans
[12:52:21] or maybe we broke something on it
[12:53:32] it's in one line of tendril's code
[12:53:46] looks like it was intended to be used as a cache for table schemas
[12:53:51] https://github.com/wikimedia/operations-software-tendril/search?utf8=%E2%9C%93&q=memcache
[12:54:39] moritzm, I assume you are asking about an upgrade, or about the firewall?
[12:54:52] if upgrade, go on; if firewall, allow from neon
[12:55:20] ok, I will allow access from neon, then :-)
[12:55:37] maybe terbium? let me check
[12:55:39] I just had a brief WTF since I couldn't find it in the puppetry :-)
[12:56:04] ah, it is *not* in puppet
[12:56:07] typical
[12:56:12] that's bad... :(
[12:56:13] file a bug
[12:57:33] to be fair, tendril is one of those things where being broken for a few minutes is not a huge deal, unlike production dbs
[13:00:14] ok
[13:25:42] volans: https://tendril.wikimedia.org/report/innodb
[13:26:20] Re: parameter tuning
[13:27:05] if it loads
[13:27:28] it takes a while
[13:27:53] because it is not using memcached, duh!
[13:28:13] eheheh
[13:51:17] how do we want to deploy https://gerrit.wikimedia.org/r/#/c/285649/ ?
[13:51:29] option A) disable puppet, depool and restart a slave
[13:51:52] option B) merge, let puppet run, use db1038 or another slave to test a restart; in case of issues, rollback
[13:52:16] given that a puppet run cannot "hurt", I would vote for B
[13:53:04] (in the last revision I've just applied to production-es.my.cnf the same change as production.my.cnf)
[13:53:05] yeah, I will do db1038 as soon as the copy finishes
[13:53:58] it will need about half an hour more
[13:54:35] ok, I'll look for another one too so we test a simple restart and a reimage
[13:55:11] go for one of the pending restarts/TLS/upgrade
[13:57:55] yep
[14:08:07] API s2 servers getting overloaded
[14:08:39] https://logstash.wikimedia.org/#dashboard/temp/AVRdMxDdCsPTNesWBMIK
[14:08:52] both of them? one is pooled, the other is not
[14:10:06] both are
[14:10:30] 'db1054' => 0, only affects normal traffic
[14:10:41] yes, I was referring to normal additional traffic
[14:11:11] there is a spike in threads connected
[14:11:27] yep, until watchdog kicks in
[14:11:31] and of course Aborted_connects too
[14:11:45] from 20 to 500...
[14:23:00] so far it looks like a single spike
[14:23:31] * volans looking forward to having more granular metrics
[14:23:44] yes, watchdog usually takes care of it, but we need to check the source if it happens frequently
[14:24:05] I was actually playing with the new toy from Percona for monitoring the other day :)
[14:29:17] tell me more
[14:31:01] it's still in beta but doesn't look bad; it has 2 pieces, one is for monitoring, basically a prometheus + grafana with already-created dashboards
[14:31:49] and the other part is to see query usage/slow queries from performance schema (or the query log, I guess) in a centralized way, something that is also in tendril partially
[14:32:19] so, basically, what we are doing?
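For reference, the kind of per-digest data such a tool (and tendril's reports) can draw from performance_schema; a minimal sketch, assuming performance_schema is enabled with statement digest collection on, against the stock digest summary table:

```sql
-- Top statements by total execution time, aggregated per normalized digest.
-- SUM_TIMER_WAIT is in picoseconds, hence the division down to seconds.
SELECT DIGEST_TEXT,
       COUNT_STAR            AS executions,
       SUM_TIMER_WAIT / 1e12 AS total_latency_s,
       SUM_ROWS_EXAMINED     AS rows_examined
FROM   performance_schema.events_statements_summary_by_digest
ORDER  BY SUM_TIMER_WAIT DESC
LIMIT  10;
```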
[14:32:49] https://phabricator.wikimedia.org/T99485
[14:32:50] this is the "server" part, while on the hosts there are 2 agents running collecting the info/metrics
[14:33:03] https://phabricator.wikimedia.org/T119619
[14:33:36] basically yes for the monitoring part; what I found very cool is that the granularity of the data is dynamic, if you zoom on recent data it goes up to 1s data points
[14:33:46] 1 second?
[14:33:50] I want to look at the prometheus configuration for that
[14:34:00] do they sniff on the wire?
[14:34:25] because otherwise it will have a lot of impact
[14:35:52] agree, the impact has to be checked; IIRC it should use the mysql_exporter and node_exporter for prometheus
[14:36:17] I would also like to see how it scales
[14:36:27] so it is not something "revolutionary", just already packaged
[14:36:39] which we will proceed to steal if it is open source
[14:37:06] but those kinds of things are why I am enabling p_s everywhere
[14:38:01] I saw them shifting from the classical nagios + cacti to something more modern
[14:38:45] technically, we already have that information, log into any node and query p_s.statements_by_query_digest
[14:39:06] we just are not plotting/storing it
[14:39:09] yep, I think the couple of things that make sense to look at are the grafana dashboard templates that are already set up, parametrized by hostname and granularity
[14:39:46] the way mysql_exporter is run, I've seen 3 instances of it, each one collecting different data (and its impact)
[14:40:01] I am "working" (quotes because I do not really have much time to dedicate to it) with filippo to implement that
[14:41:00] and the query analyzer, although it's clearly targeting mysql because, for example, the explain will be executed with format=JSON, which of course failed on a MariaDB I used for testing, and other stuff
[14:41:31] we do not need a graphical query analyzer
[14:41:51] but I miss handler counters on grafana/tendril and latency counters
[14:42:13] plus, you know, all the grafana goodies: zoom, more backlog, etc
[14:42:13] hey folks
[14:42:17] can I get an update on https://phabricator.wikimedia.org/T111654 ?
[14:42:24] either as a post, or an updated description
[14:42:37] the "packaging" part of the server for now is a docker image, while the client is just a tar.gz with an install script
[14:42:50] lots of patches in there, but for someone not fully acquainted with the area it's hard to say what's done/what's left/where we're at
[14:42:54] paravoid: here or on the ticket?
[14:43:05] I can put a summary
[14:43:11] of the current status
[14:43:20] paravoid, the ticket is really long in scope
[14:43:22] ticket please :)
[14:43:35] ok, I'll add a "where are we"
[14:44:18] volans, if you add it, he asked the other day about application connections
[14:45:07] although I would not make it part of the scope of the ticket, I would comment on what would be needed
[14:45:16] ok
[14:50:39] acked lag on dbstore1001 (due to db1038 maintenance)
[14:53:35] ok
[14:54:00] which just finished, I will check homes and reimage it
[14:54:14] so let's merge the change? :D
[14:54:19] yep
[14:55:02] I have a file called "broken_schemas.txt" which doesn't sound good
[14:55:13] definitely not...
[14:55:53] old maintenance
[14:57:38] merged
[15:27:05] db1038 applying puppet
[15:27:22] * volans looking
[15:28:22] I will create an empty instance before reloading the data
[15:31:12] my.cnf looks good
[15:35:11] thanks for https://phabricator.wikimedia.org/T111654#2248862
[15:36:08] :-)
[15:38:28] and it booted well, with no issues. checking the log now
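A hedged spot-check one could run on db1038 after a restart like this, to confirm the rendered my.cnf actually took effect; the variable patterns below are illustrative only, not the actual contents of the gerrit change:

```sql
-- Illustrative only: compare a few live settings against what the new
-- template is expected to render (pick the variables the change touches).
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';
SHOW GLOBAL VARIABLES LIKE 'ssl%';

-- And confirm the instance came back healthy after the restart.
SHOW GLOBAL STATUS LIKE 'Uptime';
SHOW GLOBAL STATUS LIKE 'Aborted_connects';
```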
[15:39:10] no errors or warnings
[15:44:05] recovering data now, ETA 3h
[15:47:00] ok, good (for the config). I just lost my screen