[08:58:01] So the backups of the es hosts seem to have been a bit faster: https://grafana.wikimedia.org/goto/z6dzZUpNR?orgId=1
[09:01:06] despite limiting concurrency to 2
[09:03:56] I will check it myself, but if you saw any performance issues with es[12]036, es[12]040, please ping me
[09:07:20] Some locking at the start, but that is only to be expected
[09:09:13] I will deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124834 (sanity check is welcome) and leave the new hosts enabled, the old ones disabled
[13:01:20] jynus: are you done with m1?
[13:02:45] yeah, only touched it for grant setup
[13:02:53] ok thanks
[13:03:01] I will have to delete the previous grants but that may happen now or in 3 weeks
[13:03:15] probably the latter
[13:03:16] ok, I will probably change the master next week
[13:03:29] good, then will deploy after that
[13:03:57] do you want me to prepare a config change for dbbackups?
[13:04:08] and you can merge at will
[13:04:19] Nah, I can do it, no worries
[13:04:25] I will add you as a reviewer
[13:04:40] up to you, I don't mind doing it either
[13:04:45] it is fine, thanks!
[13:13:17] jynus: which would be the least disruptive day for you to get the m1 master switched?
[13:15:22] any time would work, as long as we try to avoid 0-8 am time
[13:15:33] cool, I will keep that in mind! thanks
[13:16:24] also, there is sometimes clogging when full backups run, the first week of the month
[13:16:28] for bacula
[13:16:41] I will do it next week so that's probably gone
[13:16:50] that is it, mostly
[13:17:13] both bacula and dbbackups are usually idle the rest of the time
[13:17:23] cool
[13:39:06] PROBLEM - MariaDB sustained replica lag on s4 on db1252 is CRITICAL: 33.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[13:39:46] marostegui: any feedback on the changes to the grafana dashboard?
[13:40:04] federico3: Sorry, I didn't have time to properly check yet
[13:40:10] np :)
[13:40:50] Also Amir1 and jynus ^ as users of those dashboards
[13:48:06] RECOVERY - MariaDB sustained replica lag on s4 on db1252 is OK: (C)10 ge (W)5 ge 3.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[13:49:02] about to go to lunch but what changes exactly? :D
[13:52:32] I don't see the version anymore: https://i.imgur.com/J2gM2jG.png
[13:53:50] and this feels a bit broken: https://imgur.com/a6W2A2p
[13:55:12] There are 2 write query stats panels
[13:56:11] I suggest you revert, test the changes on a separate dashboard first and then apply them (e.g. https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy this is unused)
[14:00:15] We could also take the opportunity to migrate the setup to source control for better change tracking
[14:17:20] the change is about 2 aspects:
[14:18:05] 1) switch to the new "Time series" visualization plugin. It visually looks almost identical to the current one but it's faster
[14:18:11] not worried about the change, I am sure it is good. but the current dashboard is broken to me now
[14:19:11] my guess is something got changed by accident on save
[14:20:12] 2) I cloned 2 existing panels to illustrate the chart in "step mode" temporarily and will then remove them. The current line interpolation hides datapoints and "lies" a little bit. The step mode shows each datapoint.
[14:20:40] So testing in production? XD
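(A minimal sketch of what the "step mode" panel described at 14:18:05 and 14:20:12 could look like in the new "Time series" visualization, written as a Python dict rather than raw dashboard JSON. The field names follow the Grafana timeseries panel schema as I understand it; the panel title and settings are illustrative, not copied from the MariaDB dashboard.)

    # Sketch only: an illustrative Time series panel fragment, not the actual
    # dashboard JSON. Verify the field names against a panel exported from Grafana.
    panel = {
        "type": "timeseries",  # the new "Time series" visualization plugin
        "title": "Write query stats (step mode example)",  # hypothetical title
        "fieldConfig": {
            "defaults": {
                "custom": {
                    "drawStyle": "line",
                    "lineInterpolation": "stepAfter",  # step mode: every datapoint is visible
                    "lineWidth": 1,                    # thin line, matching the other dashboards
                    "fillOpacity": 0,
                    "showPoints": "auto",
                },
            },
            "overrides": [],
        },
    }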
[14:22:33] as usual :(
[14:22:57] federico3: Let's revert for now
[14:23:02] can you fix the breakage I mentioned on the images above? I don't mind the actual data details
[14:23:21] feel free to apply those, but the dashboard being broken is a big deal to me
[14:23:35] those = your suggested changes
[14:24:14] ok, reverted
[14:24:30] I'll create a copy of the whole dashboard
[14:24:36] federico3: [14:56:10] I suggest you revert, test the changes on a separate dashboard first and then apply them (e.g. https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy this is unused) -> I think this is a good approach
[14:24:38] federico3: Yeah, that!
[14:24:50] or just apply the change, but make sure the dashboard is usable
[14:24:58] either would work for me
[14:25:35] the json on puppet is more work, so not asking for that, but in case you think you need git involved to track the changes
[14:27:20] So now the version works and the dashboards look ok. I think your suggestions are ok if you can make them work nicely
[14:27:50] I think the intention was good, it just may need additional tweaks
[14:29:14] I think some graph got lost, though, because the dashboard used to finish with disk latency + some other graph, and it no longer does
[14:29:53] Also replication lag is missing
[14:30:28] ah no, I was just looking at a replication-less host
[14:33:47] ok, this is a *new* dashboard called MariaDB https://grafana-rw.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=db1163&var-port=9104
[14:34:26] does that have all the changes you want?
[14:36:44] yes, albeit I'm not sure why Disk Latency is dotted (but it was the same before) and why some charts show nothing
[14:37:04] yeah, replication needs some tuning
[14:37:11] it starts at 1
[14:37:18] I was going to point that out
[14:37:21] because it is now logarithmic
[14:37:22] But the rest looks good to me
[14:37:45] it is a bit harder to read when on a large rate
[14:37:50] *large period
[14:38:49] this is unrelated to federico's change, but I think we lost 1 graph
[14:40:16] too much detail at large zooms: https://i.imgur.com/t7KLs8U.png
[14:40:35] I think this makes things better at low zooms, but worse at higher zooms
[14:40:57] maybe the step can be set up dynamically or made configurable, don't know
[14:44:09] ^ federico3 can you see what I mean?
[14:45:16] ah the processlist is cluttered, yes
[14:45:31] well, anything that is spiky
[14:45:54] and sometimes one just wants "what's the average over a long time"
[14:46:08] so I wonder if we can have the best of both worlds
[14:46:12] does this look ok? https://grafana-rw.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=db1178&var-port=9104&from=now-7d&to=now
[14:46:25] exact stuff at low zooms and smoothing at high
[14:46:39] yes, unfortunately grafana/prometheus are a bit rudimentary when it comes to that
[14:47:11] e.g. maybe we can have an extra parameter for the step (I don't know)
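(On the source-control idea from 14:00:15 and 14:25:35: a minimal sketch of exporting the dashboard JSON so it can be tracked in git, using Grafana's standard /api/dashboards/uid/<uid> endpoint. The token handling and output path are placeholders for illustration, not how the puppet setup works.)

    # Sketch: export a Grafana dashboard as JSON so it can be committed to git.
    import json
    import os

    import requests

    GRAFANA = "https://grafana.wikimedia.org"
    UID = "d251bef4-d946-4bea-a8a5-b02a3546762e"  # the new "MariaDB" dashboard
    TOKEN = os.environ["GRAFANA_TOKEN"]           # hypothetical service account token

    resp = requests.get(
        f"{GRAFANA}/api/dashboards/uid/{UID}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    dashboard = resp.json()["dashboard"]

    # Drop fields that change on every save so diffs stay meaningful.
    dashboard.pop("id", None)
    dashboard.pop("version", None)

    with open("mariadb-dashboard.json", "w") as f:
        json.dump(dashboard, f, indent=2, sort_keys=True)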
[14:48:31] I don't know, I would keep both versions for a while, and personally I will try to use it, put all new stuff on the new dashboard
[14:48:56] and then if there are some things we think don't work we can improve it and remove the old one
[14:49:34] because now I don't know if it is just "it is new" or if there is a practical reason, I would need time (but that is my opinion)
[14:50:01] we freeze adding stuff to the old one so we don't work twice
[14:50:15] and then reevaluate in a few days
[14:50:57] one thing I think is an issue is that all other graphs use a small line
[14:51:07] https://grafana-rw.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status
[14:51:27] which I guess is the default?
[14:51:50] so maybe you can suggest to other teams that they change their style if there is a reason
[14:52:04] otherwise, I would like not to drift away from the general style
[14:52:36] as we are not the only users (cloud, fr, etc use the same dashboard)
[14:53:37] one thing that needs fixing is the replication lag one
[14:54:11] yes we can put line thickness to one
[14:54:34] again, not for me, for uniformity with other dashboards
[14:55:29] I think I like the thick style more, but it is weird compared to the other teams'
[14:56:07] and maybe it helps with the low zoom
[14:56:20] federico3: small side-note, I would recommend against using irate[5m] in those charts, see T371485
[14:56:20] T371485: Grafana MySQL charts can be inconsistent when zooming out - https://phabricator.wikimedia.org/T371485
[14:57:04] using rate and $__rate_interval might also help (partially) with the zoom issues
[14:57:25] dhinus: I haven't made changes in the charts but I'm aware of the issue
[14:57:53] I suggest you work together on that, and try to evangelize for all graphs
[14:57:56] ack, I just wanted you to be aware of that task :)
[14:57:59] as in, for all teams
[14:58:08] the rate calculation stuff in grafana is especially unreliable
[14:58:09] I agree ^
[14:58:18] I think the improvements are worth reviewing
[14:58:42] but I wouldn't go "we will fix it just for us"
[14:58:59] as that could be worse (e.g. if the "bug" is expected everywhere else)
[14:59:11] I know it is more work, but I would like to see everyone working together
[15:00:04] e.g. federico3 maybe commenting there, as it mentions the mysql graph and suggests some changes, etc
[15:00:55] I don't think graphs have to be perfect, as long as people are aware or it is configurable
[15:00:57] is there a wiki page with recommended settings/patterns/tips for grafana for everybody?
[15:01:16] federico3: thank you for volunteering to create one!
[15:01:28] XDDDDD
[15:01:44] There is: https://wikitech.wikimedia.org/wiki/Grafana/Best_practices
[15:02:10] including a timo talk: https://www.youtube.com/watch?v=UlL6UoRUQAM
[15:02:29] https://wikitech.wikimedia.org/wiki/Grafana/Best_practices#Graph_recommended_settings
[15:03:32] in general, I think improving the graphs is nice, they barely changed since there were 0 graphs
[15:03:45] but let's make sure we have input from as many people as reasonable
[15:04:22] and if we decide on new best recommendations, let's add them to that page and promote them to everyone
[15:05:30] I think my suggestions are reasonable (?)
[15:06:59] jynus: big +1 from me, that page is a good start but is not seeing much love, I forgot about it myself :)
[15:09:41] I think your changes + suggestion from dhinus would help make them better
[15:11:58] maybe.. discussing such topics on -observability would help?
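(For context on dhinus's side-note at 14:56:20 and 14:57:04, a small sketch of the query change being suggested, written as Python strings. The metric name is the standard mysqld-exporter counter and the label selector is illustrative; neither is copied from the actual dashboard, so treat both as assumptions.)

    # Current style: irate() over a fixed 5m window. irate() only uses the last
    # two samples inside each window, so once the panel is zoomed out and the
    # query step grows past 5m, most samples are skipped entirely, which is what
    # makes the charts inconsistent when zooming out (T371485).
    query_irate = 'irate(mysql_global_status_queries{instance="$server:$port"}[5m])'

    # Suggested style: rate() over Grafana's $__rate_interval, which scales with
    # the query step (and never drops below a few scrape intervals), so every
    # sample in the displayed range contributes to the averaged rate.
    query_rate = 'rate(mysql_global_status_queries{instance="$server:$port"}[$__rate_interval])'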
[15:12:55] I was writing something similar: the observability team should be involved; even if they have limited resources, they're in the best position to "steward" the guidelines
[15:12:56] yeah
[15:16:30] federico3: it wouldn't hurt also sending an email to cloud and analytics-sre asking for feedback, when dbas are ok with the changes
[15:16:43] as they handle dbs on that dashboard, too
[18:38:07] PROBLEM - MariaDB sustained replica lag on s7 on db1202 is CRITICAL: 1284 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1202&var-port=9104
[18:38:46] looking
[18:39:31] it's healthy now
[18:40:54] icinga-wm: wake up
[18:42:07] RECOVERY - MariaDB sustained replica lag on s7 on db1202 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1202&var-port=9104