[08:58:01] <jynus>	 So the backups of es hosts seems it was a bit faster: https://grafana.wikimedia.org/goto/z6dzZUpNR?orgId=1
[09:01:06] <jynus>	 despite limiting concurrency to 2
[09:03:56] <jynus>	 I will check it myself, but if you saw any issue of performance with es[12]036, es[12]040, please ping me
[09:07:20] <jynus>	 Some locking at the start, but that is to be expected only
[09:09:13] <jynus>	 I will deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124834 (sanity check is welcome) and leave the new hosts enabled, the old ones disabled
[13:01:20] <marostegui>	 jynus: are you done with m1?
[13:02:45] <jynus>	 yeah, only touched it for grant setup
[13:02:53] <marostegui>	 ok thanks
[13:03:01] <jynus>	 I will have to delete the previous grants but that may happen now or in 3 weeks
[13:03:15] <jynus>	 probably the latter
[13:03:16] <marostegui>	 ok, I will probably change the master new week
[13:03:29] <jynus>	 good, then will deploy after that
[13:03:57] <jynus>	 do you want me to prepare a config change for dbbackups?
[13:04:08] <jynus>	 and you can merge at will
[13:04:19] <marostegui>	 Nah, I can do it, no worries
[13:04:25] <marostegui>	 I will add you as a reviewer
[13:04:40] <jynus>	 up to you, I don't mind doing it either
[13:04:45] <marostegui>	 it is fine, thanks!
[13:13:17] <marostegui>	 jynus: when would it be the less disrusptive day for you to get m1 master switched?
[13:15:22] <jynus>	 any time would work, as long as we try to avoid 0-8 am time
[13:15:33] <marostegui>	 cool, I will keep that in mind! thanks
[13:16:24] <jynus>	 also, there is sometimes clogging when full backups run, the first week of the month
[13:16:28] <jynus>	 for bacula
[13:16:41] <marostegui>	 I will do it next week so that's probably gone
[13:16:50] <jynus>	 that is it, mostly
[13:17:13] <jynus>	 both bacula and dbbackups are usually idle the rest of the time
[13:17:23] <marostegui>	 cool
[13:39:06] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1252 is CRITICAL: 33.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[13:39:46] <federico3>	 marostegui: any feedback on the changes to the grafana dashboard? 
[13:40:04] <marostegui>	 federico3: Sorry, I didn't have time to properly check yet
[13:40:10] <federico3>	 np :)
[13:40:50] <marostegui>	 Also Amir1 and jynus  ^ as users of those dashboards
[13:48:06] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1252 is OK: (C)10 ge (W)5 ge 3.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[13:49:02] <Amir1>	 about to go to lunch but what changes exactly? :D
[13:52:32] <jynus>	 I don't see the version anymore: https://i.imgur.com/J2gM2jG.png
[13:53:50] <jynus>	 and this feels a bit broken: https://imgur.com/a6W2A2p
[13:55:12] <jynus>	 There is 2 write query stats panels
[13:56:11] <jynus>	 I suggest you revert, test the changes on a separate dashboard first and then apply them (e.g. https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy this is unused)
[14:00:15] <jynus>	 We could also take the opportunity to migrate the setup to source control for bette change tracking
[14:17:20] <federico3>	 the change is about 2 aspects:
[14:18:05] <federico3>	 1) switch to the new "Time series" visualization plugin. It visually looks almost identical to the current one but it's faster 
[14:18:11] <jynus>	 not worried about the change, I am sure it is good. but the current dashboard is broken to me now
[14:19:11] <jynus>	 my guess is something got changed by accident on save
[14:20:12] <federico3>	 2) I cloned 2 existing panels to illustrate the chart in "step mode" temporarily and then remove them. The current line interpolation hides datapoints and "lies" a little bit. The step mode shows each datapoint.  
[14:20:40] <jynus>	 So testing in production ? XD
[14:22:33] <federico3>	 as usual :(
[14:22:57] <marostegui>	 federico3: Let's revert for now
[14:23:02] <jynus>	 can you fix the breakage I mentioned on the images above? I don't mind the actual data details
[14:23:21] <jynus>	 feel free to apply those, but the dashboard being broken is a big deal to me
[14:23:35] <jynus>	 those = your suggested changes
[14:24:14] <federico3>	 ok, reverted
[14:24:30] <federico3>	 I'll create a copy of the whole dashboard
[14:24:36] <marostegui>	 federico3: [14:56:10]  <jynus> I suggest you revert, test the changes on a separate dashboard first and then apply them (e.g. https://grafana.wikimedia.org/d/21pxVYS7z/jaimes-mysql-aggregated-copy this is unused) -> I think this is a good approach
[14:24:38] <marostegui>	 federico3: Yeah, that!
[14:24:50] <jynus>	 or just apply the change, but make sure the dashboard is usable
[14:24:58] <jynus>	 either would work to me
[14:25:35] <jynus>	 the json on puppet is more work, so not asking for that, but in case you think you need git involved to track the changes
[14:27:20] <jynus>	 So now version works and dashboards look ok. I think your suggestions are ok if you can make them work nicely
[14:27:50] <jynus>	 I think the intention was good, just may need additional tweaks
[14:29:14] <jynus>	 I think some graph got lost, though because the dashboard used to finish with disk latency + some other graph, and it no longer does
[14:29:53] <jynus>	 Also replication lag is missing
[14:30:28] <jynus>	 ah no, I was just looking at a repicationless host
[14:33:47] <federico3>	 ok, this is a *new* dashboard called MariaDB https://grafana-rw.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=db1163&var-port=9104
[14:34:26] <jynus>	 does that have all the changes you want?
[14:36:44] <federico3>	 yes, albeit I'm not sure why Disk Latency is dotted (but it was the same before) and why some charts show nothing
[14:37:04] <jynus>	 yeah, replication needs some tuning
[14:37:11] <jynus>	 it starts at 1
[14:37:18] <marostegui>	 I was going to point that
[14:37:21] <jynus>	 because it is now logarithmic
[14:37:22] <marostegui>	 But the rest looks good to me
[14:37:45] <jynus>	 it is a bit harder to read when on a large rate
[14:37:50] <jynus>	 *large period
[14:38:49] <jynus>	 this is unrelated to federico's change, but I think we lost 1 graph
[14:40:16] <jynus>	 too much detail at large zooms: https://i.imgur.com/t7KLs8U.png
[14:40:35] <jynus>	 I think this makes better things at low zooms, but a worse a higher zooms
[14:40:57] <jynus>	 maybe the step can be setup dynamicly or configurable, don't know
[14:44:09] <jynus>	 ^ federico3 can you see what I mean?
[14:45:16] <federico3>	 ah the processlist is cluttered, yes
[14:45:31] <jynus>	 well, anything that is spikey
[14:45:54] <jynus>	 and sometimes one just wants "what's the average over a long time"
[14:46:08] <jynus>	 so I wonder if we can have the best of both worlds
[14:46:12] <federico3>	 does this look ok? https://grafana-rw.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=db1178&var-port=9104&from=now-7d&to=now
[14:46:25] <jynus>	 exact stuff at low zooms and smoothing at high
[14:46:39] <federico3>	 yes, unfortunately grafana/prometheus are a bit rudimentary when it comes to that
[14:47:11] <jynus>	 e.g. maybe we can have an extra parameter for the step (I don't know)
[14:48:31] <jynus>	 I don't know, I would keep both versions for a while, and personally I will try to use it, put all new stuff on the new dashboard
[14:48:56] <jynus>	 and then if there are some things we think doesn't work we can improve it and remove the old one
[14:49:34] <jynus>	 because now I don't know if it is just "it is new" or is there a practical reason, I would need time (but that is my opinion)
[14:50:01] <jynus>	 we freeze adding stuff to the old one so we don't work twice
[14:50:15] <jynus>	 and then reevaluate in a few days
[14:50:57] <jynus>	 one thing I think it is an issue is that all other graphs use a small line
[14:51:07] <jynus>	 https://grafana-rw.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status
[14:51:27] <jynus>	 which I guess it is the default?
[14:51:50] <jynus>	 so maybe you can suggest other people teams changing their style if there is a reason
[14:52:04] <jynus>	 otherwise , I would like not to drift away from the general style
[14:52:36] <jynus>	 as we are not the only users (cloud, fr, etc use the same dashboard)
[14:53:37] <jynus>	 one thing that needs fixing is the replication lag one
[14:54:11] <federico3>	 yes we can put line thickness to one
[14:54:34] <jynus>	 again, not for me, for uniformization with other dashboards
[14:55:29] <jynus>	 I think I like more the thick part, but it is weird compared to the other team's
[14:56:07] <jynus>	 and maybe it helps with the low zoom
[14:56:20] <dhinus>	 federico3: small side-note, I would recommend against using irate[5m] in those charts, see T371485
[14:56:20] <stashbot>	 T371485: Grafana MySQL charts can be inconsistent when zooming out - https://phabricator.wikimedia.org/T371485
[14:57:04] <dhinus>	 using rate and $__rate_interval might also help (partially) with the zoom issues
[14:57:25] <federico3>	 dhinus: I haven't made changes in the charts but I'm aware of the issue
[14:57:53] <jynus>	 I suggest you work toghether on that, and try to evangelize for all graphs
[14:57:56] <dhinus>	 ack, I just wanted you to be aware of that task :)
[14:57:59] <jynus>	 as in, for all teams
[14:58:08] <federico3>	 the rate calculation stuff in grafana is especially unreliable
[14:58:09] <marostegui>	 I agree ^
[14:58:18] <jynus>	 I think the improvements are worth reviewing
[14:58:42] <jynus>	 but I wouldn't go "we will fix it just for us"
[14:58:59] <jynus>	 as that could be worse (e.g. if the "bug" is expected everywhere else)
[14:59:11] <jynus>	 I know it is more work, but I would like to see everyone working toghether
[15:00:04] <jynus>	 e.g. federico3 maybe commenting there, as it mentions the mysql graph and suggest some change, etc
[15:00:55] <jynus>	 I don't think graphs have to perfect, as long as people are aware or it is configurable
[15:00:57] <federico3>	 is there a wiki page with recommended settings/patterns/tips for grafana for everybody?
[15:01:16] <jynus>	 federico3: thank you for volunteering to create one!
[15:01:28] <marostegui>	 XDDDDD
[15:01:44] <jynus>	 There is: https://wikitech.wikimedia.org/wiki/Grafana/Best_practices
[15:02:10] <jynus>	 including a timo talk: https://www.youtube.com/watch?v=UlL6UoRUQAM
[15:02:29] <jynus>	 https://wikitech.wikimedia.org/wiki/Grafana/Best_practices#Graph_recommended_settings
[15:03:32] <jynus>	 in general, I think improving the graphs is nice, they barely changed since there was 0 graphs
[15:03:45] <jynus>	 but let's make sure we have input from as many people as reasonable
[15:04:22] <jynus>	 and if we decide on a new best recommendations, let's add them to that page and promote to everyone
[15:05:30] <jynus>	 I think my suggestions are reasonable (?)
[15:06:59] <dhinus>	 jynus: big +1 from me, that page is a good start but is not seeing much love, I forgot about it myself :)
[15:09:41] <jynus>	 I think your changes + suggestion from dhinus would help make them better
[15:11:58] <federico3>	 maybe.. discussing such topics on -observability would help?
[15:12:55] <dhinus>	 I was writing something similar, the observability team should be involved, even if they have limited resources they're in the best position to "steward" the guidelines
[15:12:56] <jynus>	 yeah
[15:16:30] <jynus>	 federico3: it wouldn't hurt also sending an email to cloud and analytics-sre asking for feedback, when dbas are ok with the changes
[15:16:43] <jynus>	 as they handle dbs on that dashboard, too
[18:38:07] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s7 on db1202 is CRITICAL: 1284 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1202&var-port=9104
[18:38:46] <federico3>	 looking
[18:39:31] <federico3>	 it's healthy now
[18:40:54] <federico3>	 icinga-wm: wake up
[18:42:07] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s7 on db1202 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1202&var-port=9104