[07:37:00] federico3: I think https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1120214 may have broken the cloning cookbook: https://phabricator.wikimedia.org/P73924
[07:59:40] PROBLEM - MariaDB sustained replica lag on s4 on db1190 is CRITICAL: 94 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1190&var-port=9104
[08:02:40] RECOVERY - MariaDB sustained replica lag on s4 on db1190 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1190&var-port=9104
[08:27:49] FIRING: PuppetFailure: Puppet has failed on ms-be1080:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:28:53] yeah, dead disk. I'll get a ticket opened in a little and then downtime it for longer
[09:13:12] volans: are there ways to check out different version of a cookbook on a cumin host? If not, would you recommend alternatives to provide different versions?
[09:14:13] e.g. using test-cookbook ?
[09:14:38] federico3: what is the use case?
[09:15:07] being able to test incremental improvements and rollback as needed
[09:17:05] if incremental improvements are in their own CRs, tested and merged, usually there is no need to have multiple versions of something.
[09:17:47] multiple CRs that are chained (in the same branch) can be tested with the test-cookbook because if you test the last one in the chain it will contain the others too.
[09:24:13] if something needs to be rewritten from scratch without touching the original it's always possible to create a new cookbook. Usually it's not needed.
[09:24:16] hope that helps
[09:30:05] well I used test-cookbook already, ok we can use this
[09:30:30] does it log which version is used, when, by whom?
[09:42:26] the logs are in your home (see the help message from test-cookbook). I don't think we log into file the actual invocation, we can probably easily add it, but the test-cookbook allows also to test local modifications on disk so in those cases it would just say with local modifications
[10:26:54] Emperor: the thumbnail def-fragment patch is merged in core and enabled in beta cluster (with 50% of images)
[10:27:10] Thanks to James_F <3
[10:29:25] cool, I guess we see if there are complaints or fire (both unlikely on beta?)
[10:31:20] yup, once I'm back, I'll enable it on production on 1% or so
[10:31:30] (currently in an airport :D)
[10:52:13] PROBLEM - MariaDB sustained replica lag on s8 on db2166 is CRITICAL: 4559 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2166&var-port=9104
[10:52:23] looking
[10:54:26] db2166 pa.ged, federico let me know if you need anything. I'll leave the troubleshooting to you if that's ok?
(I'm oncall at the moment)
[10:55:10] depooling it
[10:55:50] it was already depooled when I tried it
[10:56:04] great thanks
[11:06:15] RECOVERY - MariaDB sustained replica lag on s8 on db2166 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2166&var-port=9104
[11:11:35] federico3: if it was automation, maybe a task can be set up to see what can be changed to avoid alerting (I don't have the context, maybe it makes no sense what I am saying)
[11:11:52] yes I'm changing the tooling as we speak :D
[11:12:05] But if it makes sense, creating a task or commenting on an existing one would be useful
[11:42:15] PROBLEM - MariaDB sustained replica lag on s4 on db2206 is CRITICAL: 78.5 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2206&var-port=9104
[11:42:45] PROBLEM - MariaDB sustained replica lag on s4 on db1249 is CRITICAL: 125 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1249&var-port=9104
[11:45:15] RECOVERY - MariaDB sustained replica lag on s4 on db2206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2206&var-port=9104
[11:45:45] RECOVERY - MariaDB sustained replica lag on s4 on db1249 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1249&var-port=9104
[14:13:41] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 6231 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[14:14:53] federico3: db1248 p.aged, is that your host?
[14:16:55] replied on -operation :)
[14:17:15] thanks yep
[14:27:00] <_joe_> Please begin without me, I will be a few minutes late
[14:27:18] <_joe_> sorry, brought into a meeting 15 minutes ago and trying to get out of it
[14:29:08] 🎉
[14:56:41] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[17:37:40] can I migrate the MySQL prometheus dashboard panels to the new "Time series" visualization plugin? Also set the charts as "step" for better clarity?
[17:38:12] federico3: Can you create one as an example so we can see it?
[17:38:44] for the first question or the second?
[17:39:29] federico3: The end result you want to achieve :)
[17:40:51] https://grafana-rw.wikimedia.org/d/000000273/mysql?forceLogin&from=now-3h&orgId=1&refresh=1m&to=now&var-job=All&var-port=9104&var-server=db1176
[17:41:36] there's two "Write Query Stats": the one on the right is the new Time series (when I remove it the panels will reflow correctly in place)
[17:47:38] regarding the step chart, see two "Sorting" panels in https://grafana-rw.wikimedia.org/d/000000273/mysql?forceLogin&from=now-3h&orgId=1&refresh=1m&to=now&var-job=All&var-port=9104&var-server=db1176
[17:48:42] the line interpolation hides datapoints and "lies" a little bit.
The step mode shows each datapoint
[22:22:18] Well, Beta cluster's thumbor is not very healthy https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/2/20/OOjs_UI_icon_heart-invert.svg/240px-OOjs_UI_icon_heart-invert.svg.png
[22:22:36] https://en.wikipedia.beta.wmflabs.org/wiki/Special:ListFiles
[22:26:11] beta cluster doesn't have a thumbor
[22:26:36] it's supposed to be doing it via native mw thumbnailling afaik, but I don't really know how that works too well
[22:27:05] that explains a lot. Thanks
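
Side note on the 17:48 remark that line interpolation hides datapoints while step mode shows each one: a minimal Python sketch of the difference, not tied to Grafana's actual rendering code. The sample values and the helper names `linear_at` and `step_at` are hypothetical, and the step variant assumes a "hold the last sample" style.

```python
from bisect import bisect_right

def linear_at(samples, t):
    """Linearly interpolate between the two samples surrounding time t."""
    times = [s[0] for s in samples]
    i = bisect_right(times, t)
    if i == 0:
        return samples[0][1]
    if i == len(samples):
        return samples[-1][1]
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def step_at(samples, t):
    """Hold the most recent sample at or before t (assumed step style)."""
    times = [s[0] for s in samples]
    i = bisect_right(times, t)
    return samples[max(i - 1, 0)][1]

# Hypothetical scrape results as (seconds, value) pairs; not real data.
samples = [(0, 0.0), (60, 0.0), (120, 90.0), (180, 0.0)]
observed = {value for _, value in samples}

# Halfway between two scrapes the line chart draws a value nobody measured,
# while the step chart keeps showing the last real datapoint.
print(linear_at(samples, 90), linear_at(samples, 90) in observed)
print(step_at(samples, 90), step_at(samples, 90) in observed)
```

With these made-up samples the interpolated line passes through a value that was never scraped, which is the sense in which it "lies"; the step rendering only ever shows scraped values.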