[05:56:27] jynus: let me know when I can stop m1 backup source to clone another host
[05:58:03] let me double check
[05:59:23] yeah, any time until Monday night
[05:59:49] ^ marostegui
[06:00:41] Great! Thank you jynus
[06:05:12] checking the email about long running backups
[06:05:42] I am pretty sure it is a false positive about backups getting stuck due to running out of available disk space
[06:38:44] federico3: if you want to meet re: database automation, we can do it today until 16:00 CEST; otherwise it will have to happen next week
[07:04:22] jynus: next week would be better if you don't mind
[07:07:11] 👍
[07:59:49] I think a grafana update or something broke the mysql dashboard
[08:00:16] the version box is now very hard to read
[08:16:23] I made it a bit better, but maybe a real grafana expert can fix it properly
[08:20:01] jynus: I think I fixed the same issue in grafana.wmcloud, but of course I don't remember how :)
[08:20:22] I'm trying to check the difference between the two dashboards
[08:20:26] I added a transformation
[08:20:36] and then made the value instant
[08:21:01] any other tips to transform the label into a proper value? I am looking at other transformations
[08:21:05] here it works correctly: https://grafana-rw.wmcloud.org/d/PTtEnEyVk/toolsdb-mariadb
[08:21:29] I'm going to steal your solution, whatever it was
[08:21:30] thanks
[08:22:23] I think it was just the formatting
[08:24:29] ok found it, there are a series of things to change. do you want me to save the fixed version?
[08:24:43] ah ok you fixed it already
[08:24:50] oh, if you did it already, yeah, please
[08:25:06] but yours has the nice color
[08:25:10] ok, I'm gonna save, then you can always revert
[08:25:12] was that just formatting?
[08:25:16] please do
[08:26:01] it was formatting + selecting "table" in the format, plus selecting "Stat" as the graph type
[08:26:24] plus selecting field: version in the Stat graph options
[08:26:31] I don't see the change, maybe it was my cache
[08:26:39] so I missed that step
[08:27:29] are you looking at this dashboard? maybe there are multiple ones to fix https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb
[08:27:49] yeah, that's not the right one
[08:27:53] the mysql one is
[08:28:20] this one? https://grafana-rw.wikimedia.org/d/000000273/mysql
[08:28:25] yeah
[08:28:33] but I can copy and paste the json now if you want
[08:30:02] I'll fix it manually, give me a sec
[08:30:25] ok, then not touching it
[08:32:25] So I was going in the right direction, I am just terrible at CSS :-D
[08:32:33] LOL
[08:33:42] fixed
[08:34:09] nice, it looks gorgeous now
[08:35:19] you can endorse me for "Grafana layout" on LinkedIn :D
[08:36:11] I am going to prettify some other stuff, like adding the missing units and so on
[08:41:21] dhinus: check some of the later alterations in case they interest you, to import to wmflabs
[08:43:12] I based a lot of stuff on the percona graphs, I may check what they have as defaults to tune things in the future
[08:55:40] thanks, I'll have a look! yeah, I also looked at percona's github at some point to find which queries they were using
[09:07:17] Amir1: are you back?
[09:18:48] I gave the mysql graph a full facelift, but I didn't touch the data/queries. That will need a deeper review later on.
[09:34:48] I think it is clearer now
[09:41:06] marostegui: what are these connection drops we see on all instances? https://grafana.wikimedia.org/goto/j2dhAfYNg?orgId=1
[09:51:01] marostegui: can I start kernel upgrades in s1 codfw?
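For reference, a minimal sketch of the "version" Stat panel fix discussed above (08:20–08:34): an instant query returned in "table" format, an optional labels-to-fields transformation, and the version field selected in the Stat options. It is written as a Python dict mirroring Grafana's panel JSON; the query, title, and field regex are assumptions for illustration, not copied from the real MySQL dashboard.

```python
# Hypothetical sketch of the Stat panel discussed above; expr, title and the
# field regex are placeholders, not the actual dashboard contents.
version_panel = {
    "type": "stat",                # "Stat" as the graph type
    "title": "MariaDB version",
    "targets": [
        {
            "expr": 'mysql_version_info{instance=~"$server.*"}',  # assumed query
            "format": "table",     # "table" in the format
            "instant": True,       # make the value instant
        }
    ],
    "transformations": [
        {"id": "labelsToFields", "options": {}}  # turn the version label into a field
    ],
    "options": {
        "reduceOptions": {
            "values": False,
            "calcs": ["lastNotNull"],
            "fields": "/^version$/",  # "field: version" in the Stat graph options
        }
    },
}
```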
[09:57:02] Yes
[09:57:11] I'll check that graph later
[10:11:10] federico3: I'm back, what's up?
[10:52:07] Amir1: a heads-up about the script for reboot planning: it needed a puppet group removed that was probably dropped; also we have one server where we still need to do the auto_schema change after it has been flipped
[10:54:24] can we validate that drop_afl_patrolled_by_from_abuse_filter_T391056.py is going to run only on the right server?
[10:54:51] run it with --check?
[10:55:07] and then set the replicas value; the replicas value overrides everything
[11:02:51] Amir1: uhm, in the other example scripts it's set to None
[11:03:21] yeah, that's for when you need to run it everywhere
[11:03:49] but if you need to run it on only one or two replicas because of switchover, recovery, etc. then set the value
[11:03:53] ['db1234']
[11:14:43] uhm Result: {"already done in all dbs": ["db2229"]} (from https://phabricator.wikimedia.org/T396509 )
[11:41:35] you're running the afl change? We ran it live on the master on s6, only s1, s4, and s8 needed switchover
[11:44:15] ah yes, s6 was done. Well, most kernel updates are done so I'll generate the switchover tasks
[13:03:21] s6 eqiad backup ran correctly
[14:29:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1219:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:10] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1219:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:46] hello data-persistence friends, just a quick heads-up: I'm going to be removing the dbctl packages from the old buster-based puppetmaster-frontend hosts (`puppetmaster[12]001`).
[14:42:46] ideally no one should be using those for management tasks, and indeed the packages are no longer installed by puppet (i.e., they're "abandoned" there for historical reasons), but still wanted to flag it for visibility :)
[14:42:54] more details in the discussion on https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/82
[15:26:59] the prometheus metric mysql_exporter_collector_duration_seconds includes role and shard as metric dimensions. Do we trust it as an authoritative source?
[15:34:00] it only replicates what's set up in puppet
[15:43:54] {{done}} (dbctl has been removed from `puppetmaster[12]001`)
[23:01:24] PROBLEM - MariaDB sustained replica lag on s2 on db1254 is CRITICAL: 439.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1254&var-port=9104
[23:06:22] RECOVERY - MariaDB sustained replica lag on s2 on db1254 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1254&var-port=9104
[23:25:48] I downtimed it...
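A hypothetical illustration of the replicas override discussed above (10:55–11:03); the variable name and surrounding layout are assumptions for the sake of the example, not the actual auto_schema script structure.

```python
# Hypothetical sketch: in the example scripts the replicas value is None,
# which means "run the schema change on every replica".
replicas = None

# To restrict the change to one or two specific hosts (e.g. after a
# switchover or recovery), set an explicit list instead; this value
# overrides everything else:
replicas = ["db1234"]
```

Before running for real, the script can be invoked with --check (as suggested above) to validate which hosts would actually be touched.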
[23:37:12] PROBLEM - MariaDB sustained replica lag on s4 on db1252 is CRITICAL: 246.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[23:38:47] again, downtimed, I think the firmware update cookbook removes the downtime
[23:39:12] RECOVERY - MariaDB sustained replica lag on s4 on db1252 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
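Regarding the 15:26 question above about mysql_exporter_collector_duration_seconds: a minimal sketch of how the shard/role dimensions on that metric could be inspected through the Prometheus HTTP API. The Prometheus base URL is a placeholder, not the real internal endpoint.

```python
import requests

# Placeholder endpoint; the metric name and the shard/role labels are the
# ones mentioned in the discussion above.
PROMETHEUS = "http://prometheus.example:9090"
QUERY = "count by (shard, role) (mysql_exporter_collector_duration_seconds)"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

# Print each (shard, role) combination and how many series carry it.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    print(labels.get("shard", "?"), labels.get("role", "?"), "->", series["value"][1])
```

As noted in the 15:34 reply, these labels only replicate what is configured in puppet, so they are only as authoritative as the puppet data they come from.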