[00:09:22] PROBLEM - MariaDB sustained replica lag on s1 on db1251 is CRITICAL: 722.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1251&var-port=9104
[00:26:22] RECOVERY - MariaDB sustained replica lag on s1 on db1251 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1251&var-port=9104
[05:12:57] swfrench-wmf: Sorry no one got back to you on your phabricator question. I think it is fine to keep both tags and one of us will decide. Normally we just keep DBA if we do have to do something, otherwise (example consultation or something blocked for a long time on someone else) we only leave data-persistence.
[05:37:47] db2212 is down, can someone check? it is the candidate master
[07:20:01] looking
[07:20:32] thank you!
[07:23:26] "it's dead, Jim"
[07:24:36] https://grafana.wikimedia.org/goto/Wk8zXELHg?orgId=1 all the metrics are gone and it's not responding to ssh.
[07:43:30] marostegui: i'm not seeing anything in "sel elist" in ipmitool, any other thing I can check before hard resetting the host?
[07:55:46] Anything from getsel?
[07:56:59] marostegui: i logged on idrac and it seems there was some activity done on the host recently
[07:57:00] https://phabricator.wikimedia.org/P77906
[07:57:33] from line 1 to 6 it's me
[07:58:10] this was not me -> 2025-06-11 15:30:09 USR0030 Successfully logged in using root, from 10.64.48.98 and REDFISH.
[07:59:44] From SAL looks like the host was rebooted yesterday and never came back?
[08:04:10] https://phabricator.wikimedia.org/P77907
[08:04:17] yes
[08:04:34] this is getsel and I'm not seeing much
[08:05:06] BTW the chassis led is on
[08:05:09] Run a hard reset
[08:05:14] And we'll see
[08:05:30] But if it doesn't come back you'll have to ask dcops to check it onsite
[08:06:28] ok doing hard reset
[08:08:01] it's not even powering up
[08:08:02] " The System Configuration Check operation resulted in the following issue: Comm Error: Backplane 0. "
[08:08:50] idrac reports " SYSTEM HAS CRITICAL ISSUES " which does not bode well,
[08:14:36] i see the same error reported to dc-ops for other servers in phab. Opening task for them
[08:22:42] Then yes if it's not even powering on, task required
[08:22:44] Thanks
[09:25:19] federico3: I don't know if you've done it already, but you will have to disable notifications, otherwise the host will page once the downtime (from the reboot) expires
[09:25:56] (sorry I'm in a meeting, I'll do it)
[09:26:00] np!
[10:01:27] marostegui: I'm seeing connection drops across multiple sections e.g. https://grafana.wikimedia.org/goto/_o4RMyYHR?orgId=1 while investigating https://phabricator.wikimedia.org/T396454 - is this a mediawiki deployment?
[10:24:57] for tasks relating to a server or instance can we put the name as a task tag?
[10:34:39] projects cannot be created without discussion: https://www.mediawiki.org/wiki/Phabricator/Project_management#Projects
[10:35:13] I suggest for now the name of the server is used in the title, that way it is more easily searchable
[13:02:11] marostegui: great, thank you for confirming! I'll go ahead and make the necessary alertmanager changes
[13:02:27] Thank you!
[14:50:29] jynus: mysql_exporter natively supports the heartbeat table: https://github.com/prometheus/mysqld_exporter?tab=readme-ov-file#collector-flags
[14:51:42] yes, but does it with the shard and dc?
[14:52:35] e.g. does it understand the multiple rows?
[14:52:58] (not asking really, more of something to take into account)
[16:19:35] can I scrape netbox.wikimedia.org/api/dcim/devices/ without auth from k8s? I don't think so, and if not where can I get a token from?
[16:21:28] e.g. I can create a read-only token from my own user account in netbox but is that ok?
[17:19:25] FIRING: SystemdUnitFailed: check-private-data.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:19:40] FIRING: SystemdUnitFailed: check-private-data.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed