[08:48:28] apparently db1150 crashed or something
[08:57:38] yes
[08:59:08] MegaRAID CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[08:59:45] a dbstore multiinstance
[09:04:44] File /var/log/journal/2e7308cf319647999b444febbd9ad271/system.journal corrupted or uncleanly shut down, renaming and replacing. <--- not a good sign
[09:11:28] federico3: I am handling it https://phabricator.wikimedia.org/T405885
[09:18:24] shall we replace memory anyways?
[09:24:12] I need to handle backups first, that is the priority
[09:24:19] please don't handle this, as I am doing it
[09:24:31] db1150 is a backup source
[09:25:30] if you want to help me, please have a look at T405711
[09:25:31] T405711: cumin2002 and cumin1003 doesn't have grants to be able to administrate all databases that require backups - https://phabricator.wikimedia.org/T405711
[09:34:51] ok
[09:42:31] I am upgrading db1245 so it can take over db1150 backups while servicing it
[09:43:40] which means I had to stop mediabackups on eqiad, what a waterfall of cause and effect
[10:28:42] having handled all my alerts, and being blocked on backup completion, I am going to finally have my morning coffee to properly wake up
[13:03:40] kavitha: are you joining us for the team meeting today?
[13:04:22] Yes, Okta is acting
[13:04:30] Will be there in a minute or two
[13:05:03] 👍
[14:56:30] s4 backups got recovered, pending s3 finishing
[14:59:18] \o/
[15:28:38] jynus: you should have grants working from the 3 cumin hosts for T405711
[15:28:39] T405711: cumin1002, cumin1003,cumin2002 do not have grants to be able to administrate all databases that require backups - https://phabricator.wikimedia.org/T405711
[15:36:26] Thanks, could the task maybe be expanded (of course, not to fix now, but to decide at some point) to cover what to provide access to and what not? My biggest pet peeve is that everything that should be on zarcillo should have access (but ok to remove it if we decide it should not)
[15:37:10] basically, IMHO either we should provide access from cumin and be tracked on zarcillo or be removed from both
[15:37:32] but that can be discussed long term, I don't need a fix for that now
[17:23:05] I've started the data recovery of db1150, will wait until tomorrow to decide what to do with it
[18:28:04] FIRING: MysqlPredictiveFreeDiskSpace: Host db1150:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[18:30:38] siiiiiiigh
[18:31:07] ah, it's a backup source and it's already being taken care of
[18:31:10] Thank you!
[19:48:04] RESOLVED: MysqlPredictiveFreeDiskSpace: Host db1150:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[20:09:27] that's exactly why I said I don't like predictive warnings, they will get false positives during provisioning
[20:09:39] it will fire again when loading s3
[20:10:00] predictive disk space thingies should be just a dashboard, not an alert
[21:37:04] FIRING: MysqlPredictiveFreeDiskSpace: Host db1150:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[21:47:04] RESOLVED: MysqlPredictiveFreeDiskSpace: Host db1150:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace