[00:52:34] FIRING: [3x] DiskSpace: Disk space backup1010:9100:/srv/objectstorage 2.069% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:52:34] FIRING: [3x] DiskSpace: Disk space backup1010:9100:/srv/objectstorage 2.069% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:48:08] PROBLEM - MariaDB sustained replica lag on s6 on db2158 is CRITICAL: 13.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[06:51:08] RECOVERY - MariaDB sustained replica lag on s6 on db2158 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[09:01:12] anyone for a personal dotfile update review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1169640
[09:24:20] https://usercontent.irccloud-cdn.com/file/GwTveCEo/image.png
[09:24:41] @marostegui @cezmunsta what do you think? I can roll it out now
[09:25:02] federico3: context?
[09:25:25] @marostegui it's the host silences as we were discussing last week
[09:25:55] federico3: ah, thanks, I like it. Let's bring it as a topic for the meeting later today so we can review
[09:26:03] ok!
[09:26:25] thanks
[10:11:17] Any ideas what happened to jekins-bot on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1302098 ?
[10:11:38] s/jekins/jenkins
[10:11:52] cezmunsta: Sometimes, if it gets stuck, just add a comment: recheck
[10:11:56] And it may come back
[10:12:15] I thought that the rebase would trigger that, no?
[10:12:25] I will try a comment too
[10:12:56] cezmunsta: https://media.makeameme.org/created/life-is-what-c9cba06f64.jpg
[10:14:38] Normally down to what it has been configured to do from my experience :D
[10:15:55] "Starting gate-and-submit job" was what it was stuck on before
[12:35:35] federico3: this is the section of user_grant_handler that I was referring to https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/blob/main/user_grant_handler.py#L242-260
[12:37:21] cezmunsta: those are the queries we want to execute and the script is meant to be modified locally to put the desired ones similarly to other scripts. They get populated without hardcoding ipaddr/usernames/passwords.
[12:37:43] I don't like the editing of code to run it
[12:37:47] (we could load them from a json file instead)
[12:38:04] If I ran that code now then wouldn
[12:38:11] 't it drop the user?
[12:38:12] neither do I, but it's a pattern that existed in similar tools
[12:38:52] if you prefer we can just load a little json/yaml/toml file
[12:39:26] Why not SQL file, one DML per line and read line-by-line?
[12:40:37] the modify queries could be more than one.. maybe 2 sql files?
[12:43:27] Using argparse.FileType for files, then add an argument per source file
[12:47:35] cezmunsta: how about we add the loading of queries and move the whole script into a cookbook?
[12:52:31] federico3: +1 ...
mind if I give that a go, as I was planning to do that, hence the speed-up of omg so that collecting grants before and after was fast enough to use
[12:56:44] cezmunsta: ok, you'd need to create a cookbook and its test file by grabbing both from https://gitlab.wikimedia.org/ladsgroup/db-password-rotation and the syrupy __ambr__ files and add the test deps to pyproject.toml (I asked elukey and it should be ok)
[12:58:54] I will take a look and let you know if I've any queries etc
[13:14:44] marostegui: I wrote https://wikitech.wikimedia.org/wiki/MariaDB/Upgrading_a_section#Upgrading_sanitarium_masters - but how to identify all the hosts to be downtimed across all the sections?
[13:15:09] federico3: as you are going to do both hosts, you can do clouddb* and an-redact*
[13:15:14] as all will be impacted
[13:15:25] federico3: that doc is not correct
[13:17:57] federico3: I used 'A:db-section-x3 and (A:db-sanitarium or A:db-clouddb-sanitization)' and then 'clouddb10[22,23]*' when doing x3
[13:18:03] federico3: sanitarium hosts do not have misc section, they run s* sections. https://phabricator.wikimedia.org/P94142
[13:19:20] @marostegui thanks, updating
[13:55:14] @marostegui there's also an x3 instance
[13:55:31] federico3: yes, it is there, 3363
[14:05:00] Doing db1196 whilst sanitarium reboots take place
[14:06:45] cezmunsta: downtime it explicitly
[14:07:34] Yep, if they are all green once the reimage completes then I will remove the downtime, else ping you
[14:16:44] both hosts are rebooted and MariaDB running
[14:19:38] ack
[14:32:34] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:03] federico3: do you happen to know if db1262 got restarted around 12h ago by any rolling restart script?
I am investigating if it got restarted by us or crashed on its own
[14:34:42] reboot system boot 6.1.0-48-amd64 Mon Jun 15 02:02 still running
[14:35:31] I think it crashed cause mariadb did a recovery, which wouldn't happen if we did a normal reboot
[14:36:14] Does journald use persistent logs?
[14:37:55] @marostegui the restarts were finished many days before IIRC
[14:39:44] around the 3/4 jun
[14:44:06] federico3: great
[14:44:08] thanks
[15:21:00] federico3: I have cleared the downtime from earlier
[15:22:50] oddly, 'db115[45]*' could not be used as it produced "no hosts provided" - any ideas? I did them separately instead
[15:27:13] I did them separately as well before trying to create the regex
[15:33:06] somehow some packages were built https://gitlab.wikimedia.org/repos/sre/wmfdb/-/jobs?kind=BUILD
[15:49:39] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
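Note on the 12:39-12:43 suggestion (SQL files with one statement per line, read line by line, passed in via argparse.FileType with one argument per source file): a minimal sketch of how that could look. The flag names --create-sql and --modify-sql and the helper names are hypothetical, chosen only for illustration; they are not taken from user_grant_handler.py, and the sketch only loads the statements without executing anything.

```python
import argparse


def read_statements(handle):
    """Return the SQL statements in an open file, one statement per line.

    Blank lines and lines starting with "--" (SQL comments) are skipped.
    """
    statements = []
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("--"):
            continue
        statements.append(line)
    return statements


def build_parser():
    """Build a parser with one file argument per query source.

    The flag names are hypothetical; the chat only proposes "an argument
    per source file".
    """
    parser = argparse.ArgumentParser(
        description="Load SQL statements from files instead of editing the script."
    )
    parser.add_argument(
        "--create-sql",
        type=argparse.FileType("r"),
        required=True,
        help="file with one statement per line (hypothetical flag)",
    )
    parser.add_argument(
        "--modify-sql",
        type=argparse.FileType("r"),
        required=True,
        help="file with one statement per line (hypothetical flag)",
    )
    return parser


def load_queries(argv=None):
    """Parse the arguments and return the statements grouped by source file."""
    args = build_parser().parse_args(argv)
    try:
        return {
            "create": read_statements(args.create_sql),
            "modify": read_statements(args.modify_sql),
        }
    finally:
        # argparse.FileType opens the files during parsing, so close them here.
        args.create_sql.close()
        args.modify_sql.close()
```

Calling load_queries(["--create-sql", "create.sql", "--modify-sql", "modify.sql"]) would return a dict of two lists. Keeping the statements in files means nothing destructive runs unless someone passes it in explicitly, which addresses the "editing of code to run it" concern raised at 12:37.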