[00:30:56] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:56] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:02:16] s7 eqiad switched
[06:03:34] federico3: I will let you know when you can apply your schema change to db1181, let me reimage first.
[06:04:15] @marostegui thanks
[06:06:20] db1181 also needs a kernel update, db1236 was updated yesterday
[06:40:55] federico3: db1181 is being reimaged :)
[06:41:38] ok!
[07:13:40] federico3: db1181 is ready for you
[07:13:46] it is pooled
[07:13:48] ok thanks
[07:27:05] The rclone errors are because we lost a race with the rename in https://commons.wikimedia.org/w/index.php?title=File:Hepatitis_death_rate,_1980_to_2021,_GBR.svg&action=history
[07:28:56] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:28] m2 backups are now taking almost 8 hours :-(
[09:02:22] compared to s4's 1h30m
[09:03:34] Time to Delete All The Things :)
[09:04:33] jynus: I guess vrts :(
[09:04:59] I can check
[09:11:32] marostegui: 100% right https://phabricator.wikimedia.org/P93507
[09:11:45] :_(
[09:15:57] federico3: this ticket seems to have been done? https://phabricator.wikimedia.org/T426095
[09:16:54] yes I'll close it
[09:17:04] +1
[11:03:50] Starting s8 codfw switchover
[11:05:33] 🤞
[11:19:42] s8 codfw done
[11:22:06] thanks!
[12:19:08] '(A:db-section-s8 or A:db-section-x3) and (A:db-sanitarium or A:db-clouddb-sanitization)' - this doesn't find clouddb10[22,23], the alias doesn't define them - is that an oversight, or expected?
[12:20:49] cezmunsta: I think that is by design, dbs and clouddbs have different owners
[12:21:09] as well as different access levels, vlan, etc.
[12:21:34] db-clouddb-sanitization: P{an-redacteddb1001.eqiad.wmnet or clouddb1016.eqiad.wmnet or clouddb1020.eqiad.wmnet}
[12:21:36] it is true it is confusing having 2 ways to classify db hosts
[12:21:42] I am not sure why the others aren't there
[12:21:45] by section/content/replica set
[12:21:50] dhinus: ^
[12:22:03] any idea why?
[12:22:04] and by function
[12:22:21] ah, I know why
[12:22:37] it's a holiday in italy today, d.hinus is not around
[12:22:42] you are using a host tool, but section is really an instance classification
[12:23:06] clouddb is not just s8, it is probably s8 + other stuff
[12:23:21] taavi: ah thanks!
[12:23:53] so cumin is not great for instance orchestration, and that is why I set out to create the zarcillo db, which has host and instance planes
[12:24:30] jynus: I think you may be confusing things. The cumin alias should return all clouddb, but in that alias that I pasted only 2 are there, and I am not sure why only 2, let's check with wmcs team
[12:24:44] ah, yes, I am getting confused
[12:25:08] so my problem is real but unrelated
[12:25:21] jynus: I think cezmunsta means only the cumin aliases
[12:25:43] And the one I pasted above only has those 2, but I don't know why, those hosts aren't different from any other
[12:25:46] yep
[12:25:47] So let's wait for dhinu.s tomorrow
[12:25:51] The alias was set/last changed in 2024, perhaps the missing ones are newer than that?
[12:26:55] apologies for consusing you
[12:27:00] *confusing
[12:27:27] jynus: no worries :)
[12:28:45] > perhaps the missing ones are newer than that
[12:29:02] ^ is true
[12:29:04] and yesterday I remembered another issue why I stopped automating dump loading
[12:29:24] and I try to phish any new hire into fixing this
[12:29:34] cezmunsta: Those hosts are older than 2024, so they should be there unless there's a reason for it that we don't know - let's see what wmcs team says
[12:30:17] I was blocked on grant handling being fully automated
[12:30:44] marostegui: ack .. maybe just the hieradata has partial history then
[12:30:55] jynus: skip-grant-tables ?
[12:31:12] cezmunsta: what do you mean?
[12:31:55] Aren't the grants in the dump?
[12:32:10] nope, they are in puppet
[12:32:23] but there is no automatic mechanism to load them atm
[12:32:30] they are on the snapshots, though
[12:33:41] they shouldn't be on puppet, however, they should be on their own orchestration for easy password change, etc. (some of that is automated, but not fully integrated)
[12:33:58] I think there is a ticket for that, I don't want to look at that
[12:34:09] but just explaining the questions you had for me
[12:34:37] (I mean, I would love if you looked at that, I know you cannot do that atm, you dbas have larger priorities)
[12:35:35] I guess that https://phabricator.wikimedia.org/T427884 will shed a little more light on this :)
[12:35:59] let me find the master ticket
[12:37:04] this is the epic, some of those are fixed: https://phabricator.wikimedia.org/T146149
[12:37:34] ack
[12:37:36] that was some of the first audits I did and the dream for me
[12:38:05] to be able to close, but again, if you look at the year, that's too large for a small project
[12:39:21] and I wouldn't discard some passwordless authentication system to handle that
[12:40:10] things looked bad in 2016, now there is automation, monitoring, etc.
[12:40:20] but we could do better
[12:43:03] marostegui: what would you think about resolving T199061, I think the original scope is fulfilled and we should open one for an improvement
[12:43:27] yeah, that's done
[12:43:30] (not now)
[12:43:53] resolve it now, I mean, improve at a later time
[12:44:39] https://phabricator.wikimedia.org/T199061#4600803 I guess almost 8 years of time running it is enough to see it worked fine XD
[12:44:45] yeah
[12:45:09] so it is not as bad as 2016, cezmunsta, but it could be a long term project (not for me to say)
[12:46:22] my pet peeves would be that and the processlist monitoring, as big blockers of other projects or debugging
[12:47:25] https://www.percona.com/blog/using-vault-mysql/ - no need to create the users, just the management user ... also from 2016 :)
[12:47:58] <3
[12:48:29] I am not too worried about the stack, that's the easy part tbh
[12:48:49] it is more about integrating it, customizing it to our needs, etc.
[12:49:13] but I would love for something like that to happen
[12:50:11] as for recoveries I was like: "why should work on fully automation if we still need to copy and paste to setup a new host"
[12:50:17] *I work
[12:51:02] What I am proud of is, despite not being automated, documenting grants offline on puppet
[12:52:21] I think that will pay off when that happens, but now it is very hard and cumbersome to work with flat text files, when it could be a relational db or vault or something like that
[12:54:16] another thing I learned is that code doesn't matter in the long term, architecture does. Code comes and goes, but architecture stays forever.
[14:07:27] Icinga "MariaDB sustained replica lag on" caused another incomplete cookbook run when reimaging
[14:08:58] incomplete as in "need to reimage again" or just a non pass because some things were red?
[14:09:12] if the second, you can ignore those
[14:09:46] Whilst it waits for up to 15 iterations of the "all clear" check before repooling
[14:12:08] yeah, it is ok to be red while it recovers, as long as it eventually gets green
[14:13:44] The total number of attempts = 15 and each one is previous_wait + 3s... so that is really tight on the "all green" from Icinga
[14:15:20] yeah, those are thought for stateless hosts
[14:15:34] imagine, when provisioning a host it takes way more than that
[14:16:43] we tried having a few meetings with infra so they can better cater to our needs
[14:25:59] It would seem trivial to be able to pass in a delay instead of hardcoding timedelta(seconds=3) for the retry
[14:27:25] We could equally catch those exceptions when suitable and retry the wait in the cookbook
[14:28:21] It is the second time that it actually broke on me and there were a number of last-iteration passes that have occurred, so nearly more
[14:32:25] my swift-in-pontoon cluster can now successfully run our rewrite_integration_test file :)
[14:34:24] *\o/*
[14:37:15] (and contains 3 originals and one thumbnail, as well as the dispersion-check guff)
[15:07:16] moritzm: I am happy that I did the homework early for trixie (all my packages seem to be ready): https://phabricator.wikimedia.org/T427897#11976933
[15:11:50] ^ this are the typical issues I find when upgrading to trixie cezmunsta (my package is most likely ok, but dependencies were not fully ready)
[15:11:59] *these
[15:12:36] and sadly the coupling of mariadb for me is stronger with the rest of backup infra, so I need some extra time for testing
[15:22:35] ack
[15:24:13] jynus: where did SubjectAltNameWarning come from?
[15:32:42] My guess is some python library change on urllib, but that's for cumin maintainers to figure out, supporting new python version/package version
[15:37:17] What were all those p4ge? I just saw them on my email
[15:41:49] jynus: commit fd0c475cc2c51aedb6c89d7b9be58d850966ee6a from 2020
[15:42:40] https://github.com/urllib3/urllib3/commit/fd0c475cc2c51aedb6c89d7b9be58d850966ee6a
[15:45:21] marostegui: "yeah, change applied to one rack went fine, but not the other" I think was the cause that you are looking for
[15:54:06] riiiiight
[15:54:15] Related to a3 maintenance then
[15:54:49] PROBLEM - MariaDB sustained replica lag on s6 on db2158 is CRITICAL: 21.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[15:55:01] PROBLEM - MariaDB sustained replica lag on s6 on db2169 is CRITICAL: 27.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104
[15:58:51] RECOVERY - MariaDB sustained replica lag on s6 on db2158 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[15:59:01] RECOVERY - MariaDB sustained replica lag on s6 on db2169 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104
[18:01:02] PROBLEM - MariaDB sustained replica lag on s6 on db2169 is CRITICAL: 13.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104
[18:04:02] RECOVERY - MariaDB sustained replica lag on s6 on db2169 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104
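
Note on the 14:13 to 14:27 retry discussion: the log describes up to 15 attempts, each waiting 3s longer than the previous one, and suggests passing the delay in rather than hardcoding timedelta(seconds=3). Below is a minimal Python sketch of that idea. It is not the cookbook's code: the names `wait_until`, `linear_waits` and `CheckFailed` are hypothetical, and it assumes the wait happens between attempts only.

```python
import time
from datetime import timedelta
from typing import Callable


class CheckFailed(Exception):
    """Raised when the check never passes (hypothetical name)."""


def linear_waits(tries: int, delay: timedelta) -> list[timedelta]:
    """Waits between attempts: delay, 2*delay, ... (tries - 1 waits in total)."""
    return [delay * n for n in range(1, tries)]


def wait_until(
    check: Callable[[], bool],
    tries: int = 15,
    delay: timedelta = timedelta(seconds=3),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() up to `tries` times; return the attempt number that passed.

    `delay` is a parameter rather than a hardcoded constant, so a caller with
    slow-recovering hosts (e.g. a replica catching up) can pass a larger value.
    """
    waits = linear_waits(tries, delay)
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        if attempt < tries:
            # Linear backoff: each wait is the previous wait plus `delay`.
            sleep(waits[attempt - 1].total_seconds())
    raise CheckFailed(f"check still failing after {tries} attempts")
```

Under these assumptions the defaults give a total budget of 3 * (1 + 2 + ... + 14) = 315 seconds, which illustrates why the window is tight for a replica that is still catching up; passing `delay=timedelta(seconds=30)` would stretch the same 15 attempts to 3150 seconds. The alternative raised at 14:27, catching the exception in the cookbook and calling the wait again, would correspond to wrapping `wait_until` in a `try`/`except CheckFailed`.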