[00:30:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:02:16] <marostegui>	 s7 eqiad switched
[06:03:34] <marostegui>	 federico3: I will let you know when you can apply your schema change to db1181, let me reimage first.
[06:04:15] <federico3>	 @marostegui thanks
[06:06:20] <federico3>	 db1181 also needs a kernel update, db1236 was updated yesterday
[06:40:55] <marostegui>	 federico3: db1181 is being reimaged :)
[06:41:38] <federico3>	 ok!
[07:13:40] <marostegui>	 federico3: db1181 is ready for you
[07:13:46] <marostegui>	 it is pooled
[07:13:48] <federico3>	 ok thanks
[07:27:05] <Emperor>	 The rclone errors are because we lost a race with the rename in https://commons.wikimedia.org/w/index.php?title=File:Hepatitis_death_rate,_1980_to_2021,_GBR.svg&action=history
[07:28:56] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:28] <jynus>	 m2 backups are now taking almost 8 hours :-(
[09:02:22] <jynus>	 compared to s4's 1h30m
[09:03:34] <Emperor>	 Time to Delete All The Things :)
[09:04:33] <marostegui>	 jynus: I guess vrts :(
[09:04:59] <jynus>	 I can check
[09:11:32] <jynus>	 marostegui: 100% right https://phabricator.wikimedia.org/P93507
[09:11:45] <marostegui>	 :_(
[09:15:57] <cezmunsta>	 federico3: this ticket seems to have been done? https://phabricator.wikimedia.org/T426095 
[09:16:54] <federico3>	 yes I'll close it
[09:17:04] <cezmunsta>	 +1
[11:03:50] <cezmunsta>	 Starting s8 codfw switchover
[11:05:33] <marostegui>	 🤞
[11:19:42] <cezmunsta>	 s8 codfw done
[11:22:06] <marostegui>	 thanks!
[12:19:08] <cezmunsta>	 '(A:db-section-s8 or A:db-section-x3) and (A:db-sanitarium or A:db-clouddb-sanitization)' - this doesn't find clouddb10[22,23], the alias doesn't define them - is that an oversight, or expected?
[12:20:49] <jynus>	 cezmunsta: I think that is by design, dbs and clouddbs have different owners
[12:21:09] <jynus>	 as well as different access levels, vlan, etc.
[12:21:34] <marostegui>	 db-clouddb-sanitization: P{an-redacteddb1001.eqiad.wmnet or clouddb1016.eqiad.wmnet or clouddb1020.eqiad.wmnet}
[12:21:36] <jynus>	 it is true it is confusing having 2 ways to classify db hosts
[12:21:42] <marostegui>	 I am not sure why the others aren't there
[12:21:45] <jynus>	 by section/content/replica set
[12:21:50] <marostegui>	 dhinus: ^
[12:22:03] <marostegui>	 any idea why? 
[12:22:04] <jynus>	 and by function
[12:22:21] <jynus>	 ah, I know why
[12:22:37] <taavi>	 it's a holiday in italy today, d.hinus is not around
[12:22:42] <jynus>	 you are using a host tool, but section is really an instance classification
[12:23:06] <jynus>	 clouddb is not just s8, it is probably s8 + other stuff
[12:23:21] <marostegui>	 taavi: ah thanks!
[12:23:53] <jynus>	 so cumin is not great for instance orchestration, and that is why I set out to create the zarcillo db, which has host and instance planes
[12:24:30] <marostegui>	 jynus: I think you may be confusing things. The cumin alias should return all clouddb, but in that alias that I pasted only 2 are there, and I am not sure why only 2, let's check with wmcs team
[12:24:44] <jynus>	 ah, yes, I am getting confused
[12:25:08] <jynus>	 so my problem is real but unrelated
[12:25:21] <marostegui>	 jynus: I think cezmunsta means only the cumin aliases
[12:25:43] <marostegui>	 And the one I pasted above only has those 2, but I don't know why, those hosts aren't different from any other 
[12:25:46] <cezmunsta>	 yep 
[12:25:47] <marostegui>	 So let's wait for dhinu.s tomorrow
[12:25:51] <cezmunsta>	 The alias was set/last changed in 2024, perhaps the missing ones are newer than that?
[12:26:55] <jynus>	 apologies for consusing you
[12:27:00] <jynus>	 *confusing
[12:27:27] <cezmunsta>	 jynus: no worries :)
[12:28:45] <cezmunsta>	 > perhaps the missing ones are newer than that
[12:29:02] <cezmunsta>	 ^ is true
[12:29:04] <jynus>	 and yesterday I remembered another issue why I stopped automating dump loading
[12:29:24] <jynus>	 and I try to phish any new hire into fixing this
[12:29:34] <marostegui>	 cezmunsta: Those hosts are older than 2024, so they should be there unless there's a reason for it that we don't know - let's see what wmcs team says
[12:30:17] <jynus>	 I was blocked on grant handling being fully automated
[12:30:44] <cezmunsta>	 marostegui: ack .. maybe just the hieradata has partial history then
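A minimal sketch of why the query above cannot return clouddb10[22,23]: a Cumin alias query is set algebra over alias members, so a host absent from both right-hand aliases drops out of the `and`. Only the `db-clouddb-sanitization` members come from the paste above (shortened to hostnames); the membership of the other aliases is assumed purely for illustration.

```python
# Alias contents. Only "db-clouddb-sanitization" reflects the paste in the log;
# every other membership below is a hypothetical placeholder.
aliases = {
    "db-section-s8": {"clouddb1016", "clouddb1022", "clouddb1023"},  # assumed
    "db-section-x3": set(),                                          # assumed
    "db-sanitarium": set(),                                          # assumed
    "db-clouddb-sanitization": {"an-redacteddb1001", "clouddb1016", "clouddb1020"},
}


def run_query(a):
    """(A:db-section-s8 or A:db-section-x3) and (A:db-sanitarium or A:db-clouddb-sanitization)"""
    by_section = a["db-section-s8"] | a["db-section-x3"]
    by_function = a["db-sanitarium"] | a["db-clouddb-sanitization"]
    return by_section & by_function


matched = run_query(aliases)
missing = {"clouddb1022", "clouddb1023"} - matched
print(sorted(matched), sorted(missing))
```

Under these assumptions the two hosts are excluded by the function-side alias alone, whatever their section membership is.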
[12:30:55] <cezmunsta>	 jynus: skip-grant-tables ?
[12:31:12] <jynus>	 cezmunsta: what do you mean?
[12:31:55] <cezmunsta>	 Aren't the grants in the dump?
[12:32:10] <jynus>	 nope, they are in puppet
[12:32:23] <jynus>	 but there is no automatic mechanism to load them atm
[12:32:30] <jynus>	 they are on the snapshots, though
[12:33:41] <jynus>	 they shouldn't be on puppet, however, they should be on their own orchestration for easy password change, etc. (some of that is automated, but not fully integrated)
[12:33:58] <jynus>	 I think there is a ticket for that, I don't want to look at that
[12:34:09] <jynus>	 but just explaining the questions you had for me
[12:34:37] <jynus>	 (I mean, I would love if you looked at that, I know you cannot do that atm, you dbas have larger priorities)
[12:35:35] <cezmunsta>	 I guess that https://phabricator.wikimedia.org/T427884 will shed a little more light on this :)
[12:35:59] <jynus>	 let me find the master ticket
[12:37:04] <jynus>	 this is the epic, some of those are fixed: https://phabricator.wikimedia.org/T146149
[12:37:34] <cezmunsta>	 ack
[12:37:36] <jynus>	 that was some of the first audits I did and the dream for me
[12:38:05] <jynus>	 to be able to close, but again, if you look at the year, that's too large for a small project
[12:39:21] <jynus>	 and I wouldn't discard some passwordless authentication system to handle that
[12:40:10] <jynus>	 things looked bad in 2016, now there is automation, monitoring, etc.
[12:40:20] <jynus>	 but we could do better
[12:43:03] <jynus>	 marostegui: what would you think about resolving T199061, I think the original scope is fulfilled and we should open one for an improvement
[12:43:27] <marostegui>	 yeah, that's done
[12:43:30] <jynus>	 (not now)
[12:43:53] <jynus>	 resolve it now, I mean, improve at a later time
[12:44:39] <marostegui>	 https://phabricator.wikimedia.org/T199061#4600803 I guess almost 8 years of time running it is enough to see it worked fine XD
[12:44:45] <jynus>	 yeah
[12:45:09] <jynus>	 so it is not as bad as 2016, cezmunsta, but it could be a long term project (not for me to say)
[12:46:22] <jynus>	 my pet peeves would be that and the processlist monitoring, as big blockers of other projects or debugging
[12:47:25] <cezmunsta>	 https://www.percona.com/blog/using-vault-mysql/ - no need to create the users, just the management user ... also from 2016 :)
[12:47:58] <jynus>	 <3
[12:48:29] <jynus>	 I am not too worried about the stack, that's the easy part tbh
[12:48:49] <jynus>	 it is more about integrating it, customizing it to our needs, etc.
[12:49:13] <jynus>	 but I would love for something like that to happen
[12:50:11] <jynus>	 as for recoveries I was like: "why should work on fully automation if we still need to copy and paste to setup a new host"
[12:50:17] <jynus>	 *I work
[12:51:02] <jynus>	 What I am proud of is that, despite it not being automated, grants are documented offline on puppet
[12:52:21] <jynus>	 I think that will pay off when that happens, but now it is very hard and cumbersome to work with flat text files, what could be a relational db or vault or something like that
[12:54:16] <jynus>	 another thing I learned is that code doesn't matter in the long term, architecture does. Code comes and goes, but architecture stays forever.
[14:07:27] <cezmunsta>	 Icinga "MariaDB sustained replica lag on <section>" caused another incomplete cookbook run when reimaging 
[14:08:58] <jynus>	 incomplete as in "need to reimage again" or just a non pass because some things were red?
[14:09:12] <jynus>	 if the second, you can ignore those
[14:09:46] <cezmunsta>	 Whilst it waits for up to 15 iterations of the "all clear" check before repooling
[14:12:08] <jynus>	 yeah, it is ok to be red while it recovers, as long as it gets eventually green
[14:13:44] <cezmunsta>	 The total number of attempts = 15 and each one is previous_wait + 3s... so that is really tight on the "all green" from Icinga 
[14:15:20] <jynus>	 yeah, those are thought for stateless hosts
[14:15:34] <jynus>	 imagine, when provisioning a host it takes way more than that
[14:16:43] <jynus>	 we tried having a few meetings with infra so they can better cater to our needs
[14:25:59] <cezmunsta>	 It would seem trivial to be able to pass in a delay instead of hardcoding timedelta(seconds=3) for the retry
[14:27:25] <cezmunsta>	 We could equally catch those exceptions when suitable and retry the wait in the cookbook
[14:28:21] <cezmunsta>	 It is the second time that it actually broke on me and there were a number of last-iteration passes that have occurred, so nearly more
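A rough sketch of the suggestion above, assuming the wait grows by a fixed step before each new attempt (previous_wait + 3s, 15 attempts, as described in the log). The function names and the `sleep` parameter are hypothetical and are not the cookbook's or the library's actual API; the only point is that the step is passed in as an argument, not hardcoded as `timedelta(seconds=3)`.

```python
import time
from datetime import timedelta
from typing import Callable, List


def linear_backoff_schedule(tries: int = 15,
                            delay: timedelta = timedelta(seconds=3)) -> List[timedelta]:
    """Waits between attempts, each one being the previous wait plus `delay`."""
    # With `tries` attempts there are tries - 1 waits: delay, 2*delay, 3*delay, ...
    return [delay * n for n in range(1, tries)]


def wait_until(check: Callable[[], bool],
               tries: int = 15,
               delay: timedelta = timedelta(seconds=3),
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `check` up to `tries` times; return True as soon as it passes."""
    for wait in [*linear_backoff_schedule(tries, delay), None]:
        if check():
            return True
        if wait is not None:  # no sleep after the final attempt
            sleep(wait.total_seconds())
    return False


# Total time budget under the assumed defaults versus a larger, caller-supplied step.
default_budget = sum(linear_backoff_schedule(), timedelta())
longer_budget = sum(linear_backoff_schedule(delay=timedelta(seconds=30)), timedelta())
print(default_budget, longer_budget)
```

Under these assumptions the default budget is a little over five minutes, which fits the observation that replica lag after a reimage can outlast it; a larger step passed by the caller stretches the budget without changing the number of attempts.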
[14:32:25] <Emperor>	 my swift-in-pontoon cluster can now successfully run our rewrite_integration_test file :)
[14:34:24] <cezmunsta>	 *\o/*
[14:37:15] <Emperor>	 (and contains 3 originals and one thumbnail, as well as the dispersion-check guff)
[15:07:16] <jynus>	 moritzm: I am happy that I did the homework early for trixie (all my packages seem to be ready): https://phabricator.wikimedia.org/T427897#11976933
[15:11:50] <jynus>	 ^ this are the typical issues I find when upgrading to trixie cezmunsta (my package is most likely ok, but dependencies were not fully ready)
[15:11:59] <jynus>	 *these
[15:12:36] <jynus>	 and sadly the coupling of mariadb for me is stronger with the rest of backup infra, so I need some extra time for testing
[15:22:35] <cezmunsta>	 ack 
[15:24:13] <cezmunsta>	 jynus: where did SubjectAltNameWarning come from?
[15:32:42] <jynus>	 My guess is some python library change on urllib, but that's for cumin maintainers to figure out, supporting new python version/package version
[15:37:17] <marostegui>	 What were all those p4ge? I just saw them on my email
[15:41:49] <cezmunsta>	 jynus: commit fd0c475cc2c51aedb6c89d7b9be58d850966ee6a from 2020
[15:42:40] <cezmunsta>	 https://github.com/urllib3/urllib3/commit/fd0c475cc2c51aedb6c89d7b9be58d850966ee6a
[15:45:21] <cezmunsta>	 marostegui: "yeah, change applied to one rack went fine, but not the other" I think was the cause that you are looking for
[15:54:06] <marostegui>	 riiiiight
[15:54:15] <marostegui>	 Related to a3 maintenance then
[15:54:49] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s6 on db2158 is CRITICAL: 21.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[15:55:01] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s6 on db2169 is CRITICAL: 27.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104
[15:58:51] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s6 on db2158 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2158&var-port=9104
[15:59:01] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s6 on db2169 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104
[18:01:02] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s6 on db2169 is CRITICAL: 13.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104
[18:04:02] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s6 on db2169 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=9104