[10:10:19] I have a riddle for you- a backup was alerting because the latest full backup was 0 bytes, but it was a false positive, can you guess why?
[10:14:46] because it was freshly reimaged and there was no data to back up yet?
[10:15:25] probably that- the full backup ran with 0 bytes correctly, but it was considered a failure
[10:15:53] because backing up 0 bytes is considered worth investigating (probably data lost or wrong path configured)
[10:16:02] and then the next incremental added all data
[10:16:54] some incremental :-D
[10:19:43] (incrementals don't check 0 size because it is common that data doesn't change from day to day)
[13:25:43] apergos: thanks for that services: endgame doc link
[13:25:55] sure
[13:26:19] again I don't know how ready for open comments it is yet
[13:30:39] Totally, just good to see the thinking that's happening
[13:37:27] 👍
[18:43:36] who should I direct the "Memory correctable errors -EDAC-" errors to? DCops? Service owner?
[18:55:24] I guess the service owner should be tagged but dc ops should get it in their queue since they'll likely have to deal with diagnostics, coordinating any downtime, talking to the vendor, etc.
[19:01:41] XioNoX: I would suggest service owner, as sometimes if they are correctable they just get fixed in a few days, and if not tag DC-Ops (I always try to reduce the noise for them as much as I can)
[19:02:00] ok!
[19:04:07] i think our threshold for that is too low fwiw
[19:05:28] also IMO there are long-open questions about whether it's the service owner, or dcops, or infra foundations (or some other SRE team) mediating between service owner / dcops somehow, that i think we'll need to answer eventually
[19:06:44] I opened https://phabricator.wikimedia.org/T238018 for now
[19:07:12] it's also on my list from https://phabricator.wikimedia.org/T225140
[19:07:35] I like that list
[20:13:47] <_joe_> cdanis: is it really a question? Like a memory bank is broken and someone other than dcops should be running point?
[20:14:28] <_joe_> I can imagine for some critical servers the service owners might need to be involved
[20:14:53] <_joe_> maybe I'm missing a lot of context
[20:15:04] if we had automated depooling or some other similar agreement I'd totally agree
[20:15:17] but for many services it is not as simple as it should be
[20:15:55] <_joe_> anything that's load-balanced is ok, really
[20:16:07] <_joe_> even without depooling (that is just one command)
[20:16:37] <_joe_> but what I was saying is - I think dcops and the service owners don't need a mediator :)
[20:17:03] <_joe_> and in most cases, dcops shouldn't even need the service owners
[20:20:33] if dcops has enough rights (sudo, etc) to do all the required troubleshooting, then I agree
[20:25:33] <_joe_> do they need sudo to verify a ram bank?
[20:26:06] <_joe_> XioNoX: my point is that in theory you should be able to turn off a server and not worry about depooling, even
[20:26:31] <_joe_> anyways, meal time on the plane, we will continue this tomorrow in case :)
[20:27:17] I didn't dig but there are some sudo commands on https://wikitech.wikimedia.org/wiki/Monitoring/Memory#Memory_correctable_errors_-EDAC-
[20:30:37] i think someone was looking at PXE boot menus with stuff like memtest?
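
Note on the backup riddle at the top of the log (10:10 to 10:19): the check described there can be sketched roughly as below. This is only an illustration of the rule as stated in the chat (a 0-byte full backup is flagged, incrementals are not size-checked). The `Backup` class and `needs_investigation` function are hypothetical names, not the actual monitoring code.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    # Hypothetical record; field names are assumptions for illustration.
    kind: str        # "full" or "incremental"
    size_bytes: int


def needs_investigation(backup: Backup) -> bool:
    """Return True when the backup's size alone should raise an alert."""
    if backup.kind == "full":
        # A 0-byte full backup usually points to lost data or a wrong
        # path, although a freshly reimaged host can produce one
        # legitimately (the false positive discussed in the log).
        return backup.size_bytes == 0
    # Incrementals are not size-checked: it is common for data not to
    # change from day to day.
    return False
```

With this rule, a freshly reimaged host alerts on its empty full backup, while the following incremental (which carries all the data) passes without any size check.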