[08:31:25] <_joe_> moritzm: do we have a ticket re: rsyslog upgrade?
[08:31:57] <_joe_> I think I found the problem, and it has to do with the currently installed version of the package AIUI
[08:37:07] let me find it
[08:37:29] https://phabricator.wikimedia.org/T219764
[08:47:40] just reproduced it on pybal-test2001, the prerm script is broken
[08:55:44] <_joe_> lol
[08:55:52] <_joe_> I just wrote as much on the ticket
[08:56:07] <_joe_> but I'm not sure how it's broken
[08:56:16] <_joe_> how did you reproduce btw?
[09:02:00] I'm currently trying to narrow it down: if I stop syslog.socket and rsyslog before upgrading rsyslog and rsyslog-gnutls, the upgrade works fine; if I upgrade both together it fails, but a subsequent second attempt to install rsyslog works fine. I think it's somehow caused by the order of upgrading the packages
[09:03:43] <_joe_> I was thinking the same
[09:39:47] added my findings to the Phab task
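A minimal sketch of the workaround described at 09:02, assuming a Debian host where both rsyslog and rsyslog-gnutls are pending upgrade: stop syslog.socket and rsyslog before letting apt touch the packages, then bring them back up afterwards. The apt invocation below is illustrative, not necessarily the exact command used on pybal-test2001.

```
# Stop socket activation first so rsyslog isn't restarted mid-upgrade,
# then the service itself, before touching the packages.
sudo systemctl stop syslog.socket rsyslog.service

# Upgrade both packages in one transaction; with the units already
# stopped the broken prerm script no longer aborts the upgrade.
sudo apt-get install --only-upgrade rsyslog rsyslog-gnutls

# Bring logging back up.
sudo systemctl start syslog.socket rsyslog.service
```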
[09:51:24] moritzm: re: ms-be2026 it's probably safer to depool it from the ring, not sure if it was already done
[09:53:12] probably, I haven't done anything apart from what I added to the Phab task
[09:55:30] * volans has never done it, reading wikitech
[12:03:29] from other tasks I see that god.og usually doesn't remove them for a normal broken disk/host, so holding for now
[12:16:42] ack
[13:10:17] moritzm: briefly spoke with go.dog, who suggested trying a reboot that sometimes might "fix" the controller, I'll do that
[13:17:05] sounds good!
[13:30:20] volans: I can take it out of the swift ring if necessary
[13:30:23] did the reboot help?
[13:30:50] cdanis: no need, it's rebooting now, I can access it so at least it's not all broken :D
[13:31:04] ok :)
[13:31:05] are the disks there?
[13:32:12] physically seems so, md is not happy though
[13:38:39] scratch that, all good
[13:40:09] turning it off and on again in 2019
[13:41:49] lol
[13:42:00] it was an april fool
[13:42:03] :-P
[13:42:41] hey volans, check_icinga just got upset about icinga2001
[13:42:47] taking a look
[13:43:22] cdanis: as usual
[13:43:24] broken config
[13:43:29] + sync
[13:43:47] aha
[13:43:55] well i'm glad it isn't something new
[13:44:33] see -ops
[13:44:34] I am going to ask a question I know I will regret asking
[13:44:39] eh I'll do it in #-ops ;)
[13:45:03] question here too, I just pinged the author of the commit there :D
[14:02:05] jbond42: wow, that was quite a lot of home files... IIRC we tend to keep them on the low number because puppet directory sync is quite under-optimized, but I might recall wrongly or newer puppet might have become better at that
[14:07:14] volans: yes it is, i will strip out most of the vim stuff as 95% is unused
[14:07:41] thanks :)
[14:55:41] we have ipmievd running on cumin2001, but not on cumin1001, doesn't seem puppetised, is that for some tests/automation work?
[14:58:33] not that I'm aware
[14:59:52] ENABLED="false"
[15:00:43] yeah, but there's a comment "ignored by systemd"
[15:00:57] which seems correct here :-)
[15:01:21] the /proc/$PID directory is from Sep 17 2018
[15:01:40] yeah, there's no SAL entry for that date
[15:01:56] last boot was Sep 12
[15:02:00] close but not the same
[15:02:48] we should just routinely reimage machines at random ;)
[15:03:54] wheel. of. reimage.
[15:04:03] in this case a reboot would have be enough :D
[15:04:09] *been
[15:04:15] I agree with that from the perspective of keeping up with OS upgrades too right?
[15:04:49] we're keeping up with OS upgrades independent of reboots as well :-)
[15:05:12] - i meant reimaging ahah but I thought of multiple issues with that idea tbh
[15:06:13] volans: I'd simply "systemctl stop" and then "systemctl disable" ipmievd, was probably some local experiment, if anyone complains we'll find out?
[15:06:20] +1
[15:07:04] done
[15:07:15] is it worth SALing that?
[15:07:42] already typing it up
[15:08:47] Are we doing the team meeting tomorrow?
[15:09:34] the pad is there as usual
[15:16:03] I think let's meet tomorrow?
[15:16:38] +1 for me
[15:23:08] sure, meeting sounds good
[15:23:47] +1
[15:24:00] +1
[17:05:04] <_joe_> hey people, TL;DR: we're running all the graphite checks (and I hope it's not the same for grafana / prometheus, but I'd check) from icinga via graphite.wikimedia.org
[17:05:45] <_joe_> given graphite is a piece of garbage, sometimes it doesn't send out caching headers, so varnish happily caches the result of the check url for... one day
[17:07:04] <_joe_> I've written some patches, but I'd like for someone else to continue looking at this
[17:07:14] <_joe_> well I'll write a task :)
[17:07:49] prometheus only has an internal LVS endpoint, no Varnish there
[17:07:53] thankfully
[17:08:25] <_joe_> ok, not sure about the few check_grafana checks we have
[17:08:37] <_joe_> but yeah, we were monitoring cached data ;P
[17:19:30] <_joe_> T219902 created
[17:28:08] I see cache-control: no-cache and x-cache-status: pass on https://grafana.wikimedia.org/api/alerts, which is the endpoint used by check_grafana
[17:28:32] will comment on task
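A quick way to confirm whether a check endpoint is being served from cache, along the lines of the grafana check at 17:28; the grafana URL is the one quoted above, while the graphite render query is only a hypothetical placeholder for whatever URL a given icinga check actually requests.

```
# GET the endpoint and dump only the response headers; expect
# "cache-control: no-cache" plus "x-cache-status: pass" when Varnish
# is not serving the check result from cache.
curl -s -D - -o /dev/null https://grafana.wikimedia.org/api/alerts \
  | grep -iE '^(cache-control|x-cache)'

# Same idea for a graphite-backed check; substitute the real render URL
# the check hits. A response with no Cache-Control header at all is the
# failure mode described above: Varnish will then happily cache it.
curl -s -D - -o /dev/null \
  'https://graphite.wikimedia.org/render?target=PLACEHOLDER&format=json' \
  | grep -iE '^(cache-control|x-cache)'
```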