[04:03:44] <_joe_> I sincerely doubt that thing will hit production before we retire mutante :P
[04:06:08] ahahah
[04:06:18] maybe I get to see it ;P
[04:36:00] <_joe_> vgutierrez: you will just never retire
[04:36:10] :_(
[04:36:11] <_joe_> pensions will be abolished by then
[04:36:30] thanks gen X
[04:36:37] <_joe_> because 60% of the population will be 70 or older
[04:36:41] <_joe_> wat?
[04:36:47] <_joe_> you should thank the boomers
[04:36:56] <_joe_> they screwed all of us
[04:37:03] <_joe_> I will be able to retire at 70
[06:56:41] <_joe_> uhm so, I am toying with the idea of writing this poolcounter-backed lock command to use from the cli
[06:57:14] <_joe_> the cli version would be something like "run-once-cluster COMMANDS"
[06:57:20] <_joe_> and it's easy to write
[06:57:36] <_joe_> but say I want people to be able to get a lock at the start of the execution of a script
[06:57:42] <_joe_> and release it when it ends
[06:57:48] <_joe_> s/script/shell/
[06:59:48] <_joe_> I guess I'd have to execute my command from the shell, fork to a child, detach it from the parent, and have shell termination kill it
[07:00:07] <_joe_> the point being poolcounter locks need a continuing tcp connection to be kept
[09:08:46] boomers?
[09:20:36] there was someone doing maintenance on puppetmasters, do you remember who?
[09:23:52] jynus: i had the puppet master disabled last night as i was doing some tests. i should have removed the disable last night but forgot
[09:24:20] _joe_: was trying to fix puppet-merge in the last 30 minutes and it seems that this https://gerrit.wikimedia.org/r/c/operations/puppet/+/544159 revert should do that
[09:24:21] nothing to do with that, I think there was someone upgrading/putting them down for some time
[09:24:36] or even reinstalling them
[09:25:10] let me elaborate: backups are failing there, and I would like someone to help me check why
[09:25:27] ahh yes morit.zm was working with papaul to upgrade the firmware
[09:25:43] thanks, I will ask him
[09:27:13] they were reinstalled in the last weeks, and then 2001 and 2002 were taken down for firmware updates this week (for about an hour each or so)
[09:27:39] https://phabricator.wikimedia.org/T235250
[09:28:19] 2002: https://phabricator.wikimedia.org/T235250#5575795
[09:28:24] 2001: https://phabricator.wikimedia.org/T235250#5580235
[09:28:59] so 1001 has been failing since 2019-10-11 04:41:22
[09:29:31] ok i think that is since the upgrade
[09:29:39] but that apparently is only on codfw
[09:29:45] this is affecting all of them, I think
[09:30:03] jbond42: do you know who did the upgrade?
[09:30:09] 1001 was re-installed on Oct 10, around 14h UTC
[09:30:16] that fits
[09:30:55] what is failing specifically, something hw related or some OS error message?
[09:31:14] I will try to pull logs now, for now it only says fatal error
[09:31:48] before looking, it is probably some permission, or key or path change
[09:32:00] there were some hw errors showing up on 2001/2002/1001 related to CPU microcode
[09:32:31] on 2001/2002 they were fixed by Papaul's update, 1001 isn't scheduled yet, I'll ping Chris/John for that early next week
[09:32:51] but it's more likely that this is caused by some change related to the OS update
[09:33:02] yeah
[09:33:18] I guess there were no issues for 2001/2002, right?
[09:33:24] there are
[09:33:59] puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-ssl
[09:34:09] puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-volatile
[09:34:53] are there other backup sets on 2001 which worked fine or are these the only two we apply there?
[09:35:11] so it started failing at one point only
[09:35:19] and then all failed after that
[09:35:41] I mention both data sets because if both fail it is normally the communication to the host
[09:35:56] rather than "a directory disappeared" for example
[09:36:19] were those buster? maybe there is some version mismatch with the client?
[09:36:31] I need to look at the logs first, sorry
[09:36:35] could be! the new puppetmasters are all buster
[09:37:02] probably most of the things we back up are stretch or lower
[09:41:03] we do have a few buster hosts with backups, e.g. krb1001
[09:41:13] oh :-)
[09:41:26] https://phabricator.wikimedia.org/T234900#5584711
[09:41:34] let me see if you see a pattern :-D
[09:41:45] per "cumin C:profile::backup::host 'cat /etc/debian_version'" actually 13 already
[09:42:04] krb2001 fails too
[09:42:14] and so does krb1001
[09:45:01] and so does netbox1001 (buster)
[09:45:30] and netboxdb and grafana1002 (per https://phabricator.wikimedia.org/T234900#5584711)
[09:45:36] he
[09:45:53] good thing I am implementing monitoring
[09:46:22] :-)
[09:51:03] strange, the log says the backup was ok
[09:51:46] oh, I think I am looking at a previous year's backup
[09:52:18] Don't like the "04-Nov 02:08" format
[09:53:21] yyyy-MM-dd FTW :)
[09:54:01] also if you elect me president, I promise we will have more frequently rotated logs!
[09:54:57] Termination: *** Backup Error ***
[09:55:08] 1-Oct 04:41 helium.eqiad.wmnet-fd JobId 156613: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.192.0.27:9103. Terminating connection.
[09:55:28] 11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:388 Wrote 10 bytes to Storage daemon:helium.eqiad.wmnet:9103, but only 0 accepted.
[09:56:24] I am going to create a task
[09:57:51] moritzm: I will need your support to convince people that alerting on failed backups is a good thing
[09:58:23] I'm pretty sure everyone will see value in this :-)
[09:59:05] I hope so
[10:31:28] moritzm: so according to upstream: https://phabricator.wikimedia.org/T235838#5586233 bacula should be one of the first hosts to be upgraded to a new debian version
[10:33:21] that's unimpressive for a solution which calls itself an enterprise backup system...
[10:34:05] also, while it makes sense to be backwards compatible for a long time for clients
[10:34:26] it is a bit scary because you have to upgrade your backup system before all other hosts
[10:34:53] well, at least for stretch it worked fine, with heze/helium still on jessie
[10:35:06] yeah
[18:14:26] i will add mediawiki/PHP classes on wtp* (parsoid) servers to turn them into appserver-style servers for parsoid/PHP (as opposed to parsoid/JS). that will add new Icinga checks and some alerts on -operations which i will try to catch early to reduce the noise. it does not mean these are pooled yet or the existing parsoid service is removed
[18:15:02] just a heads-up; if it's on wtp* it should be fine and recover soon after
[18:16:38] you know how it goes, if the puppet roles add the new check, you can't downtime it before it exists
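
The 06:56-07:00 discussion above sketches a "run-once-cluster" wrapper: acquire a PoolCounter lock over a TCP connection that stays open while a command runs, and release it when the command exits. Below is a minimal, hypothetical Python sketch of that idea, not the tool _joe_ ended up writing. It assumes the PoolCounter text protocol (ACQ4ME/RELEASE requests with LOCKED/RELEASED replies) on its default port 7531; the server name, lock key, and concurrency limits are placeholders, not values taken from the log.

#!/usr/bin/env python3
"""Sketch of a "run-once-cluster" wrapper: hold a PoolCounter lock for the
duration of a command, so only one host in the cluster runs it at a time.

Assumptions (not from the log): PoolCounter's ACQ4ME/RELEASE text protocol
on port 7531; server name and lock key are placeholders.
"""
import socket
import subprocess
import sys

POOLCOUNTER = ("poolcounter.example.wmnet", 7531)  # placeholder server
KEY = "run-once-cluster"                           # placeholder lock key
WORKERS, MAXQUEUE, TIMEOUT = 1, 1, 30              # single holder, short queue wait


def main(argv):
    if not argv:
        print("usage: run-once-cluster COMMAND [ARGS...]", file=sys.stderr)
        return 2
    # The TCP connection must stay open for as long as we hold the lock:
    # per the log, PoolCounter only keeps a lock while the connection is up.
    with socket.create_connection(POOLCOUNTER, timeout=TIMEOUT + 5) as sock:
        conn = sock.makefile("rw", encoding="ascii", newline="\n")
        conn.write(f"ACQ4ME {KEY} {WORKERS} {MAXQUEUE} {TIMEOUT}\n")
        conn.flush()
        answer = conn.readline().strip()
        if answer != "LOCKED":
            # e.g. TIMEOUT or QUEUE_FULL: another host is already running it.
            print(f"not running, poolcounter said: {answer}", file=sys.stderr)
            return 1
        try:
            # Run the wrapped command while the lock (and connection) is held.
            return subprocess.call(argv)
        finally:
            conn.write(f"RELEASE {KEY}\n")
            conn.flush()
            conn.readline()  # expect RELEASED; closing the socket drops it anyway


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))

For the interactive-shell case _joe_ mentions, the same holder could be started as a detached child process whose connection stays open until the shell exits; since the lock only lives as long as the TCP connection, killing that child on shell termination is what releases it.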