[04:03:44] <_joe_> I sincerely doubt that thing will hit production before we retire mutante :P
[04:06:08] ahahah
[04:06:18] maybe I get to see it ;P
[04:36:00] <_joe_> vgutierrez: you will just never retire
[04:36:10] :_(
[04:36:11] <_joe_> pensions will be abolished by then
[04:36:30] thanks gen X
[04:36:37] <_joe_> because 60% of the population will be 70 or older
[04:36:41] <_joe_> wat?
[04:36:47] <_joe_> you should thank the boomers
[04:36:56] <_joe_> they screwed all of us
[04:37:03] <_joe_> I will be able to retire at 70
[06:56:41] <_joe_> uhm so, I am toying with the idea of writing this poolcounter-backed lock command to use from the cli
[06:57:14] <_joe_> the cli version would be something like "run-once-cluster COMMANDS"
[06:57:20] <_joe_> and it's easy to write
[06:57:36] <_joe_> but say I want people to be able to get a lock at the start of the execution of a script
[06:57:42] <_joe_> and release it when it ends
[06:57:48] <_joe_> s/script/shell/
[06:59:48] <_joe_> I guess I'd have to execute my command from the shell, fork to a child, detach it from the parent, and have shell termination kill it
[07:00:07] <_joe_> the point being poolcounter locks need a continuing tcp connection to be kept
[09:08:46] boomers?
[09:20:36] there was someone doing maintenance on puppetmasters, do you remember who?
[09:23:52] jynus: i had the puppet master disabled last night as i was doing some tests. i should have removed the disable last night but forgot
[09:24:20] _joe_: was trying to fix puppet-merge in the last 30 minutes and it seems that this https://gerrit.wikimedia.org/r/c/operations/puppet/+/544159 revert should do that
[09:24:21] nothing to do with that, I think there was someone upgrading/putting them down for some time
[09:24:36] or even reinstalling them
[09:25:10] let me elaborate: backups are failing there, and I would like someone to help me check why
[09:25:27] ahh yes morit.zm was working with papaul to upgrade the firmware
[09:25:43] thanks, I will ask him
[09:27:13] they were reinstalled in the last weeks, and then 2001 and 2002 were taken down for firmware updates this week (for about an hour each or so)
[09:27:39] https://phabricator.wikimedia.org/T235250
[09:28:19] 2002: https://phabricator.wikimedia.org/T235250#5575795
[09:28:24] 2001: https://phabricator.wikimedia.org/T235250#5580235
[09:28:59] so 1001 has been failing since 2019-10-11 04:41:22
[09:29:31] ok i think that is since the upgrade
[09:29:39] but that apparently is only on codfw
[09:29:45] this is affecting all of them, I think
[09:30:03] jbond42: do you know who did the upgrade?
[09:30:09] 1001 was re-installed on Oct 10, around 14h UTC
[09:30:16] that fits
[09:30:55] what is failing specifically, something hw related or some OS error message?
[09:31:14] I will try to pull logs now, for now it only says fatal error
[09:31:48] before looking, it is probably some permission, or key or path change
[09:32:00] there were some hw errors showing up on 2001/2002/1001 related to CPU microcode
[09:32:31] on 2001/2002 they were fixed by Papaul's update, 1001 isn't scheduled yet, I'll ping Chris/John for that early next week
[09:32:51] but it's more likely that this is caused by some change related to the OS update
[09:33:02] yeah
[09:33:18] I guess there were no issues for 2001/2002, right?
[09:33:24] there are
[09:33:59] puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-ssl
[09:34:09] puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-volatile
[09:34:53] are there other backup sets on 2001 which worked fine or are these the only two we apply there?
[09:35:11] so it started failing at one point only
[09:35:19] and then all failed after that
[09:35:41] I mention both data sets because if both fail it is normally the communication to the host
[09:35:56] rather than "a directory disappeared" for example
[09:36:19] were those buster? maybe there is some version mismatch with the client?
[09:36:31] I need to look at the logs first, sorry
[09:36:35] could be! the new puppetmasters are all buster
[09:37:02] probably most of the things we back up are stretch or lower
[09:41:03] we do have a few buster hosts with backups, e.g. krb1001
[09:41:13] oh :-)
[09:41:26] https://phabricator.wikimedia.org/T234900#5584711
[09:41:34] let me see if you see a pattern :-D
[09:41:45] per "cumin C:profile::backup::host 'cat /etc/debian_version'" actually 13 already
[09:42:04] krb2001 fails too
[09:42:14] and so does krb1001
[09:45:01] and so does netbox1001 (buster)
[09:45:30] and netboxdb and grafana1002 (per https://phabricator.wikimedia.org/T234900#5584711)
[09:45:36] he
[09:45:53] good thing I am implementing monitoring
[09:46:22] :-)
[09:51:03] strange, the log says the backup was ok
[09:51:46] oh, I think I am looking at a previous year's backup
[09:52:18] Don't like the "04-Nov 02:08" format
[09:53:21] yyyy-MM-dd FTW :)
[09:54:01] also if you elect me president, I promise we will have more frequently rotated logs!
[09:54:57] Termination: *** Backup Error ***
[09:55:08] 1-Oct 04:41 helium.eqiad.wmnet-fd JobId 156613: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.192.0.27:9103. Terminating connection.
[09:55:28] 11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:388 Wrote 10 bytes to Storage daemon:helium.eqiad.wmnet:9103, but only 0 accepted.
[09:56:24] I am going to create a task
[09:57:51] moritzm: I will need your support to convince people that alerting on failed backups is a good thing
[09:58:23] I'm pretty sure everyone will see value in this :-)
[09:59:05] I hope so
[10:31:28] moritzm: so according to upstream: https://phabricator.wikimedia.org/T235838#5586233 bacula should be one of the first hosts to be upgraded to a new debian version
[10:33:21] that's unimpressive for a solution which calls itself an enterprise backup system...
[10:34:05] also, while it makes sense to be backwards compatible for a long time for clients
[10:34:26] it is a bit scary because you have to upgrade your backup system before all other hosts
[10:34:53] well, at least for stretch it worked fine, with heze/helium still on jessie
[10:35:06] yeah
[18:14:26] i will add mediawiki/PHP classes on wtp* (parsoid) servers to turn them into appserver-style servers for parsoid/PHP (as opposed to parsoid/JS). that will add new Icinga checks and some alerts on -operations which i will try to catch early to reduce the noise. it does not mean these are pooled yet or the existing parsoid service is removed
[18:15:02] just a heads-up; if it's on wtp* it should be fine and recover soon after
[18:16:38] you know how it goes, if the puppet roles add the new check, you can't downtime it before it exists
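
The 06:56-07:00 discussion above sketches a "run-once-cluster" wrapper: acquire a PoolCounter lock over a TCP connection that stays open while a command runs, and release it when the command exits. Below is a minimal, hypothetical Python sketch of that idea, not the tool _joe_ ended up writing. It assumes the PoolCounter text protocol (ACQ4ME/RELEASE requests with LOCKED/RELEASED replies) on its default port 7531; the server name, lock key, and concurrency limits are placeholders, not values taken from the log.

#!/usr/bin/env python3
"""Sketch of a "run-once-cluster" wrapper: hold a PoolCounter lock for the
duration of a command, so only one host in the cluster runs it at a time.

Assumptions (not from the log): PoolCounter's ACQ4ME/RELEASE text protocol
on port 7531; server name and lock key are placeholders.
"""
import socket
import subprocess
import sys

POOLCOUNTER = ("poolcounter.example.wmnet", 7531)  # placeholder server
KEY = "run-once-cluster"                           # placeholder lock key
WORKERS, MAXQUEUE, TIMEOUT = 1, 1, 30              # single holder, short queue wait


def main(argv):
    if not argv:
        print("usage: run-once-cluster COMMAND [ARGS...]", file=sys.stderr)
        return 2
    # The TCP connection must stay open for as long as we hold the lock:
    # per the log, PoolCounter only keeps a lock while the connection is up.
    with socket.create_connection(POOLCOUNTER, timeout=TIMEOUT + 5) as sock:
        conn = sock.makefile("rw", encoding="ascii", newline="\n")
        conn.write(f"ACQ4ME {KEY} {WORKERS} {MAXQUEUE} {TIMEOUT}\n")
        conn.flush()
        answer = conn.readline().strip()
        if answer != "LOCKED":
            # e.g. TIMEOUT or QUEUE_FULL: another host is already running it.
            print(f"not running, poolcounter said: {answer}", file=sys.stderr)
            return 1
        try:
            # Run the wrapped command while the lock (and connection) is held.
            return subprocess.call(argv)
        finally:
            conn.write(f"RELEASE {KEY}\n")
            conn.flush()
            conn.readline()  # expect RELEASED; closing the socket drops it anyway


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))

For the interactive-shell case _joe_ mentions, the same holder could be started as a detached child process whose connection stays open until the shell exits; since the lock only lives as long as the TCP connection, killing that child on shell termination is what releases it.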