[08:05:09] XioNoX: o/ - there is a BFD down event for the Telia link between cr1 eqiad/codfw, I tried to check but I don't see maintenance and I noticed an event happened ~18 hours ago (notified by Telia to us)
[08:05:42] no idea though if there is a problem with the link or not (I checked lights but from my super ignorant pov I didn't see much either)
[08:10:29] elukey: weird, pings go through
[08:11:56] and ospf is up too
[08:12:30] maybe we could test if slapping bfd resolves it (could be a temporary weird state of bfd only?)
[08:12:31] elukey: cr1-eqiad> clear bfd session address 208.80.153.221
[08:12:38] exactly :)
[08:12:42] yup :)
[08:14:04] there you go, solved
[08:14:09] not great though
[08:14:11] CRITICAL: the following (14) node(s) change every puppet run
[08:16:47] some of those are "WIP" nodes that are downtimed IIRC, and the check doesn't take that into account (there is a task about excluding downtimed nodes from the list)
[08:17:44] https://phabricator.wikimedia.org/T268211
[08:17:48] ah there is a patch :)
[08:18:02] yeah I know :)
[09:06:59] akosiaris: thanks for the reminder, I will keep an eye on backup1001 to see if the refresh is enough, otherwise I will restart the daemon when appropriate
[09:07:07] or any other issue
[09:07:57] (lots of running jobs, it being the 1st)
[09:24:17] yw
[09:26:40] I got a warning that an-master1002, bast3004, bast4002, bast5001 were removed from backups
[09:26:52] if someone thinks that is an error please ping me
[09:32:51] those bast hosts are still in service fwiw
[09:42:40] the backup was probably for some prometheus data and those are now pure bastions
[09:44:58] cool, that would explain it, thanks moritzm
[09:45:05] it doesn't hurt to double check :-)
[09:45:25] nobody worries about backups until it is too late :-D
[09:46:34] elukey: similar case for an-master1002? is it being decommed, or has its profile changed so it no longer needs backups?
[09:46:36] currently digging a little deeper, these still use profile::backup::host, but maybe there are simply no backup::sets defined
[09:47:04] yeah, that is ok - it means the backup infra is installed, but nothing would be scheduled
[09:48:20] jynus: thanks for the ping, so on an-master1002 we were backing up an-coord1001's db but that is not needed anymore, and there was also another backup IIRC for the hdfs namenode state
[09:48:24] going to double check
[09:48:39] let me see, I only see one job being removed
[09:48:41] there is actually some kind of error; bastion hosts have a backup::set for /home
[09:49:38] moritzm: that is correct, home backups are still there
[09:50:03] one per bast host, on 1002, 2002, 3004, 4002 and 5001
[09:50:15] so only one job was removed, the others are still there
[09:50:32] nice :)
[09:50:46] for an-master1002, only hadoop-namenode-backup is left
[09:50:56] ah, ok. does it show which job name vanished? otherwise I'll poke at git history to find the likely culprit
[09:51:03] moritzm: yes
[09:51:26] Monthly-1st-Mon-production-srv-tftpboot for bastion
[09:51:37] analytics-meta-mysql-lvm-backup for tracer
[09:51:43] I mean, an-master1002
[09:51:59] ^ moritzm, elukey all ok?
[09:52:07] +1
[09:52:11] cool
[09:52:29] jynus: gotcha, that makes sense and everything is in order, then. these are now served from install3001, install4001 and install5001
[09:52:30] thanks a lot for following up, next time I'll alert you (didn't think about it)
[09:52:48] no need for alerts, saw lots of things removed and wanted to double check
[09:52:59] you can self-serve check what is being backed up at:
[09:53:16] https://grafana.wikimedia.org/d/413r2vbWk/bacula
[09:53:47] (if you don't want to fight the bacula CLI :-))
[09:58:41] also because bacula was quite spammy about it
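Since the backup::set declarations are what drive which jobs get scheduled, one way to see which hosts still declare a given set is a PuppetDB resource query of the same kind used later in this log. A minimal sketch, assuming cumin's documented Python API and assuming backup::set shows up in PuppetDB as a Backup::Set resource; the set title "srv-tftpboot" is only a guess derived from the job name mentioned above.

```python
import cumin
from cumin import query

# Load the default cumin configuration (typically /etc/cumin/config.yaml).
config = cumin.Config()

# Hypothetical set title, guessed from "Monthly-1st-Mon-production-srv-tftpboot".
hosts = query.Query(config).execute('R:Backup::Set = "srv-tftpboot"')
print(hosts)  # folded NodeSet of hosts declaring that backup::set
```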
[14:22:04] check_mw_versions is firing a lot again - I'd like to increase the grace period for starters, but that alert needs fixing in general https://gerrit.wikimedia.org/r/c/operations/puppet/+/619482
[17:56:20] bstorm: I'm not sure I understand your last reply in the CR, if you want we can chat about it here :)
[17:56:49] 👋🏻
[17:57:47] I think I need to just generate a filtered list of nodes that would have a particular instance deployed on them. I was trying to basically filter execution based on the local existence of a file
[17:58:23] I don't want it to mark a host or execution as a failure if the file isn't there...and right now, that silly bit of shell script will return 1 if the file isn't there :)
[17:58:31] So I'll make another patch
[18:03:21] is the file managed by puppet by any chance?
[18:03:52] if you want to make it "successful" anyway you need something like: foo && bar || true
[18:04:41] It is
[18:04:56] I'm not suggesting using cumin's Command() class, which allows you to set the ok_codes (exit codes considered successful), because the 'bar' command you have might succeed or fail and using ok_codes would mask that
[18:05:01] true! However, I realize I can run a query and get the right values instead
[18:05:02] because if it's managed by puppet you can query it
[18:05:11] Yes that :)
[18:05:12] and have a different remote_hosts object
[18:05:18] with just the right hosts :)
[18:05:19] great
[18:05:31] Thanks :)
[18:05:33] R:File = /path/
[18:05:39] np :)
[18:05:43] sorry for bothering you
[19:10:37] Cumin querying is currently making me scratch my head...
[19:10:50] `R:Profile::Mariadb::Section = "s7" and P:wmcs::db::wikireplicas::mariadb_multiinstance` should also give me what I need. However, it doesn't.
[19:10:51] bstorm: what's up?
[19:11:02] R:Profile::Mariadb::Section = "s7" includes the hosts I want
[19:11:10] P:wmcs::db::wikireplicas::mariadb_multiinstance includes the hosts I want
[19:11:14] yes, because you can only query one resource per query with puppetdb
[19:11:20] aahhhhh
[19:11:22] so you need to use the global grammar in this case
[19:11:23] Damn ok
[19:11:25] to combine both
[19:11:32] P{query1} and P{query2}
[19:11:43] I'll try that :)
[19:11:57] wait, do you want both of them or those that are in both subsets?
[19:12:03] I was just figuring that I can query on precisely what I want instead of the file...which is an implementation detail
[19:12:13] I want an inner join
[19:12:17] :)
[19:12:18] lol
[19:12:22] ok 'and' it is
[19:12:33] thx
[19:13:19] That's it. I get the hosts I want with that version 🎉
[19:13:26] bstorm: it's mentioned in https://wikitech.wikimedia.org/wiki/Cumin#Features fwiw
[19:14:13] Ah yes "Only one main Resource per PuppetDB query can be specified"...however you only see that if you *read* the documentation instead of skimming it for what you want and banging on a keyboard
[19:14:24] 😁
[19:15:52] I'll have to read that doc more carefully next time
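To make the resolution above concrete: each P{...} block is its own PuppetDB subquery, and the global grammar's "and" keeps only the hosts that appear in both result sets (the "inner join"). A minimal sketch using cumin's documented Python API; the query strings are the ones pasted above, the rest is an assumption about how this would be wired into a script.

```python
import cumin
from cumin import query

config = cumin.Config()  # defaults to /etc/cumin/config.yaml

# Two separate PuppetDB subqueries, intersected by the global grammar.
combined = ('P{R:Profile::Mariadb::Section = "s7"} and '
            'P{P:wmcs::db::wikireplicas::mariadb_multiinstance}')

remote_hosts = query.Query(config).execute(combined)
print(remote_hosts)  # only the hosts matching both subqueries
```

In a cookbook the same query string should be usable with spicerack's remote().query(), which accepts the same cumin grammar.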
[19:38:35] Anyone have a guess why my spicerack patch (https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/643532/) is getting `pylint: no-member / Instance of 'list' has no 'difference_update' member (col 12)` for `spicerack/remote.py:515`? It's unrelated to my change but popped up after rebasing onto latest master, see https://integration.wikimedia.org/ci/job/tox-docker/15663/console for the tox run
[19:39:36] The line in question is this one: https://github.com/wikimedia/operations-software-spicerack/blob/4efbf34ca4c6f3f7e1460dd91b8a6eb216e32750/spicerack/remote.py#L515 and the tldr is `self.hosts` (https://github.com/wikimedia/operations-software-spicerack/blob/4efbf34ca4c6f3f7e1460dd91b8a6eb216e32750/spicerack/remote.py#L508) is apparently a `list`
[19:40:03] But looking at the definition of `self.hosts` it should return a `ClusterShell.NodeSet.NodeSet: a copy of the targeted hosts.`, see https://github.com/wikimedia/operations-software-spicerack/blob/4efbf34ca4c6f3f7e1460dd91b8a6eb216e32750/spicerack/remote.py#L378-L386
[19:49:33] ryankemper: fixed it
[19:49:38] ryankemper: but I can't tell you why :)
[19:49:44] but you got V+2 now, so go ahead :p
[19:49:51] ryankemper: just reply with recheck
[19:49:53] it's a bug in CI
[19:49:55] I just told it to repeat it ^
[19:50:00] did what volans said
[19:50:04] :D
[19:50:05] comment with the magic word
[19:50:10] Ah :) thanks
[19:50:18] that object is a nodeset and not a list
[19:50:39] the static analyzer in CI randomly fails, I've pointed it out previously, not sure if hasharAway has had a chance to look at it
[19:51:00] if jerkins says -1, always ask back if it's sure about that :)
[19:51:18] well not always, but if it's a weird case
[19:51:21] if it's -1, developers failed :]
[19:51:50] sounds good! that explains why I couldn't find a culprit git commit in any of the commits that were introduced by the rebase
[19:52:05] I am having dinner sorry but should be around in ~15 minutes. Or at worst bump the related phabricator task if there is any
[19:53:58] volans: I've merged the patch, I'm ready to test the new version at your leisure
[20:07:47] ryankemper: I'm preparing dinner, is it ok if I do a new release tomorrow morning EU time?
[20:08:24] volans: absolutely! there's no rush
[20:08:54] and in the future this stuff will be proper config so that you won't have to :P
[20:16:10] k thx
[20:26:14] no clue why it would fail in CI really
[20:26:49] maybe `prospector` gets confused at some point and mixes up list() and set()
[20:27:09] anyway it passed eventually ;)
[21:18:40] lists also have a copy(), which returns a list -- so one way you'd get that error is if self._hosts ever has a list value when that difference_update line is reached
[21:19:41] offhand I don't see anywhere that's obviously true, but it's not obvious to me that it's never true, either -- I think it's more likely that the checker is finding a real bug, even if it's in a corner case that we don't care about in prod
[21:20:11] and either way I'd much rather have it fixed than just recheck forever :) that confused me too when I was working on unrelated spicerack code
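For context on why the checker's complaint is at least plausible: ClusterShell's NodeSet is set-like and does have difference_update(), a plain list does not, and each type's copy() returns its own type. A minimal sketch illustrating the difference; the host names are made up and this is not the spicerack code itself.

```python
from ClusterShell.NodeSet import NodeSet

hosts = NodeSet("db10[01-05].example.org")

remaining = hosts.copy()                          # NodeSet.copy() -> NodeSet
remaining.difference_update(NodeSet("db1003.example.org"))
print(remaining)                                  # four hosts left, folded as db10[01-02,04-05].example.org

as_list = list(hosts)                             # plain list of host name strings
as_list_copy = as_list.copy()                     # list.copy() -> list
# as_list_copy.difference_update(...)             # AttributeError: lists have no difference_update,
#                                                 # which is what pylint flagged for remote.py:515
```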