[08:05:09] XioNoX: o/ - there is a BFD down event for the Telia link between cr1 eqiad/codfw, I tried to check but I don't see maintenance and I noticed an event happened ~18 hours ago (notified by Telia to us)
[08:05:42] no idea though if there is a problem with the link or not (I checked lights but from my super ignorant pov I didn't see much either)
[08:10:29] elukey: weird, pings go through
[08:11:56] and ospf is up too
[08:12:30] maybe we could test if slapping bfd resolves it (could be a temporary weird state of bfd only?)
[08:12:31] elukey: cr1-eqiad> clear bfd session address 208.80.153.221
[08:12:38] exactly :)
[08:12:42] yup :)
[08:14:04] there you go, solved
[08:14:09] not great though
[08:14:11] CRITICAL: the following (14) node(s) change every puppet run
[08:16:47] some of those are "WIP" nodes that are downtimed IIRC, and the check doesn't take that into account (there is a task about excluding downtimed nodes from the list)
[08:17:44] https://phabricator.wikimedia.org/T268211
[08:17:48] ah there is a patch :)
[08:18:02] yeah I know :)
[09:06:59] akosiaris: thanks for the reminder, I will keep an eye on backup1001 to see if the refresh is enough, otherwise I will restart the daemon when appropriate
[09:07:07] or any other issue
[09:07:57] (lots of running jobs, it being the 1st)
[09:24:17] yw
[09:26:40] I got a warning that an-master1002, bast3004, bast4002, bast5001 were removed from backups
[09:26:52] if someone thinks that is an error please ping me
[09:32:51] those bast hosts are still in service fwiw
[09:42:40] the backup was probably for some prometheus data and those are now pure bastions
[09:44:58] cool, that would explain it, thanks moritzm
[09:45:05] it doesn't hurt to double check :-)
[09:45:25] nobody worries about backups until it is too late :-D
[09:46:34] elukey: similar case for an-master1002? is it being decommed, or has its profile changed so it no longer needs backups?
[09:46:36] currently digging a little deeper, these still use profile::backup::host, but maybe there are simply no backup::sets defined
[09:47:04] yeah, that is ok - it means the backup infra is installed, but nothing would be scheduled
[09:48:20] jynus: thanks for the ping, so on an-master1002 we were backing up an-coord1001's db but that is not needed anymore, and there was also another backup IIRC for the hdfs namenode state
[09:48:24] going to double check
[09:48:39] let me see, I only see one job being removed
[09:48:41] there is actually some kind of error; bastion hosts have a backup::set for /home
[09:49:38] moritzm: that is correct, home backups are still there
[09:50:03] one per bast host, on 1002, 2002, 3004, 4002 and 5001
[09:50:15] so only one job was removed, the others are still there
[09:50:32] nice :)
[09:50:46] for an-master1002, only hadoop-namenode-backup is left
[09:50:56] ah, ok. does it show which job name vanished? otherwise I'll poke at git history to find the likely culprit
[09:51:03] moritzm: yes
[09:51:26] Monthly-1st-Mon-production-srv-tftpboot for bastion
[09:51:37] analytics-meta-mysql-lvm-backup for tracer
[09:51:43] I mean, an-master1002
[09:51:59] ^ moritzm, elukey all ok?
[09:52:07] +1
[09:52:11] cool
[09:52:29] jynus: gotcha, that makes sense and everything is in order, then. these are now served from install3001, install4001 and install5001
[09:52:30] thanks a lot for following up, next time I'll alert you (didn't think about it)
[09:52:48] no need for alerts, saw lots of things removed and wanted to double check
[09:52:59] you can self-serve check what is being backed up at:
[09:53:16] https://grafana.wikimedia.org/d/413r2vbWk/bacula
[09:53:47] (if you don't want to fight the bacula CLI :-))
[09:58:41] also because bacula was quite spammy about it
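Since the backup::set declarations are what drive which jobs get scheduled, one way to see which hosts still declare a given set is a PuppetDB resource query of the same kind used later in this log. A minimal sketch, assuming cumin's documented Python API and assuming backup::set shows up in PuppetDB as a Backup::Set resource; the set title "srv-tftpboot" is only a guess derived from the job name mentioned above.

```python
import cumin
from cumin import query

# Load the default cumin configuration (typically /etc/cumin/config.yaml).
config = cumin.Config()

# Hypothetical set title, guessed from "Monthly-1st-Mon-production-srv-tftpboot".
hosts = query.Query(config).execute('R:Backup::Set = "srv-tftpboot"')
print(hosts)  # folded NodeSet of hosts declaring that backup::set
```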
[14:22:04] check_mw_versions is firing a lot again - I'd like to increase the grace period for starters, but that alert needs fixing in general https://gerrit.wikimedia.org/r/c/operations/puppet/+/619482
[17:56:20] bstorm: I'm not sure I understand your last reply in the CR, if you want we can chat about it here :)
[17:56:49] 👋🏻
[17:57:47] I think I need to just generate a filtered list of nodes that would have a particular instance deployed on them. I was trying to basically filter execution based on the local existence of a file
[17:58:23] I don't want it to mark a host or execution as a failure if the file isn't there...and right now, that silly bit of shell script will return 1 if the file isn't there :)
[17:58:31] So I'll make another patch
[18:03:21] is the file managed by puppet by any chance?
[18:03:52] if you want to make it "successful" anyway you need something like: foo && bar || true
[18:04:41] It is
[18:04:56] I'm not suggesting using cumin's Command() class, which allows you to set the ok_codes (exit codes considered successful), because the 'bar' command you have might succeed or fail and using ok_codes would mask that
[18:05:01] true! However, I realize I can run a query and get the right values instead
[18:05:02] because if it's managed by puppet you can query it
[18:05:11] Yes that :)
[18:05:12] and have a different remote_hosts object
[18:05:18] with just the right hosts :)
[18:05:19] great
[18:05:31] Thanks :)
[18:05:33] R:File = /path/
[18:05:39] np :)
[18:05:43] sorry for bothering you
[19:10:37] Cumin querying is currently making me scratch my head...
[19:10:50] `R:Profile::Mariadb::Section = "s7" and P:wmcs::db::wikireplicas::mariadb_multiinstance` should also give me what I need. However, it doesn't.
[19:10:51] bstorm: what's up?
[19:11:02] R:Profile::Mariadb::Section = "s7" includes the hosts I want
[19:11:10] P:wmcs::db::wikireplicas::mariadb_multiinstance includes the hosts I want
[19:11:14] yes, because you can only query one resource per query with puppetdb
[19:11:20] aahhhhh
[19:11:22] so you need to use the global grammar in this case
[19:11:23] Damn ok
[19:11:25] to combine both
[19:11:32] P{query1} and P{query2}
[19:11:43] I'll try that :)
[19:11:57] wait, do you want both of them or those that are in both subsets?
[19:12:03] I was just figuring that I can query on precisely what I want instead of the file...which is an implementation detail
[19:12:13] I want an inner join
[19:12:17] :)
[19:12:18] lol
[19:12:22] ok 'and' it is
[19:12:33] thx
[19:13:19] That's it. I get the hosts I want with that version 🎉
[19:13:26] bstorm: it's mentioned in https://wikitech.wikimedia.org/wiki/Cumin#Features fwiw
[19:14:13] Ah yes "Only one main Resource per PuppetDB query can be specified"...however you only see that if you *read* the documentation instead of skimming it for what you want and banging on a keyboard
[19:14:24] 😁
[19:15:52] I'll have to read that doc more carefully next time
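To make the resolution above concrete: each P{...} block is its own PuppetDB subquery, and the global grammar's "and" keeps only the hosts that appear in both result sets (the "inner join"). A minimal sketch using cumin's documented Python API; the query strings are the ones pasted above, the rest is an assumption about how this would be wired into a script.

```python
import cumin
from cumin import query

config = cumin.Config()  # defaults to /etc/cumin/config.yaml

# Two separate PuppetDB subqueries, intersected by the global grammar.
combined = ('P{R:Profile::Mariadb::Section = "s7"} and '
            'P{P:wmcs::db::wikireplicas::mariadb_multiinstance}')

remote_hosts = query.Query(config).execute(combined)
print(remote_hosts)  # only the hosts matching both subqueries
```

In a cookbook the same query string should be usable with spicerack's remote().query(), which accepts the same cumin grammar.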
[19:38:35] Anyone have a guess why my spicerack patch (https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/643532/) is getting `pylint: no-member / Instance of 'list' has no 'difference_update' member (col 12)` for `spicerack/remote.py:515`? It's unrelated to my change but popped up after rebasing onto latest master, see https://integration.wikimedia.org/ci/job/tox-docker/15663/console for the tox run
[19:39:36] The line in question is this one: https://github.com/wikimedia/operations-software-spicerack/blob/4efbf34ca4c6f3f7e1460dd91b8a6eb216e32750/spicerack/remote.py#L515 and the tldr is `self.hosts` (https://github.com/wikimedia/operations-software-spicerack/blob/4efbf34ca4c6f3f7e1460dd91b8a6eb216e32750/spicerack/remote.py#L508) is apparently a `list`
[19:40:03] But looking at the definition of `self.hosts` it should return a `ClusterShell.NodeSet.NodeSet: a copy of the targeted hosts.`, see https://github.com/wikimedia/operations-software-spicerack/blob/4efbf34ca4c6f3f7e1460dd91b8a6eb216e32750/spicerack/remote.py#L378-L386
[19:49:33] ryankemper: fixed it
[19:49:38] ryankemper: but I can't tell you why :)
[19:49:44] but you got V+2 now, so go ahead :p
[19:49:51] ryankemper: just reply with recheck
[19:49:53] it's a bug in CI
[19:49:55] I just told it to repeat it ^
[19:50:00] did what volans said
[19:50:04] :D
[19:50:05] comment with the magic word
[19:50:10] Ah :) thanks
[19:50:18] that object is a nodeset and not a list
[19:50:39] the static analyzer in CI randomly fails, I've pointed it out previously, not sure if hasharAway has had a chance to look at it
[19:51:00] if jerkins says -1, always ask back if it's sure about that :)
[19:51:18] well not always, but if it's a weird case
[19:51:21] if it's -1, developers failed :]
[19:51:50] sounds good! that explains why I couldn't find a culprit git commit in any of the commits that were introduced by the rebase
[19:52:05] I am having dinner sorry but should be around in ~15 minutes. Or at worst bump the related phabricator task if there is any
[19:53:58] volans: I've merged the patch, I'm ready to test the new version at your leisure
[20:07:47] ryankemper: I'm preparing dinner, is it ok if I do a new release tomorrow morning EU time?
[20:08:24] volans: absolutely! there's no rush
[20:08:54] and in the future this stuff will be proper config so that you won't have to :P
[20:16:10] k thx
[20:26:14] no clue why it would fail in CI really
[20:26:49] maybe `prospector` gets confused at some point and mixes up list() and set()
[20:27:09] anyway it passed eventually ;)
[21:18:40] lists also have a copy(), which returns a list -- so one way you'd get that error is if self._hosts ever has a list value when that difference_update line is reached
[21:19:41] offhand I don't see anywhere that's obviously true, but it's not obvious to me that it's never true, either -- I think it's more likely that the checker is finding a real bug, even if it's in a corner case that we don't care about in prod
[21:20:11] and either way I'd much rather have it fixed than just recheck forever :) that confused me too when I was working on unrelated spicerack code
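For context on why the checker's complaint is at least plausible: ClusterShell's NodeSet is set-like and does have difference_update(), a plain list does not, and each type's copy() returns its own type. A minimal sketch illustrating the difference; the host names are made up and this is not the spicerack code itself.

```python
from ClusterShell.NodeSet import NodeSet

hosts = NodeSet("db10[01-05].example.org")

remaining = hosts.copy()                          # NodeSet.copy() -> NodeSet
remaining.difference_update(NodeSet("db1003.example.org"))
print(remaining)                                  # four hosts left, folded as db10[01-02,04-05].example.org

as_list = list(hosts)                             # plain list of host name strings
as_list_copy = as_list.copy()                     # list.copy() -> list
# as_list_copy.difference_update(...)             # AttributeError: lists have no difference_update,
#                                                 # which is what pylint flagged for remote.py:515
```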