[09:08:19] the pad for tomorrow is up
[09:09:34] thanks godog!
[09:10:21] yw volans
[12:55:46] godog: is there a way to downtime all of esams?
[12:56:11] I'm going to upgrade the current switch stack so that means downtime
[12:57:09] XioNoX: not sure about whole esams but for pages searching "#page esams" in icinga should show relevant alerts to then downtime
[12:58:11] I'm checking the hosts
[12:59:47] XioNoX: possibly downtiming the host groups for access switches will also downtime the hosts
[13:00:03] yeah, I don't think we can get them all
[13:00:26] regex in the icinga UI would be great
[13:00:38] like all the mgmt hosts, etc.
[13:00:53] the downtime cookbook will get all hosts, but doesn't help you with services I think
[13:02:19] ah?
[13:02:31] ah yeah that should do it, once the host is downtimed its services shouldn't be considered so that should work
[13:02:33] still better than nothing :)
[13:02:39] you mean cross-server service checks? the downtime cookbook also silences all the services running on a server
[13:03:36] cookbook sre.hosts.downtime -M 300 -r "maint work" A:esams
[13:03:42] what about any service checks that have a 'host' of icinga1001 or whatever
[13:03:48] should do the trick I (except mgmt)
[13:04:43] yeah I think the downtime cookbook plus downtiming pages for esams like text-lb.esams.wikimedia.org should cover most of it
[13:05:45] yeah, that. and maybe we can catch the mgmt via some clever host group dependency selection trick in the Icinga UI
[13:06:27] also there are a bunch of dependencies in icinga already from host to its network switch so if the latter is down then icinga should DTRT
[13:10:51] should I do this as selector: `P:cumin::target%site = esams` ?
[13:11:20] or `R:class%site = esams` ?
[13:11:34] trying to figure out what fits best from https://wikitech.wikimedia.org/wiki/Cumin
[13:11:36] A:esams should work
[13:11:55] you can check /etc/cumin/aliases.yaml if you want to make sure
[13:11:56] cool
[13:12:28] esams: P{P:cumin::target%site = esams}
[13:14:37] yeah A:esams will work then
[13:14:59] 'A:' is for 'alias' as defined in that file
[13:16:22] Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: bast3002.wikimedia.org,cp[3007-3008,3010,3030,3032-3036,3038-3047,3049].esams.wmnet,lvs[3001-3004].esams.wmnet,maerlant.wikimedia.org,multatuli.wikimedia.org,nescio.wikimedia.org
[13:16:23] 15:14 yay
[13:16:23] 15:15 and manually downtimed everything with esams in the name from the UI
[13:16:28] sent that in the wrong channel
[13:16:46] going to let the site drain a bit
[13:16:47] that alias list is maintained in puppet, if you miss anything useful as an alias, we can add it
[13:20:34] we could add an option to the cookbook to optionally downtime the mgmt too, or make another one for the mgmt
[13:20:44] we already downtime mgmt in the reimage script for example
[13:21:10] ideally the mgmt would be children of *at least* the mr1 router
[13:21:26] so as long as this is downtimed the other ones should not alert
[13:23:07] in the reimage script we ask for the mgmt hostname of wikimedia.org hosts, that would be a little pesky for something fleetwide, or are these being read from netbox by now?
[13:23:46] but in general it would be a nice option for the downtime cookbook indeed
[13:24:56] they are imported in netbox, not yet fully official but mostly usable
[13:25:00] and will be official very soon
[13:25:05] so yeah that's an option too
[13:25:18] nice
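The downtime cookbook output at 13:16:22 lists the downtimed hosts in folded notation (ClusterShell-style nodeset syntax, e.g. `cp[3007-3008,3010].esams.wmnet`). To cross-check that list against what Icinga shows, the folded form can be expanded into individual hostnames. The following is a minimal stdlib-only sketch, not part of cumin or the cookbook. It assumes at most one bracket group per host pattern and numeric ranges only, and the helper names `expand_nodeset` and `_split_top_level` are hypothetical. For real use, ClusterShell's `NodeSet` class handles the full syntax.

```python
import re

# Matches one bracketed range group, e.g. "[3007-3008,3010]".
_GROUP = re.compile(r"\[([^\]]+)\]")


def _split_top_level(folded: str) -> list[str]:
    """Split on commas that are not inside brackets (hypothetical helper)."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0
    for ch in folded:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_nodeset(folded: str) -> list[str]:
    """Expand a folded host list into individual hostnames.

    Simplified sketch: one bracket group per pattern, numeric ranges
    only; zero-padding of the range start (e.g. "08") is preserved.
    """
    hosts: list[str] = []
    for pattern in _split_top_level(folded):
        match = _GROUP.search(pattern)
        if match is None:
            # Plain hostname, no range group.
            hosts.append(pattern)
            continue
        prefix, suffix = pattern[: match.start()], pattern[match.end():]
        for item in match.group(1).split(","):
            if "-" in item:
                start, end = item.split("-", 1)
                width = len(start)
                for n in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{n:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts
```

Feeding it the host list from the 13:16:22 line yields 28 hostnames, from `bast3002.wikimedia.org` through `cp3007.esams.wmnet` ... `nescio.wikimedia.org`. Note that the mgmt hosts do not appear there, which matches the "except mgmt" caveat above.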