[09:23:25] ooooh
[09:24:17] _joe_: so, this relates to that cron yesterday! We want to make a change looking at the max-lagged wdqs server, but in order to do that correctly we need to know which servers are actually pooled
[09:24:26] using the lag from a depooled server is a bad idea
[09:24:40] the question is, what is the best way to find that information out
[09:26:29] <_joe_> oh you'll love this
[09:27:18] :D
[09:28:18] <_joe_> so the best way to know the status of a server is to make a request to the load-balancers to know the status of the servers
[09:28:33] <_joe_> but I think that information is now exported to prometheus somehow too
[09:29:23] sure
[09:29:34] pybal exposes those metrics to prometheus
[09:29:49] <_joe_> so given you're already querying prometheus
[09:30:14] oooooooh
[09:30:37] that sounds ideal, suddenly my slightly evil change is sounding less evil
[09:30:44] <_joe_> pybal_monitor_status{host="wdqs1005.eqiad.wmnet",monitor="ProxyFetch",service="wdqs-ssl_443"} 1.0
[09:31:04] <_joe_> so you just need to know which *service* you're interested in
[09:31:07] <_joe_> and this gets kinky
[09:31:09] <_joe_> :P
[09:31:18] <_joe_> we have two clusters
[09:31:36] <_joe_> wdqs-internal and wdqs
[09:31:42] Okay, I can make that work
[09:31:43] <_joe_> not sure which one you're interested in
[09:31:57] <_joe_> addshore: just make the "service" configurable
[09:32:04] <_joe_> or well
[09:32:05] So, I already make a prometheus query getting the lag state of all servers on the wdqs cluster
[09:32:11] <_joe_> ok
[09:32:15] <_joe_> so you have the hostnames
[09:32:23] then I just need to sort them, and start at the top and check if prometheus says they are pooled
[09:32:24] great!
[09:32:28] <_joe_> so you can just check pybal_monitor_status{host="wdqs1005.eqiad.wmnet"} for instance
[09:32:35] errr
[09:32:36] * addshore goes to test that
[09:32:37] that's not accurate
[09:32:57] <_joe_> vgutierrez: why?
[09:33:06] because the service can be down and the host still pooled
[09:33:07] <_joe_> oh right that's the monitor status
[09:33:10] <_joe_> not pooled status
[09:33:13] <_joe_> you're correct
[09:33:19] and that's the monitor status, you can have several monitors in one host
[09:33:44] <_joe_> vgutierrez: so we don't have the current up status in prometheus
[09:34:37] it looks like we don't have it, yeah
[09:34:46] <_joe_> uhm
[09:34:56] <_joe_> addshore: ok, it becomes evil again then
[09:35:42] so, monitor status = up or down? and it can be up while not pooled. ack
[09:35:52] <_joe_> and vice-versa
[09:35:54] <_joe_> yes
[09:36:09] so, maybe I do have to screen scrape those eqiad and codfw config pages? ;)
[09:36:16] <_joe_> no.
[09:36:30] <_joe_> mwmaint1002:~$ curl -H 'Accept: application/json' lvs1015:9090/pools/wdqs-internal_80
[09:36:34] <_joe_> this is what you want
[09:37:03] <_joe_> so you basically need to do as follows: accept, from configuration, a set of urls to check
[09:37:32] <_joe_> then for each of them, parse the response, and fill in the pooled/not pooled status
[09:37:44] <_joe_> we can make it more complicated if you want :P
[09:38:05] sounds good
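The approach described here — fetch a configured set of pool URLs, parse each response, and only consider pooled hosts when picking the max-lag server — could look roughly like the minimal sketch below. This is not the change that was actually written: the JSON shape of the /pools/<service> response, the timeout, and the lag-per-host input are assumptions for illustration; only the lvs host and service name come from the curl example above.

```python
import requests

# The pool URLs would come from configuration, as suggested above.
POOL_URLS = [
    "http://lvs1015:9090/pools/wdqs-internal_80",
]


def pooled_hosts(pool_urls):
    """Return the set of backend hosts that PyBal currently reports as pooled."""
    pooled = set()
    for url in pool_urls:
        resp = requests.get(url, headers={"Accept": "application/json"}, timeout=5)
        resp.raise_for_status()
        # Assumption: the JSON maps "host.eqiad.wmnet" to a dict that includes
        # a boolean "pooled" field; the real response shape may differ.
        for host, state in resp.json().items():
            if state.get("pooled"):
                pooled.add(host)
    return pooled


def max_pooled_lag(lag_by_host, pool_urls):
    """Highest lag among pooled servers only, ignoring depooled ones."""
    pooled = pooled_hosts(pool_urls)
    lags = [lag for host, lag in lag_by_host.items() if host in pooled]
    return max(lags) if lags else None


# lag_by_host would come from the existing Prometheus query over the wdqs
# cluster, e.g. {"wdqs1005.eqiad.wmnet": 12.0, "wdqs1006.eqiad.wmnet": 3.5}
```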
[09:38:15] which lvs should I use for codfw?
[09:38:29] <_joe_> that's a config value that I will fill in :P
[09:38:34] <_joe_> all the urls, actually
[09:39:14] <_joe_> this is kind of a problem I have to solve at this point
[09:39:47] <_joe_> we can go quick and dirty for now, but at some point we need to expose the pooled status of services from a consistent, predictable URL somewhere
[09:40:09] yup
[09:40:15] <_joe_> something like
[09:40:43] <_joe_> lb-status.eqiad.wmnet/status/wdqs1003.eqiad.wmnet
[09:40:55] <_joe_> lb-status.eqiad.wmnet/status/wdqs1003.eqiad.wmnet/wdqs-internal
[09:41:15] <_joe_> ladies and gentlemen, here comes balancoid
[09:41:19] xD
[09:42:01] /o\
[09:42:42] <_joe_> vgutierrez: don't worry, I'll use nodejs
[09:43:38] <_joe_> jokes aside, it's like 60 lines of python we could host on kubernetes
[09:43:51] indeed
[09:44:30] _joe_: I had all of these sorts of things in mind when you were talking about self-service stateless microservices too. But I now get that those would mainly be for outside consumption
[09:44:32] <_joe_> addshore: so you can either wait for us to do this aggregated API, or do a quick and dirty version now
[09:44:40] * addshore will do quick and dirty for now
[09:44:56] it's less dirty than screen scraping something :)
[09:48:33] <_joe_> well technically you're just getting the correct info instead of just the state in etcd
[09:48:46] <_joe_> which is the configuration
[09:55:23] I like correct info :)
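For reference, the "like 60 lines of python" aggregated API floated above could be as small as the sketch below: a tiny service answering lb-status-style URLs such as /status/<host> by aggregating the per-LVS pools endpoints. Everything here — the endpoint list, the assumed JSON shape, the URL layout and port — is a guess at what such a service might look like, not a description of anything that exists.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Hypothetical list of PyBal pools endpoints to aggregate; a real deployment
# would cover every load balancer and service.
POOLS_ENDPOINTS = [
    "http://lvs1015:9090/pools/wdqs-internal_80",
    "http://lvs1015:9090/pools/wdqs-ssl_443",
]


def host_status(host):
    """Collect one host's state from every pools endpoint that knows about it."""
    status = {}
    for url in POOLS_ENDPOINTS:
        data = requests.get(url, headers={"Accept": "application/json"}, timeout=5).json()
        if host in data:  # assumed JSON shape: hostname -> state dict
            service = url.rsplit("/", 1)[-1]
            status[service] = data[host]
    return status


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /status/wdqs1003.eqiad.wmnet
        if not self.path.startswith("/status/"):
            self.send_error(404)
            return
        body = json.dumps(host_status(self.path[len("/status/"):])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8080), StatusHandler).serve_forever()
```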
[20:10:51] the icinga server itself alerts about degraded systemd state
[20:11:00] looking at what that is.. it's been over a day though
[20:11:27] Check the last execution of sync_check_icinga_contacts
[20:11:44] uhm.. where does it sync them? I assume to the meta-monitoring?
[20:11:46] yes
[20:12:37] 0 loaded units listed. Pass --all to see loaded but inactive units, too.
[20:12:53] * mutante does reset-failed anyways.. but ..
[20:13:15] RECOVERY - Check systemd state on icinga1001 is OK ... shrug
[20:13:21] that's... interesting
[20:13:25] Nov 21 20:11:02 wikitech-static check_icinga_validate_config[17923]: Valid config, nothing to do.
[20:13:30] most recent output on wikitech-static
[20:13:41] what about "CRITICAL: Status of the systemd unit sync_check_icinga_contacts"
[20:13:53] rescheduling checks
[20:14:36] ah, it is a timer that runs once a day, and the last run failed
[20:15:24] aha
[20:15:34] what was the error?
[20:16:12] still happening
[20:16:15] it's all documented here https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring
[20:16:20] I can have a look
[20:16:27] I'm looking
[20:16:41] Main process exited, code=exited, status=1/FAILURE
[20:16:46] Failed validation of new contacts file. Aborting.
[20:17:06] the latest contact added was for the IRC bot output in the dcops channel
[20:17:12] team-dcops etc
[20:17:21] that matches the timeframe
[20:17:39] chaomodus: cc: ^
[20:17:41] current_object[parts[0]] = parts[1]
[20:17:41] IndexError: list index out of range
[20:17:44] checking
[20:19:16] we might need to blacklist it maybe, checking
[20:20:11] yep
[20:20:13] sending fix
[20:20:36] fwiw, icinga config syntax check is happy with the contacts
[20:20:38] ok
[20:20:55] ahahaha
[20:21:12] I just need to check why exactly I need to skip it
[20:21:19] volans: I think I know why
[20:21:51] tell me
[20:22:04] because I'm skipping team-operations but it's in the contacts
[20:22:08] not a contactgroup
[20:23:13] there is more than one team- contact. they are just contacts that use a mailing list etc
[20:23:44] there isn't necessarily a group with that name
[20:25:14] volans: CR for you
[20:25:45] trailing newline? :p heh, ok
[20:26:14] I'm pretty sure that's it
[20:26:46] replied
[20:26:54] it would explain why it wasn't an issue before with the other team contacts this was copied from. ack
[20:26:54] I can test it, I've got the thing ready
[20:27:39] I am pretty sure this is it
[20:27:45] yep, that fixes it
[20:27:56] nice
[20:28:04] I've commented in the CR
[20:28:08] it's the piece that was added to the config, it is part of an object definition (so we are 'in_object'), there is no trailing newline in the last block, and
[20:28:23] '}'.split will indeed not return multiple elements
[20:29:28] I'll leave it to you from now on :)
[20:29:32] * volans backs off
[20:29:56] uh oh
[20:30:16] not what you want to hear about python code that parses an esoteric custom config format ;)
[22:17:15] cdanis: lol I meant for the deploy :)
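A simplified, hypothetical reconstruction of the failure mode diagnosed above may help: a parser that assumes every object block ends with a "}" line followed by a newline. The real sync_check_icinga_contacts code is different and the sample block is invented; this only shows why a missing trailing newline after the final "}" turns the closing brace into a "key value" line and raises the IndexError from the traceback.

```python
# Invented sample contact block, with no newline after the final closing brace.
SAMPLE_NO_TRAILING_NEWLINE = (
    "define contact {\n"
    "    contact_name   team-dcops\n"
    "    email          dcops@example.org\n"
    "}"  # no trailing "\n" here
)


def parse_contacts_buggy(text):
    """Naive parser that assumes every object block ends with '}' *plus* a newline."""
    contacts = []
    current_object = None
    in_object = False
    for line in text.splitlines(keepends=True):
        if line.rstrip().endswith("{"):       # start of an object definition
            current_object, in_object = {}, True
        elif line == "}\n":                   # never matches a final "}" with no newline
            contacts.append(current_object)
            in_object = False
        elif in_object and line.strip():      # everything else is treated as "key value"
            parts = line.split()
            current_object[parts[0]] = parts[1]   # "}".split() == ["}"], so IndexError
    return contacts


# parse_contacts_buggy(SAMPLE_NO_TRAILING_NEWLINE) raises
# "IndexError: list index out of range"; appending a final newline to the input,
# or comparing line.strip() == "}" instead, lets it parse cleanly.
```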