[06:19:40] good morning
[06:19:59] I cannot reach the mgmt console of mc2028, the host is down
[06:20:43] impact is very limited, the mcrouters in codfw are showing tkos for the failed shard
[06:20:46] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=codfw%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&from=now-6h&to=now
[06:20:57] (so basically what gets replicated to codfw)
[06:21:42] but we are currently using the codfw gutter
[06:21:43] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All&from=now-6h&to=now
[06:21:51] (I think it is the first time)
[06:23:46] ah and also in theory redis on mc1028 is not replicated anymore to codfw due to the host being down
[06:25:13] <_joe_> elukey: yes, that's a problem we probably want to solve before the dc switchover
[06:25:22] <_joe_> rzl / effie ^^
[06:26:12] * volans double checking it's not a dns issue due to recent automations
[06:27:39] all looks good on that side (record matches the old manual one)
[06:31:32] according to icinga both went down at the same time more or less
[06:31:32] (host and mgmt)
[06:35:31] volans: yep I see the port down on the switch
[06:40:22] opening a task
[06:41:29] thx
[06:44:39] https://phabricator.wikimedia.org/T260224
[06:45:16] in theory we can wait in this state for Papaul to check what happened, and remove the host from the mcrouter config only if it is a permanent failure
[06:53:01] <_joe_> yes
[06:53:13] <_joe_> also if it's permanent we can go with an async replication for redis
[06:53:28] <_joe_> puppet supports having multiple shards on a single server, in case of need
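(For the TKO check discussed above: a minimal sketch of querying the Prometheus API directly instead of reading it off the Grafana dashboards. The Prometheus URL and the `mcrouter_servers`/`state` metric and label names are assumptions for illustration only, not verified production values.)

```python
#!/usr/bin/env python3
"""Rough check for mcrouter TKOs via the Prometheus HTTP API.

Assumptions (not verified against production): the endpoint URL and the
metric/label names below are placeholders.
"""
import requests

# Hypothetical codfw Prometheus "ops" instance; adjust to the real endpoint.
PROMETHEUS_URL = "http://prometheus.example.org/ops/api/v1/query"
# Hypothetical metric: per-backend server state as exported by mcrouter.
QUERY = 'sum by (memcached_server) (mcrouter_servers{state="tko"})'


def tko_servers():
    """Return the memcached shards currently reported as TKO, if any."""
    resp = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {
        r["metric"].get("memcached_server", "unknown"): float(r["value"][1])
        for r in results
        if float(r["value"][1]) > 0
    }


if __name__ == "__main__":
    for server, count in tko_servers().items():
        print(f"TKO: {server} ({count:g} proxies reporting)")
```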
[09:06:39] godog: found a new weirdness in pontoon - it looks like maybe rsyslog is broken
[09:07:50] kormat: oh! broken badly ?
[09:08:08] godog: only in that it stopped logging anything 3-4 days ago
[09:10:01] i tried restarting it on one machine; on the third restart, it finally managed:
[09:10:06] Aug 9 02:23:41 zarcillo0 diamond[25494]: Queue full, check handlers for delays
[09:10:06] Aug 12 08:58:32 zarcillo0 systemd[1]: Stopping System Logging Service...
[09:10:12] and that was it, nothing further
[09:12:49] :( ok LMK if you find sth obvious, I haven't run into that yet, will be able to take a look later
[09:24:53] godog: the issue is caused by /etc/rsyslog.d/30-remote-syslog.conf
[09:25:00] if i delete that file and restart rsyslog, then it functions again
[10:52:07] godog: the victorops app just stopped working for me saying that my SSO credentials have expired, however the same user/pass does work on portal.victorops.com
[10:55:49] This is weird, it worked again without me doing anything :-/
[10:57:51] something similar happened to me, I closed the app and relogged in for it to work (for some reason it wanted to use SSO?)
[11:00:36] yeah, the second time it didn't ask me for user/pass, it just logged in
[12:01:39] marostegui: gah :( so it "recovered" without doing anything
[12:02:28] yep
[12:11:23] godog: fuh. getting cumin to work in pontoon is a royal clusterfuck (pun intended)
[12:14:40] godog: if there was some way of having a per-pontoon-stack 'private' repo, or being able to override parts of the existing labs/private one, that would remove a huge headache
[12:14:48] (but i'm kinda guessing the answer is 'lolno')
[12:23:57] kormat: mmhh we could arrange that yeah, what's the issue(s) ATM ?
[12:24:53] godog: labs/private contains a bogus cumin_master ssh key. that gets installed on the cumin pontoon host, and cumin (the tool) tries to use it
[12:25:21] and from there everything gets terrible
[12:25:47] if i overwrite the key on the cumin host, puppet nukes it next run
[12:26:02] so i need to have puppet itself honour the key
[12:26:45] mmhh yeah, I wonder if having a valid keypair in there would make the problem better or ideally go away ?
[12:26:54] as opposed to SNAKEOIL
[12:27:19] soo, it would 'fix' the problem, but it would also publish a key that has root access to a bunch of VMs
[12:27:57] ok so that's a non-starter obviously
[12:28:34] yeah :)
[12:28:58] oh, huh
[12:29:55] a 'solution' comes to mind: manage /var/lib/git/labs/private/ on the pontoon puppetmaster in the same way we manage the puppet repo
[12:30:10] that way we can put secrets in there that will never leave the pontoon project
[12:31:35] yeah that might work, at least it is a start
[12:31:55] i'll give it a shot. thanks for being my 🦆
[12:32:58] for sure!
[12:35:00] FWIW the pie in the sky in my mind for the "private repo" story is to have a public description/manifest of the private material we want, and then we can (re)generate it ad-hoc anytime we want
[12:48:48] oh thank $deity. it works \o/
[12:54:55] neat kormat!
[12:55:25] godog: i spent 4h on this before having that 💡 moment. this is a big relief
[12:56:06] kormat: easy to believe! ducks seldom disappoint
[12:56:21] :D
[13:02:55] godog: re: pie in the sky, that's pretty much what i had done for $lastjob. the fake private data was generated locally for that specific test env
[13:04:23] aye, that'd be The Way™ to do it
[13:05:17] https://i.redd.it/mrvjjbpi2p541.jpg
[13:05:25] haha
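(Along the lines of the "public manifest, regenerate the private material ad-hoc" idea above: a minimal sketch of generating throwaway secrets and an SSH keypair for a test-only private repo. The manifest format, output paths and key names are invented for illustration and do not reflect the actual labs/private layout.)

```python
#!/usr/bin/env python3
"""Generate throwaway 'private' material for a test puppetmaster.

Everything here is illustrative: the manifest format, output paths and
key names are made up, not the real labs/private layout.
"""
import secrets
import subprocess
from pathlib import Path

# Hypothetical manifest: what private material the test env needs.
MANIFEST = {
    "passwords": ["mysql_root", "redis_main"],
    "ssh_keypairs": ["cumin_master"],
}

OUT_DIR = Path("/srv/fake-private")  # hypothetical destination


def generate(manifest: dict, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Random, non-production passwords.
    for name in manifest["passwords"]:
        (out_dir / f"{name}.secret").write_text(secrets.token_hex(32) + "\n")
    # Real (but throwaway) keypairs, so tools like cumin can actually use them.
    for name in manifest["ssh_keypairs"]:
        key_path = out_dir / name
        if not key_path.exists():
            subprocess.run(
                ["ssh-keygen", "-t", "ed25519", "-N", "", "-C", f"{name}@test",
                 "-f", str(key_path)],
                check=True,
            )


if __name__ == "__main__":
    generate(MANIFEST, OUT_DIR)
```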
[20:46:58] I'm working on moving some logic from the `Cookbooks` repo to `spicerack`. The method I'm moving from cookbooks makes a prometheus query with `spicerack.prometheus()`, how can I access the same object within the spicerack repo?
[20:47:32] From https://doc.wikimedia.org/spicerack/master/introduction.html#spicerack-automation-framework-for-the-wmf-production-infrastructure it looks like Cookbooks define a `run(args, spicerack)` function that gets called, but I'm not sure what originally creates/passes in the `spicerack` object
[20:48:58] hey ryankemper
[20:49:20] hey
[20:49:47] I wonder if I can just do something like `from spicerack.prometheus import prometheus`
[20:50:01] so, from the Cookbooks PoV everything is accessible via the spicerack object, which is an instance of Spicerack()
[20:50:28] set up by cookbook.py before calling the specific cookbook
[20:50:42] as described in https://doc.wikimedia.org/spicerack/master/introduction.html
[20:50:42] right
[20:51:12] now, you can totally import other spicerack modules from within spicerack
[20:51:37] and in this case the Prometheus class doesn't have any __init__ that requires specific parameters
[20:51:48] looks like there's a https://doc.wikimedia.org/spicerack/master/api/spicerack.prometheus.html, so probably something like `from spicerack.prometheus import Prometheus`
[20:51:53] so you can totally just from spicerack.prometheus import Prometheus
[20:52:06] and then use Prometheus.query()
[20:52:18] perfect, thanks for explaining that
[20:52:23] *but*, if what you're doing
[20:52:39] can be easily generalized, maybe it's worth adding to the prometheus module itself
[20:52:43] not sure what you want to do
[20:52:51] ofc if it's specific to ES stuff
[20:52:54] Yeah, in this case it's a very specific elasticsearch use case where we need to make a certain query
[20:52:55] it's ok to have it there
[20:53:02] k
[21:01:20] ryankemper: for development you can run 'tox' locally, which runs all the checks of CI (as long as you have at least one python version of the ones supported)
[21:01:34] CI will run them for all versions, to be clear
[21:01:50] if you need to re-run a single one: tox -e py37-unit # for example
[21:02:31] and to run only specific tests: tox -e py37-unit -- -k test_elasticsearch_cluster
[21:02:34] for example
[21:05:01] lmk if you need a hand for those or the type hint stuff
[21:23:19] volans: thanks I was just looking into running it locally
[21:23:25] that will help a lot
[21:23:53] tox -av # to list all envs is also helpful
[21:29:26] you can also cheat, ryankemper, and use the tox venv as a venv to do other local testing of your changes :)
[21:43:46] I always do that, I have tox manage my venvs, I just . .tox/py37-unit/bin/activate
[21:43:50] and do stuff :D
[21:43:55] yeah it's great
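(Putting the advice above together: a minimal sketch of what the moved code could look like inside the spicerack elasticsearch module. The helper name, the PromQL expression and the exact query() call signature are assumptions for illustration; check the spicerack.prometheus API docs linked above for the real interface.)

```python
"""Sketch of using the Prometheus class from within spicerack itself."""
from spicerack.prometheus import Prometheus


def old_jvm_instances_count(datacenter: str) -> int:
    """Hypothetical helper: count elasticsearch instances matching a query."""
    prometheus = Prometheus()  # no required __init__ parameters, per the chat/docs
    # Hypothetical metric and labels; assuming query() takes a PromQL string
    # plus the target datacenter/site (verify against the API docs).
    results = prometheus.query(
        'elasticsearch_jvm_uptime_seconds{cluster="production-search"}',
        datacenter,
    )
    return len(results)
```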