[07:43:53] so pebcak of the day: I didn't know that some icinga alerts have the complete warn/crit message on multi-lines, so only expanding the alert to get details brings you to what is needed (like the host list etc.)
[07:43:57] so I updated
[07:43:58] https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate
[07:44:04] https://wikitech.wikimedia.org/wiki/DNS/Discovery#Discrepancy
[14:00:28] Something weird is happening for me with tmux on cumin1001
[14:01:24] when executing commands with sudo, the TMUX env var is apparently unset, breaking e.g. reimaging and similar scripts. I can of course work around it by sudo -i to a shell and then running `TMUX=bla script args`, but AIUI, that should not be required.
[14:03:02] klausman: do you have a minimal example of it not working?
[14:03:19] sec
[14:03:52] https://phabricator.wikimedia.org/P14483
[14:05:42] klausman: i get the same results. but i don't have any _issues_
[14:06:08] So `sudo wmf-auto-reimage ...` would just work for you?
[14:06:31] i haven't run that particular command in a while, but i haven't had any issues in the past
[14:07:35] I've extended the paste, can you try the command and see if it works for you?
[14:09:40] klausman: it works for me
[14:09:44] wtf.
[14:10:54] klausman: what's your $TERM? https://phabricator.wikimedia.org/P14483#78232
[14:10:59] Must be something in my tmux conf.
[14:11:02] Yeah.
[14:11:10] i have `screen-256color`
[14:11:18] checking for $TERM is a bit... I mean there is a TMUX var for a reason
[14:11:29] sure. but it doesn't get propagated to root :P
[14:11:58] Yeah, so the script assumes that it is.
[14:12:06] blame volans
[14:12:14] I leave that to an expert like you
[14:13:42] Did you ctrl-c the creation or let it proceed?
[14:14:05] i let it proceed. just typed 'go' at the dns prompt.
[14:14:12] ok
[14:15:32] klausman: what _is_ your $TERM btw?
[14:15:47] tmux-256color
[14:18:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/666899
[14:18:11] Who would be the right reviewer?
[14:19:37] klausman: have added me and moritz, could you also add STY while you are there
[14:19:51] the existing $HOME has already caused massive issues. i feel nervous about adding other vars there.
[14:20:51] * volans feels summoned, what's up?
[14:21:45] volans: the detection of "is in screen/tmux?" is bad, and should feel bad
[14:22:11] kormat: do you have proposals to improve it?
[14:22:23] patches are welcome! ;)
[14:22:47] tl;dr ensure_shell_is_durable() doesn't see the user's STY or TMUX env variables
[14:22:49] volans: simplest fix would be to also look for 'tmux' in $TERM. if you wanted to be fancy, looking at parent processes to find screen/tmux would be better
[14:23:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/666899 One proposed patch.
[14:23:16] * when using with sudo
[14:23:27] Are we sure ensure_shell_is_durable() is the only place this breaks?
[14:23:42] kormat: we do check `and 'screen' not in os.getenv('TERM', '')`
[14:23:53] if tmux can be there too let's add it
[14:23:54] volans: yes. we do not check for 'tmux'.
[14:23:59] 👍
[14:24:31] If we do that we should also remove the TMUX check in the same place since it is misleading
[14:24:32] kormat: can there be a false positive? have 'tmux' in TERM but not being in a TMUX session?
[14:24:59] IIRC I tested both the tmux then sudo and the sudo then tmux
[14:25:25] Your tmux config likely doesn't change TERM, so it would be screen-something
[14:25:28] Mine does.
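(For reference, a minimal sketch of the kind of durability detection discussed above. This is not the actual wmflib/spicerack code; the function name and exact logic are assumptions reconstructed from the conversation, i.e. trust STY/TMUX when present and fall back to TERM because sudo strips those variables.)

    import os

    def shell_looks_durable() -> bool:
        """Best-effort guess at whether the current shell runs inside screen or tmux.

        STY/TMUX are authoritative when present, but sudo strips them from the
        environment, so TERM (screen-256color, tmux-256color, ...) is used as a
        fallback hint.
        """
        if os.getenv('STY') or os.getenv('TMUX'):
            return True
        term = os.getenv('TERM', '')
        return 'screen' in term or 'tmux' in term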
[14:25:44] i use tmux but have TERM=screen which is why it works for me fyi
[14:26:48] I've had assorted tools assume screen can't ever be more than 16 colors.
[14:27:04] (yes, even with screen-256colors)
[14:27:26] I mean, I can change my puppet-copy of tmux.conf and we'll ignore all of this.
[14:30:45] jbond42: that might be the default, I don't recall having done anything special to have it
[14:30:48] and I get the same
[14:31:01] "< volans> patches are welcome! ;)" https://gerrit.wikimedia.org/r/c/operations/software/pywmflib/+/666902
[14:31:23] jbond42: can we have it without the extra formatting?
[14:31:28] now available with reformatting!
[14:31:46] god damn black integration :P
[14:32:16] IMHO the pre-commit or pre-review hook per-repository is better ;)
[14:32:41] any pre-commit hook would do exactly the same thing though, right?
[14:32:55] would be only on spicerack and not on wmflib
[14:33:14] on a per-repository basis
[14:33:17] not a fan of pre-commit hooks for reformatting
[14:33:25] <_joe_> same
[14:33:26] unless it would just error if the formatting isn't correct
[14:33:28] warning about it, maybe
[14:33:34] I went the pre-review way for now
[14:33:38] same, it means i have to install a pre-commit hook for every repo. would much prefer to just add it to my editor
[14:33:42] I always check the diff on gerrit anyway
[14:33:46] <_joe_> I use format-on-save since I started to use golang
[14:33:53] <_joe_> it's incredibly useful
[14:34:00] I wonder how complex it would be to have a format-check for the changed lines only.
[14:34:15] klausman: <----> (not to scale)
[14:34:18] <_joe_> klausman: depends on the formatter
[14:34:18] Or *added* lines, really.
[14:34:37] <_joe_> klausman: black specifically would most likely need to be rewritten
[14:35:01] The cheap/crappy way is to make it into a warning and hope the human DTRT
[14:35:34] <_joe_> IMO the correct way of unifying on a formatter is to make one formatting commit
[14:35:48] <_joe_> and then have CI reject any not-well-formatted change
[14:35:49] ofc that's the only way
[14:36:01] and add that to a file to ignore it on git blame
[14:36:10] <_joe_> and yes, I keep format-on-save overrides per-repo
[14:36:22] <_joe_> volans: add what?
[14:36:41] https://doc.wikimedia.org/spicerack/master/development.html#git-blame
[14:37:17] <_joe_> oh you mean the formatting commit
[14:37:25] <_joe_> yeah
[14:37:26] yes, sorry, forgot the subject ;)
[14:37:38] <_joe_> yeah your sentence just didn't compute :P
[14:42:25] volans: last time i looked into pre-receive hooks there were a lot of warnings that one should not modify the commit in this hook as it can cause circular dependencies and other issues. im also not sure how it would interact with signed commits
[14:43:43] jbond42: why pre-receive?
[14:43:57] in the doc I suggest pre-commit or pre-review
[14:45:49] volans: ahh ok, not familiar with the pre-review hook, but if it's a client side hook then again one has to install it for each repo and there is no sound way to force a client side hook (i also don't think it's in the spirit of git to force client side hooks, but let's not tangent again). if it's a server side hook, same issue as pre-receive
[14:48:12] it's a client hook and you have to install it yourself because it's bad to impose those, it's also up to you because maybe you have a different integration with your editor/IDE
[14:48:16] so it's totally up to you
[14:48:25] what CI does is enforce that the syntax is correct
[14:49:01] this is for spicerack, because it was migrated to use black as formatter recently as an experiment to see how it goes with everyone
[14:53:00] volans: ack, perhaps i was overzealous in adding autocmd BufWritePre *.py execute ':Black'
[14:55:13] :)
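(A minimal sketch of the "just error if the formatting isn't correct" hook idea floated above, as opposed to a hook that rewrites the commit. It is illustrative only, not the hook any wmf repository actually installs; it only assumes git and black are on $PATH.)

    #!/usr/bin/env python3
    """Toy pre-commit check: refuse the commit if staged Python files are not black-clean."""
    import subprocess
    import sys

    # List staged files (added/copied/modified) and keep only the Python ones.
    staged = subprocess.run(
        ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM'],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    py_files = [f for f in staged if f.endswith('.py')]

    if py_files:
        # --check only reports; it never rewrites the files, so the commit content stays untouched.
        if subprocess.run(['black', '--check', '--quiet', *py_files]).returncode != 0:
            print('black would reformat the staged files; run black and re-stage them', file=sys.stderr)
            sys.exit(1)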
[15:54:05] akosiaris: any success/sob stories re: kubernetes-node.cfg?
[16:07:33] more restbase issues - throttling again on bcl.wikipedia.org, wikitext to html transformation from de.wikipedia.org
[16:08:42] kormat: sorry, I was in between meetings and forgot about that shell in cumin
[16:08:46] I am happy to say it says
[16:08:48] 15:49:36 | kubestage2002.codfw.wmnet | [info] kubestage2002 done.
[16:08:48] 15:49:36 | kubestage2002.codfw.wmnet | Reimage completed
[16:08:48] 15:49:36 | kubestage2002.codfw.wmnet | REIMAGE END | retcode=0
[16:08:48] 15:49:42 | END
[16:08:57] so +1 :-)
[16:09:04] excellent 🎉
[16:17:06] <_joe_> hnowlan: what is bcl?
[16:17:43] _joe_: cebuano
[16:18:12] <_joe_> ok, one thing we can do to get better info
[16:18:21] <_joe_> is to make envoy log 429s
[16:19:13] hnowlan: one thing I'm still unclear on from looking yesterday:
[16:19:16] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-restbase-2021.02.24?id=H0tT1XcBsCn0xdb8Djvf
[16:19:16] could someone be doing a bulk, automated translate from one wiki to another? doesn't look like there's lots of changes on that wiki recently though
[16:19:42] ^ the ::ffff:10.64.0.100 in that logstash (which is rb1019's IP, but in a strange form) - is that from an rb->rb request?
[16:19:51] <_joe_> bblack: no
[16:20:03] <_joe_> at face value, that's a request from VE
[16:20:05] are the root urls still de wikipedia?
[16:20:07] <_joe_> so a client
[16:20:11] apergos: yep
[16:20:34] huh, same as yesterday then
[16:20:37] so why does the client-IP field, as reported to logstash by rb (I'm assuming?), have an rb server's IP?
[16:20:59] that's what's confusing me about that
[16:21:00] <_joe_> I would make no assumption about how restbase logs what
[16:21:13] <_joe_> but yes, if we have a rule that makes restbase call its lvs
[16:21:16] <_joe_> ip
[16:21:26] <_joe_> (which I don't think we do)
[16:21:32] <_joe_> then you might see that
[16:21:36] either way, the more-relevant question is: could the odd form of its own IP there be fooling its own ratelimiter?
[16:21:50] <_joe_> bblack: possibly, if that is not the actual IP
[16:21:58] i.e. there's supposed to be an exclusion to ratelimiting for internal svc->svc stuff, and this fails that check when it shouldn't
[16:22:11] <_joe_> bblack: no I don't think there is that exclusion
[16:22:29] <_joe_> but at this point this looks like something we have to involve rb devs into
[16:22:38] well, there is such a ratelimit + exclusion at the cache layer
[16:22:56] but that would only apply if this is looping back through the public caches and keeping this ::ffff: thing in XFF
[16:23:16] how many requests, internal or not, can it handle anyway? maybe that needs to be a hard limit
[16:23:33] except that there are retrieval requests from the okapi folks but that's a separate deal
[16:23:43] maybe I can dig in turnilo to confirm whether these IPs show up at the cache layer
[16:28:52] hmmm no, they don't appear to. at least not often enough that they show up in 1/128
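(Side note on the "strange form" above: ::ffff:10.64.0.100 is an IPv4-mapped IPv6 address, which is how an IPv4 client appears on a dual-stack socket. Below is a small Python illustration of normalising it back to the dotted-quad form; restbase itself is Node.js, so this only shows the mapping, not what any of the services discussed actually do.)

    import ipaddress

    def normalize_ip(raw: str) -> str:
        """Collapse an IPv4-mapped IPv6 address to plain IPv4, so that allow-lists
        or rate-limit exclusions keyed on 10.64.0.100 also match ::ffff:10.64.0.100."""
        addr = ipaddress.ip_address(raw)
        if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
            return str(addr.ipv4_mapped)
        return str(addr)

    print(normalize_ip('::ffff:10.64.0.100'))  # -> 10.64.0.100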
[18:51:41] Anyone know what condition leads to a node being removed from `puppetdb`? I don't see `wdqs2008` in the puppetdb, but its last puppet run was 11 days ago, and I thought it took ~2 weeks for a host to get removed from puppetdb?
[18:51:57] (Unless it's not puppet not having been run in 2 weeks but some other condition that leads to the auto-removal)
[18:54:29] ryankemper: it's removed after *one* week without running puppet, according to https://wikitech.wikimedia.org/wiki/Puppet
[18:56:32] rzl: thanks, was having trouble finding where in the docs it said that
[18:58:12] yeah, it's also considered something of an antipattern to leave puppet disabled for that long on a host on the prod network
[19:15:22] ryankemper: puppet disabled for too long
[19:16:27] we wanted to discuss this topic specifically yesterday in the SRE-I/F meeting, but john was out and it didn't make much sense to do it without him
[19:17:30] I know your need and the wdqs reloads that take so long, but at the same time having puppet disabled for so long as a normal workflow feels wrong. So, while we discuss the options for that in general, I'd like to know if there are other options on your side too
[19:18:10] that would allow keeping puppet running during wdqs reloads, maybe having a live state saved somewhere like we do for pooled/depooled
[19:45:19] it should be 14 days now, the doc is probably outdated :)
[19:46:48] volans: yeah, I suggested a hiera bit to flip there https://phabricator.wikimedia.org/T267927#6856302
[19:47:24] and John has a patch for better alerting there https://gerrit.wikimedia.org/r/c/operations/puppet/+/666950
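(For context on the auto-removal discussed above: puppetdb expires nodes that have not submitted anything within its configured node-ttl. A rough sketch of spotting such candidates via puppetdb's query API follows; the endpoint URL and the 14-day cutoff are assumptions for illustration, not the production setup.)

    #!/usr/bin/env python3
    """List puppetdb nodes whose last report is older than the assumed node-ttl."""
    from datetime import datetime, timedelta, timezone

    import requests

    PUPPETDB = 'http://localhost:8080'  # assumption: local, unauthenticated query API
    CUTOFF = datetime.now(timezone.utc) - timedelta(days=14)

    for node in requests.get(f'{PUPPETDB}/pdb/query/v4/nodes', timeout=10).json():
        ts = node.get('report_timestamp')
        # Nodes with no report at all, or a report older than the cutoff, are expiry candidates.
        if ts is None or datetime.fromisoformat(ts.replace('Z', '+00:00')) < CUTOFF:
            print(node['certname'])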