[07:05:31] hello on-callers! Following traffic's advice, I disabled puppet on some hosts (sudo cumin 'R:acme_chief::cert' "disable-puppet 'acmechief maintenance - ${USER}'")
[07:05:54] as a prep step to restart acmechief for the py3.11 upgrades
[07:09:16] turns out it wasn't needed since acmechief was restarted during the prev couple of days, nevermind :D
[09:48:15] hmm, somebody working on the generalized puppet errors?
[09:50:13] what do you mean by generalized errors? strict mode? if so, Jesse is working on that in https://phabricator.wikimedia.org/T372664
[09:50:43] or do you mean an elevated amount of errors? if so, these are caused by the puppetserver restarts for the new JRE
[09:51:18] puppetserver.service needs an immediate restart after upgrading the JRE and the restart takes 30 seconds or so
[09:51:34] it seems like the latter :)
[09:51:39] probably that, I'm checking some errors on durum2001 that can't connect to puppetserver
[09:51:41] `Failed to open TCP connection to puppetserver2003.codfw.wmnet:8140 (Connection refused - connect(2) for 10.192.14.6:8140)`
[09:52:00] so it should resolve as soon as the JRE comes up
[09:52:14] these should be entirely temporary, just re-run puppet
[09:52:23] tnx!
[10:00:19] fabfur: there's still some unrelated error on durum1002, though: wmflib::hosts2ips fails somewhere in the bird module
[10:00:47] <_joe_> the JVM, like all carburetor engines, takes some time to get up to the task
[10:01:13] moritzm: same as on the other durum host
[10:02:16] thought it was related to the puppetserver issue
[10:03:55] the connection error on 8140 was, but this one seems different
[10:04:50] topranks: ^ I think your "Use global unicast to peer from cephosd but allow LL for BFD in" commit caused this
[10:05:37] the addition of fe80::/10 breaks the type validation of the srange parameter of firewall::service
[10:05:48] moritzm: ok
[10:05:52] sry.....
[10:06:07] just on durum (at least, I've received alerts just for durum hosts)
[10:06:33] yeah, when I merged I ran puppet on the dns hosts manually to make sure, as obviously they are sensitive
[10:07:07] durum is on nftables though
[10:07:47] cloudlb seems also affected
[10:07:51] that must be it
[10:07:57] same puppet error
[10:08:13] let me revert for now I guess, it'll only break bfd on the cephosd nodes which is probably fine
[10:08:18] cc btullis
[10:08:28] and cloudlb is also on nftables
[10:09:11] jayme: we are seeing some errors on purged @ codfw struggling to reach the kafka-main cluster, is it related to that kafka mirrormaker alert?
[10:10:58] and it seems to be spreading to other PoPs that use kafka-main@codfw, like ulsfo
[10:13:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071833
[10:14:17] topranks: I would say let's merge, and I have a `run-puppet-agent` ready on a durum host!
[10:15:16] fabfur: +1
[10:15:20] it is merged now if you want to go
[10:15:29] running...
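(Aside, for context on the breakage above: a value like fe80::/10 is a network prefix, not a resolvable host, so any helper that validates a list of hostnames/IPs it intends to resolve will reject it. Below is a rough Python sketch of that distinction, not the actual Puppet code; en.wikipedia.org is just a publicly resolvable placeholder.)

```python
import ipaddress
import socket

def split_srange(entries):
    """Split a mixed source-range list into CIDR prefixes and hostnames.

    A prefix like 'fe80::/10' is not a resolvable host, so a helper that
    expects only hostnames/IPs to resolve will reject it; prefixes can be
    passed straight through to a firewall rule instead.
    """
    prefixes, hosts = [], []
    for entry in entries:
        try:
            ipaddress.ip_network(entry, strict=False)
            prefixes.append(entry)
        except ValueError:
            hosts.append(entry)
    return prefixes, hosts

def resolve(hosts):
    """Resolve hostnames to IP addresses (roughly the job of a hosts-to-IPs helper)."""
    ips = set()
    for host in hosts:
        for info in socket.getaddrinfo(host, None):
            ips.add(info[4][0])
    return sorted(ips)

# 'en.wikipedia.org' stands in for an internal hostname here.
prefixes, hosts = split_srange(["en.wikipedia.org", "fe80::/10"])
print(prefixes, resolve(hosts))
```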
[10:16:07] confirmed: the error is gone
[10:16:35] ok
[10:16:41] I'm doing the cloudlbs now
[10:17:00] after which I'll work out how to do that so it works with either nftables or ferm
[10:21:24] in the firewall::service wrapper we simply copy the unmodified srange value down to ferm (no type validation at all) and let Ferm's perl do its magic
[10:22:18] if nftables is configured, the src/dst IPs are generated via wmflib::hosts2ips and the latter only accepts an array of Stdlib::Host
[10:23:46] given that we don't need to resolve fe80::/10 anyway (resolving being one of the main perks of hosts2ips), we could also simply add a separate rule for fe80::/10 in the nftables provider case
[10:23:56] vgutierrez: sorry, I did not notice the ping here for some reason
[10:24:12] no problem :)
[10:25:00] for posterity: this is probably because kafka is under load due to the resync of a broker (replaced hardware)
[10:25:20] I think it will take another 2-3 hours for it to catch up
[11:00:14] we're seeing some php-fpm saturation and eventgate errors starting from the last mw deployment
[11:00:45] with a big increase in wikibase-addUsagesForPage messages being enqueued in kafka (about 2x the average)
[11:01:37] we don't think it's directly related to the code that was deployed, but if eventgate is being slow because of the kafka maintenance and it had to recreate all its connections, it may explain some of it
[11:07:02] jayme: maybe we should shift the partitions that have 2006 as leader to another broker?
[11:07:50] I don't see why 2006 is so slow really... not sure if that would help
[11:08:05] brouberol: do you have a "minute" by chance?
[11:09:41] we're only seeing the spikes in codfw, right?
[11:09:46] I mean, it's a bit more loaded than usual - but it in no way looks awful
[11:10:07] it's well after the p99 jumps, but we just had a huge spike of refreshLinks jobs in codfw
[11:10:28] starting at 10:36, peaking at 10:49
[11:10:38] but I guess that'd be api-int and not mw-web
[11:11:24] eventgate errors seem to have plunged to 0
[11:11:40] mw-web p99 and saturation are back down to normal also :)
[11:11:41] fpm saturation has resolved as well
[11:11:49] wth...
[11:12:22] the produce rount trip times have gone back to normal as well
[11:12:26] s/rount/round/
[11:12:48] <_joe_> those are connected
[11:13:08] <_joe_> we often send events and jobs to eventgate-main from php-fpm workers
[11:13:48] yeah... the question is more like why did rtt to kafka-main2006 jump to like 30s
[11:14:53] we're seeing the same request choppiness at the envoy level as we have in previous times where jobs were failing to enqueue: https://grafana.wikimedia.org/goto/v_71x06Sg?orgId=1
[11:15:35] network tx on kafka-main2006 dropped from 20-ish to 10-ish MB/s together with the rtt drop
[11:16:58] and one partition finished replication during that time
[11:31:43] claime: fpm saturation will rise again I guess
[11:31:57] now it's kafka-main2005 being slow
[11:32:15] yep, it spiked
[11:32:29] fixed some graphs on the eventgate dashboard so that they now make some sense
[11:32:32] but it looks like just a spike
[11:32:35] namely the p99 by source
[11:32:59] indeed, seems like a quick one
[11:40:26] with eventgate-main regularly having a p99 of well below 500ms, it seems pretty strange that we configure it with a 20s timeout in the service mesh (plus two retries, making it 60s)
[11:42:21] _joe_: what would you say to cutting those short to protect fpm workers, at the cost of potentially losing events?
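(For context on the 60s figure mentioned at 11:40: assuming the 20s timeout applies to each try, as that message implies, the original attempt plus two retries gives the worst case. A trivial sketch of the arithmetic, not the actual mesh config.)

```python
def worst_case_block_s(per_try_timeout_s, retries):
    """Upper bound on how long one synchronous caller (e.g. a php-fpm worker)
    can sit waiting if the original attempt and every retry each run all the
    way to the per-try timeout."""
    return per_try_timeout_s * (1 + retries)

# Values from the discussion above: 20s per try, two retries.
print(worst_case_block_s(20.0, 2))  # 60.0 seconds, vs. a normal p99 well under 0.5s
```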
[11:42:50] <_joe_> jayme: absolutely not
[11:42:57] :D
[11:43:26] <_joe_> as in, it might be shortened a bit but if we lose events it's a big deal
[11:43:33] so you're saying this was an active decision?
[11:43:40] <_joe_> yes
[11:43:44] ok
[11:44:10] <_joe_> all timeouts that were in the first installment of the service mesh config were intentional. I can't guarantee there weren't copypasta since :)
[11:44:26] <_joe_> but eventgate-main was in that initial version
[11:44:34] yeah, values are from 2021
[11:44:53] <_joe_> I'm called to lunch, it seems we're ok, are we?
[11:45:05] yesyes, no action required rn
[12:02:17] the replaced kafka broker is in sync now
[12:23:05] sorry, I was afk for a while
[12:23:09] catching up
[14:01:59] I got a permission error when running puppet-merge from puppetmaster1001
[14:02:01] PermissionError: [Errno 13] Permission denied: '/srv/config-master/puppet-sha1.txt'
[14:02:19] it happened for both puppetserver1001 and 1002
[14:02:28] (the other puppetservers deployed ok)
[14:03:19] puppet-merge on master or server? merge is still on puppetmaster [fwiw it worked for me like 10 mins ago]
[14:03:34] I ran puppet-merge from puppetmaster1001
[14:03:43] so it tried deploying on each of the puppetservers
[14:03:49] and failed on puppetserver1001 and 1002
[14:04:06] `dcaro@puppetmaster1001:~$ sudo puppet-merge`
[14:04:28] elukey: ^
[14:05:03] context being https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071834 which I suppose is related
[14:05:42] if it isn't I'm wrong of course, one hell of a coincidence
[14:08:51] here I am, sorry
[14:09:47] yes yes it is definitely my fault
[14:10:45] I am not 100% sure why but lemme check
[14:11:16] ack, let me know if I can help
[14:13:24] okok, so puppet-merge.py creates two files at the end, containing the sha1s of the last commit for operations/puppet and labs-private
[14:13:56] it creates them if the /srv/config-master path is available from what I can see, something that I've created recently
[14:14:26] ahhh ok snap, of course puppet-merge.py is run by gitpuppet on the puppetservers
[14:14:35] hence the permission denied
[14:14:41] xd
[14:17:13] if there is anything super urgent I can revert, ping me in case
[14:17:44] dcaro: is it ok for you to wait some mins for the fix?
[14:18:01] I have no issue waiting, yep
[14:18:06] no rush
[14:21:57] the fix should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071884
[14:22:47] moritzm: if you have a min --^
[14:23:08] I hoped for a less impactful start of the changes to move puppet-merge to puppetserver1001, sorry folks :)
[14:23:10] +1 it elukey
[14:23:14] jhathaway: <3
[14:23:42] I'll explain what I am doing in a bit, after fixing this
[14:23:51] +1d
[14:25:00] thanks!
[14:26:33] nice, is there anything I have to do to fix the failed run? (rerun puppet-merge maybe?)
[14:26:34] ok, should be unblocked
[14:26:59] dcaro: in theory no, lemme check
[14:29:34] dcaro: was your merged change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071870 ?
[14:29:50] yes, that's the one
[14:30:59] I see it present on puppetserver1001 and 1002, we should be ok.. they failed at the last step of puppet-merge.py, basically when the rest was done
[14:31:04] sorry for the trouble
[14:31:30] np, stuff happens (:, thanks for the quick fix!
[14:31:53] indeed! thanks elukey
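(A rough sketch of the step that failed, not the actual puppet-merge.py: after the merge and deploy have succeeded, the script publishes the new HEAD sha1s under /srv/config-master, so a user without write access there, like gitpuppet on the puppetservers, only hits the permission error at that final step. The repo path below is an illustrative assumption.)

```python
import os
import subprocess

CONFIG_MASTER_DIR = "/srv/config-master"            # path from the discussion above
PUPPET_REPO = "/var/lib/git/operations/puppet"      # hypothetical repo location

def publish_head_sha1(repo_dir, outfile):
    """Write the repo's current HEAD sha1 to a file that httpd can serve."""
    sha1 = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    with open(os.path.join(CONFIG_MASTER_DIR, outfile), "w") as f:
        f.write(sha1 + "\n")

try:
    publish_head_sha1(PUPPET_REPO, "puppet-sha1.txt")
except PermissionError as exc:
    # This is the failure mode seen above: the merge itself is already done,
    # only the final "publish the sha1" step is denied when the invoking user
    # cannot write under /srv/config-master.
    print(f"merge ok, publishing sha1 failed: {exc}")
```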
[14:33:11] more detailed explanation: on puppetmaster1001, puppet-merge.py at the end creates /srv/config-master/{labsprivate-sha1.txt,puppet-sha1.txt} and they are immediately exposed via httpd
[14:33:40] the config-master nodes (rendering https://config-master.wikimedia.org/) use httpd/mod-proxy to fetch those sha1s when rendering
[14:33:46] https://config-master.wikimedia.org/puppet-sha1.txt
[14:33:53] (and the labsprivate one)
[14:34:18] I am adding the same config to the puppetserver nodes, so that when the time comes the same data can be exposed
[14:34:33] (it is used by git-sync-upstream in cloud etc.. to keep the repos in sync)
[14:35:46] it was all good except for that perm issue :D
[14:36:29] elukey@config-master1001:~$ curl https://puppetserver1001.eqiad.wmnet/puppet-sha1.txt
[14:36:32] 7d3c4dad92c0d59b87e2f152cf996b0d8a3dd98a
[14:37:38] the only issue left to fix before rolling it out everywhere is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071881
[14:37:42] that is for envoyproxy
[14:37:55] (otherwise the puppet run on puppetserver1001 fails at first)
[14:38:02] if anybody has time, I'd be very grateful :)
[14:39:31] +1d elukey
[14:39:46] <3
[14:46:28] the change worked fine on puppetserver1003, completing the rollout :)
[14:57:54] nice!
[15:09:25] going afk, everything seems stable, ping me if anything looks weird :)
[15:12:27] jayme: is codfw kafka still misbehaving? wcqs/wdqs MaxLag alerts have been flapping for the last hr or so
[15:13:46] inflatador: no, it should be fine. Although the mirrormaker has not caught up yet
[15:18:51] jayme: ACK, thanks for the update... guessing this will take care of itself once mirrormaker clears through its backlog
[20:34:35] when a Golang app tries to make an HTTP request but then errors with "context deadline exceeded"... I guess it means the network was down
[20:34:56] Friday we had switch maintenance, right?
[20:47:02] mutante: in Golang yes, the overall timeout for the operation expired is what that means
[20:47:10] so "network down" is one way that message can manifest
[20:50:11] cdanis: thank you. yes, I am so close to having an explanation, except the switch maintenance tickets don't mention the involved hosts
[20:50:52] well, "network blip"
[20:51:45] I would look at metrics for TCP retransmissions, on most clusters they're usually quite low
[20:51:51] and if you have an idea of the time window, you should be able to go from there to the hosts involved
[20:52:14] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?orgId=1&viewPanel=2 this panel aggregates by cluster but you can ofc edit the query and break down differently
[20:53:03] thanks!
[21:35:51] there was no network maintenance Friday - last Wednesday and Thursday, yes
[21:37:03] the hosts are all listed in the linked gsheet from the tasks - if you can't access it just ping me, I can share them some other way
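(On the "context deadline exceeded" exchange above: in Go it means the request's overall deadline expired before the operation finished, whatever the underlying cause. A minimal Python analogue of one overall deadline covering DNS, connect, write and read; the hostname and the 5s deadline are placeholders.)

```python
import asyncio

async def http_head(host, port=80):
    # Plain-TCP HTTP/1.1 HEAD request; every await here can block on the network.
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
    await writer.drain()
    data = await reader.read()
    writer.close()
    await writer.wait_closed()
    return data

async def main():
    try:
        # One overall deadline for DNS + connect + write + read, similar to a Go
        # context with a deadline; Go reports the expiry as
        # "context deadline exceeded", asyncio raises a TimeoutError instead.
        response = await asyncio.wait_for(http_head("en.wikipedia.org"), timeout=5.0)
        print(response.split(b"\r\n", 1)[0].decode())
    except asyncio.TimeoutError:
        print("deadline exceeded: slow path, blackholed traffic, or an unreachable peer")

asyncio.run(main())
```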