[00:07:46] blancadesal: dcaro: I noticed that some tools-k8s-* worker nodes are still running `v1.25.16` (e.g. tools-k8s-worker-nfs-30, tools-k8s-worker-nfs-31). That's not supposed to be the case, right, since the upgrades are supposed to be over? To see this, just log into any tools-k8s-control-* node and run `kubectl get nodes`
[00:13:51] andrewbogott: I want to enable the `toolsbeta-harbor-2` vm that you disabled 🙏. We can work together tomorrow perhaps to disable the alerts it's triggering. The puppet alert especially was a side effect of me deliberately disabling puppet (didn't want puppet runs messing with my configurations). Maybe there is a way to disable it without having alerts firing?
[05:15:50] Raymond_Ndibe you are right! I have upgraded the missing ones now
[07:54:16] did they fail or something on the first run?
[07:59:53] dcaro: it was the ones that I had left over
[08:00:32] Okok
[08:14:40] Raymond_Ndibe: I've manually created silence 8d25256c-d2dd-4bf3-a1e0-f9e18345f87a filtering only by the instance name and with the comment starting with `ACK!` so it self-renews, that should be enough for alertmanager alerts (done in alerts.wikimedia.org)
[08:15:32] slyngs: hi! I've been seeing puppet issues in cloudinfra-idp-1, it seems to be trying to install tomcat10, but failing to do so, is that something that changed lately?
[08:16:35] Yeah, I forgot about it when upgrading the production SSO server.
[08:17:05] We can add profile::idp::tomcat_version: 'tomcat9'
[08:17:14] That should fix it
[08:17:45] let me try
[08:17:52] should we upgrade to tomcat10 though?
[08:18:02] That would also require upgrading the OS
[08:19:14] ack, do you recommend upgrading as soon as possible though, or should we wait?
[08:19:40] Sometime this year I think
[08:19:58] ack
[08:20:01] thanks :)
[08:21:42] CAS 6.6 is almost out of support, and anything newer requires Tomcat 10 and Java 21. So upgrading to Bookworm is just easier
[08:23:51] yep, I'll add a task for it, the hiera value worked :)
[08:24:33] You can thank Andrew, it was originally removed, but we need the old version to do a test install elsewhere :-)
[08:25:11] dcaro: can puppet be re-enabled on tools-prometheus-[67]? just received an alert "Puppet has been disabled for 434846.71449279785 secs"
[08:25:30] not yet, it will reset the config, was that on alertmanager?
[08:26:05] no, email
[08:26:10] I see it's acked in alertmanager
[08:26:43] where is the email coming from?
[08:27:47] aaahh, that's a service unit running on the VM, what's the file it says you have to create to disable it?
[08:28:05] ah yes! "you can create a file under /.no-puppet-checks, that will skip the checks."
[08:28:31] okok, give me a sec
[08:28:35] np
[08:29:02] topranks: I have rescheduled the network sync meeting tomorrow to be a bit earlier because we, the WMCS team, have a conflicting meeting. Let me know if that would work for you. I have something (small) to share in the meeting regarding my plans for VXLAN @ cloud
[08:29:05] slyngs: can you give a quick look at T373840 and add anything I might have missed? (if you have a 'how to upgrade' somewhere that'd be awesome xd)
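For the kubelet-version check mentioned at the top of this log, a one-liner along these lines makes stragglers easy to spot (plain kubectl, nothing cluster-specific assumed):

```shell
# From a tools-k8s-control-* node: list nodes sorted by kubelet version so any
# remaining v1.25.x workers stand out.
kubectl get nodes \
  --sort-by=.status.nodeInfo.kubeletVersion \
  -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
```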
[08:29:06] T373840: [cloudinfra] Upgrade cloudinfra-idp-* to bookworm - https://phabricator.wikimedia.org/T373840
[08:30:20] There's a few new secrets to add, other than that it should just work
[08:30:48] I'll find the secrets that need to be set and add them to the task (the options, not the actual secrets)
[08:31:13] ack thanks, I have never set up those VMs, so I'll have to read up on how to do so
[08:31:50] dhinus: created the files, it should not notify tomorrow
[08:32:17] thanks!
[08:38:25] dcaro: Let me know when we get to upgrading and we'll do it together
[08:39:28] slyngs: awesome :), thanks a lot
[08:39:38] separate puppet alert: puppet is failing on cloudcontrol2006-dev, it's failing to check out the latest commit from git
[08:39:53] from the open-tofu repo
[08:40:11] uh?
[08:40:13] *tofu-infra
[08:40:20] have a look at the logs, I'm not sure what's causing it
[08:40:29] ok
[08:40:31] I think it's the timer that periodically checks the tofu status
[08:41:12] puppet maintains a copy at git_checkout_repos/cloud/cloud-vps/tofu-infra
[08:41:27] it did not check out the latest HEAD of the repo
[08:41:31] (puppet)
[08:41:36] yep, but why?
[08:41:40] let me see
[08:42:00] I'm running puppet agent by hand
[08:43:07] https://www.irccloud.com/pastebin/THadROb7/
[08:43:56] maybe the awk bit is failing?
[08:44:20] I don't remember why that was needed
[08:45:13] aborrero@cloudcontrol2006-dev:/srv/tofu-infra $ sudo git branch -l
[08:45:13] `warning: refname 'remotes/origin/HEAD' is ambiguous.`
[08:45:13] main
[08:45:13] * remotes/origin/HEAD
[08:45:22] there's more than one ref with that name (or partially that name)
[08:45:25] I think this is a mishap caused by the faulty patch from the other day
[08:45:28] in the cookbook
[08:45:35] ah ok
[08:45:39] which patch?
[08:45:43] and maybe https://stackoverflow.com/a/1692926
[08:46:08] it looks like something created a local branch called HEAD that should not exist
[08:46:10] dcaro: the patch that got reverted, I think this is a side effect of that patch
[08:46:12] we probably don't want a branch named `remotes/origin/HEAD`
[08:46:26] arturo: I don't think so, that patch does nothing to the process that updates the git repo
[08:47:03] yes, it does, the patch modified how the branch-management logic is implemented in the cookbook
[08:47:49] wait, are we using the same repository for the cookbook tests as for the cron?
[08:48:01] (as in, the same path under the cloudcontrol)
[08:48:08] yes
[08:48:16] I thought it was creating a temporary repository
[08:48:31] that's a bit troubling
[08:48:43] (and prone to race conditions)
[08:49:01] yes, this is in the list of things to improve
[08:49:07] it does not seem to happen in eqiad
[08:49:13] maybe it hit one of those race conditions
[08:49:26] I would suggest removing the branch with 'remotes/*' in the name
[08:49:41] though I'm not familiar with the update process
[08:49:58] (also guessing that removing the repo will recreate it clean from scratch)
[08:50:20] the tofu cookbook may have run only for codfw1dev at some point, not affecting eqiad1
[08:50:37] but we ran it for eqiad also several times, no?
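For reference, the stray local branch shown in the `git branch -l` paste above can be removed along these lines; this is a sketch of the cleanup described a bit further down, assuming the checkout lives at /srv/tofu-infra as in the paste:

```shell
cd /srv/tofu-infra
sudo git checkout main                    # make sure the bogus branch is not the one checked out
sudo git branch -D 'remotes/origin/HEAD'  # deletes refs/heads/remotes/origin/HEAD, the stray *local* branch
sudo git branch -l                        # should now list only 'main'
```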
[08:52:23] the cookbook failed for me several times yesterday, I can't remember if one of the times it failed it crashed on the codfw1dev step
[08:52:44] anyway, the puppet run is clean now
[08:52:49] the alert will clear itself soon
[08:53:39] I have not seen it crash even once, and I ran it a few dozen times, so it sounds like a race condition to me :/
[08:53:43] what did you do to fix it?
[08:53:59] I reverted the patch
[08:54:39] to fix the current git repo update issue
[08:55:08] ah, I checked out the main branch, deleted the wrong branch, then ran puppet agent
[08:55:37] ack, what I thought, yep
[08:55:50] also restarted opentofu-infra-diff.service, which was also detecting the problem
[09:04:43] yep, I think it was a race condition, the only process that creates a branch without a template name is the git-update, so that's the one that could have created the bad branch, and the fact that yesterday the cookbook failed for you (but not for me) points to the error being dependent on the state (e.g. a race condition), matching with the issue arising not from the code of the cookbook, but from the state of the system (the same code run at different times gives different results)
[09:05:59] yeah, makes sense
[09:06:49] hey
[09:07:05] I think I found a possible explanation for the conntrack mystery from yesterday
[09:07:29] T373816
[09:07:29] T373816: Cloud VPS: investigate conntrack table usage on cloudvirt1050 - https://phabricator.wikimedia.org/T373816
[09:07:57] using this bash command I found the top 20 source IP addresses in the conntrack table
[09:08:01] https://www.irccloud.com/pastebin/UsnFVNDV/
[09:08:18] 172.16.3.44 has 310k entries in the conntrack table
[09:08:28] which is....
[09:08:30] diffscan02.automation-framework.eqiad1.wikimedia.cloud.
[09:10:07] what is diffscan?
[09:11:10] it's running nmap on a bunch of hosts
[09:11:25] yeah, it scans the network to see a diff of open ports and such
[09:12:25] so the high number of connections is expected behaviour?
[09:12:34] why would it be causing trouble now and not before? (iirc it has been running for a long time)
[09:14:00] cloud-vps networking changes?
[09:14:09] the server had a couple of memory problems on 23 August
[09:14:15] https://www.irccloud.com/pastebin/KW70ASFQ/
[09:14:33] uh interesting
[09:15:49] that's the VM?
[09:16:39] that's the hypervisor
[09:16:55] (from dmesg)
[09:16:56] which one?
[09:16:59] 1050
[09:18:29] interesting, yep, nothing reflected in the conntrack graph though
[09:19:15] has the server been rebooted since?
[09:19:28] no
[09:19:33] (we had hardware issues on 1048 last weekend, it stopped responding completely, we rebooted it)
[09:19:40] rebooting it is a good idea
[09:20:41] I'm proposing this patch https://gerrit.wikimedia.org/r/1070209
[09:21:23] should it be a power of 2 of some sort?
[09:22:01] I don't think it matters a lot
[09:22:11] I don't think it matters at all, actually
[09:25:14] it does not say anything in https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.txt , though it sets it to 4*(2^n) by default, probably just using some extra memory if it's not 2^n
[09:26:32] did we see this kind of error in the logs? https://www.alibabacloud.com/help/en/ecs/the-application-on-the-ecs-instance-occasionally-suffers-packet-loss-and-the-kernel-log-contains-the-error-message-kernel-nf-conntrack-table-full-dropping-packet
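A couple of quick checks for the conntrack pressure being discussed here, as a sketch (assumes conntrack-tools is installed; illustrative, not necessarily the exact pastebin command used above):

```shell
# Current usage vs. the configured limits.
sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max net.netfilter.nf_conntrack_buckets

# Top talkers by src= address. Each flow lists a src= for both the original and
# the reply direction, so both endpoints get counted; that is good enough to
# spot a dominant source like the diffscan host.
sudo conntrack -L 2>/dev/null | grep -oE 'src=[0-9.]+' | sort | uniq -c | sort -rn | head -20
```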
[09:26:55] dhinus: no
[09:27:20] don't know how much to trust that page, but it also says "We recommend that you change the values of the nf_conntrack_buckets and nf_conntrack_max parameters together"
[09:36:22] what they say kinda makes sense, though from the kernel help page I pasted, there's a max for the number of buckets that we might be hitting
[09:36:29] `For systems with more than 4GB of memory it will be 65536 buckets.`
[09:36:49] not sure if that's the max for the default value when none is passed, or an overall max though, it's not clear from the help
[09:37:03] https://www.irccloud.com/pastebin/GxLcsmWK/
[09:51:03] it is a hash table, the number of buckets only matters if you want to achieve increased performance or whatever
[09:52:04] I would not worry a lot about this
[09:52:24] on the other hand, the conntrack_max setting exists
[09:52:40] mostly because otherwise it is a remote DoS condition
[09:52:49] i.e. a bad actor could fill your memory
[09:52:57] so the value should be under the admin's control
[09:53:28] if you have plenty of memory, like we have, and the current number is low, like it is, then setting it to a higher value is just fine
[09:54:40] cloudvirt1050 has 500G of RAM
[09:55:01] and other hypervisors are in similar ranges
[09:55:16] it will reduce performance unless paired with increasing the number of buckets, not sure if notably; you seem happy with it, I trust your judgement
[09:58:37] my understanding is that performance will be reduced only if you actually increase the number of connections, i.e. we're close to the limit but we're not hitting it yet, so a higher limit should not have any immediate impact
[09:59:00] and if we go beyond the current limit, worse performance is better than dropping packets (probably)
[10:00:30] Not sure; even with the same number of connections, the size of the buckets is increased, so hitting a bucket still means doing a linear search inside a bigger bucket
[10:05:08] that's true, it could be less optimized... at the same time, the kernel docs don't say anything about the bucket/conntrack ratio, so it's hard to know what is the ideal setting
[10:05:36] they set the limit to 4*num_buckets by default, I guess that's a hint?
[10:05:54] (could be clearer xd)
[10:05:56] yep, but not clear if it scales :)
[10:07:03] probably depends a lot on the usage you make of it: whether the hash table is usually empty, usually full, or jumping in between...
[10:07:12] yeah, too many variables
[10:07:33] I think arturo's patch is fine, we can revisit it in case we see any issues
[10:08:00] of course, it was +1'd within a couple of minutes of being sent
[10:08:56] this has some more info, but not really clarifying :D https://www.emqx.com/en/blog/emqx-performance-tuning-linux-conntrack-and-mqtt-connections
[10:09:13] "The calculation rule for the default value of nf_conntrack_buckets has changed several times with the iteration of the kernel version"
[10:12:45] so I guess we can keep the same 4x ratio between buckets and the max limit, if that makes you all feel more confident
[10:12:46] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070209
[10:14:25] I'm fine with both options really, I don't have enough information to say which one is better, and my feeling is that both will work fine
[10:14:26] LGTM too, that will require a restart though
[10:14:36] no?
[10:14:44] (for what I understood)
[10:14:45] no
[10:15:26] I don't think we need to reboot the servers
[10:15:39] "# This option cannot be modified during runtime if the kernel version is not 4.19."
[10:15:45] which kernel do we have?
[10:15:54] but rebooting them may be a good idea anyway, if anything, just to check that the sysctl puppet settings are correct
[10:16:26] dhinus: 6.1
[10:17:52] I just noticed that with this bump, the number of conntracks in cloudgw and neutron is now _lower_ than in the cloudvirts
[10:17:53] see modules/profile/manifests/wmcs/cloudgw.pp
[10:18:10] so we may want to increase that as well
[10:18:31] makes sense
[10:19:28] what's the relationship between the two? does any conntrack entry in any cloudvirt end up as a cloudgw entry?
[10:19:39] (curious)
[10:19:46] only if it's ingress/egress traffic for our cloud network
[10:19:58] HV <-> HV doesn't go to cloudgw/neutron
[10:20:06] ack
[10:20:09] but north/south traffic does
[10:21:06] "sudo sysctl net.netfilter.nf_conntrack_buckets" shows the current value, so we can use it to double-check that the change has been applied without rebooting
[10:21:22] if you try to change it, does sysctl fail?
[10:21:31] I haven't tried
[10:21:40] (might make puppet fail)
[10:22:20] yeah, it can be modified via sysctl (and puppet)
[10:22:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070221
[10:23:11] ha, it gets rounded up!
[10:23:27] to the next 512 apparently
[10:23:44] so maybe that's why we need to use a power of 2, otherwise puppet will try to change it again
[10:23:47] net.netfilter.nf_conntrack_buckets = 1250304
[10:23:55] ?
[10:24:00] let me try
[10:24:05] I tried a smaller number
[10:24:08] and it was rounded up
[10:24:57] where is 1250304 coming from?
[10:25:30] * dcaro lunch
[10:25:32] cya in a bit
[10:26:22] arturo: you can play with "sudo sysctl -w net.netfilter.nf_conntrack_buckets=XXX" and see which value you get back with "sudo sysctl net.netfilter.nf_conntrack_buckets"
[10:26:30] same for net.netfilter.nf_conntrack_max
[10:26:44] dhinus: I think it is the kernel that adjusts the value
[10:27:18] yep, but we need to choose a value that does not get adjusted, to avoid puppet constantly changing it
[10:27:27] yeah
[10:27:43] well, puppet only deploys the file to the filesystem, but yes, it is not elegant
[10:28:13] ah ok, but yes, let's match the final value anyway
[10:39:43] ok, I think I got the right values now
[10:40:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070221
[10:41:02] using a power of 2, as David suggested, the kernel doesn't mangle the number
[10:53:21] if you are curious, this `roundup()` seems to be responsible for mangling the value
[10:53:22] https://elixir.bootlin.com/linux/v6.10.7/source/net/netfilter/nf_conntrack_core.c#L2563
[10:54:19] rounding to a multiple of `PAGE_SIZE / sizeof(struct hlist_nulls_head)`, which may only coincidentally be a power of 2
[10:58:13] * arturo goes into a rabbit hole to learn about memory page size
[11:01:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070229
[11:32:36] Could I get opinions on https://toolsadmin.wikimedia.org/tools/membership/status/1784 ? The account is older, with plenty of edits, though it has a block on it. Should we allow it?
[11:49:04] rook: hmm... our guidelines say "Check the status of the associated SUL account. If the user is banned on one or more wikis → mark as Declined."
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Users_and_community
[11:49:27] BUT as you say it's an older account with a lot of activity
[11:49:40] the account is blocked from editing a single page on meta
[11:49:48] Oh neat, we have guidelines!
[11:49:55] they're a translationadmin on wikimedia.org
[11:50:08] the ban is temporary
[11:50:21] I'd be inclined to make an exception in this case
[11:50:44] Makes sense, I'll allow them
[11:51:22] I would allow it too. I don't think the user is "banned" on wiki
[11:51:43] see also https://meta.wikimedia.org/wiki/Steward_requests/Global for the context (search for the user name and you will see what happened)
[11:52:27] wrong link pasted, this is the right one: https://meta.wikimedia.org/w/index.php?title=Steward_requests/Global&oldid=27289651#Global_block_for_2.49.95.133
[13:08:57] these are all the 'neutron'-related stats the current openstack exporter shows:
[13:09:03] https://www.irccloud.com/pastebin/uBfiNaMg/
[13:09:24] it's missing openstack_neutron_agent_state
[13:09:35] arturo: I don't immediately know what that is but let me dig through history a bit
[13:12:37] arturo: ok, I think this is a side effect of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1066809
[13:13:22] The question is if I can make those stats visible via a policy change or if it specifically requires the admin endpoint
[13:14:54] hm, if there are permission errors they aren't in the neutron logs
[13:17:16] maybe it's checking if it has capabilities before trying
[13:20:53] heh, I'm searching the logs for '403' and found a whole lot of requests that take 403 ms to fulfill
[13:20:55] not helpful!
[13:21:04] xd
[13:21:27] hahahaha
[13:39:02] arturo: I'm stuck in the middle of another thing. I'll try to get back to the neutron metric thing but you can also make me a ticket :)
[13:51:15] andrewbogott: I don't think this is urgent for me
[13:51:25] that's good :)
[13:52:24] wait, I misread
[13:52:42] yes, this is interesting and should be fixed, let me open a ticket
[13:55:14] andrewbogott: T373878
[13:55:15] T373878: openstack: fix missing prometheus metrics - https://phabricator.wikimedia.org/T373878
[13:55:30] thanks!
[13:58:32] quick review: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1070258
[14:13:06] merging
[14:58:35] I'm looking to set up a new tofu-infra test project without a `-` in the name, to allow for testing object stores. I've got it mostly set up, though it doesn't seem to trigger alerts. In the previous version a /var/lib/prometheus/node.d/tofu-apply.prom file was created and that would bring up an alert on prometheus-alerts.wmcloud.org; that doesn't seem to be working, what am I missing?
[14:59:29] Rook: the `-` in project names thing is resolved in new projects (which now use a uuid for their ID)
[14:59:41] Neat!
[14:59:42] so it's only old projects that have a broken object store
[14:59:55] But, ok, now I have read the rest of your question and don't know the answer :)
[14:59:56] Oh, so I'm in the same position then
[15:01:07] Though I guess I can name it something other than tofuinfratest. tofu-infra-test, testing-with-tofu, soy-tofu-test
[15:01:24] tofuinfratest seems fine
[15:01:59] I imagine that there's a setting someplace in a config that says "please monitor this project" but I'd have to grep extensively to find it. dcaro probably knows offhand
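As background on the tofu-apply.prom mechanism mentioned above: the node exporter's textfile collector picks up `*.prom` files from that directory, and an alert rule then fires on the exported metric. A minimal sketch of what such a file looks like; the metric name and value here are made up for illustration, not the real ones used by the tofu-infra job:

```shell
# Hypothetical example of a textfile-collector metric a tofu apply job could write.
cat > /var/lib/prometheus/node.d/tofu-apply.prom <<'EOF'
# HELP tofu_apply_last_run_status 0 = ok, 1 = failed (hypothetical metric name)
# TYPE tofu_apply_last_run_status gauge
tofu_apply_last_run_status 1
EOF
```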
[15:04:41] * dcaro reading
[15:05:18] that's the metricsinfra setup, it has to be added in the DB
[15:05:29] https://wikitech.wikimedia.org/wiki/Metricsinfra
[15:05:38] I can give it a look if nobody wants to give it a go
[15:05:48] oh right, messier than I remembered :/
[15:09:01] yes, I can see the alerts for tf-infra-test there
[15:09:41] tl;dr "ssh metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud", then "sudo -i mariadb", then "select * from alerts"
[15:10:19] then ugly SQL like "update alerts set ... where ..."
[15:22:57] I think you'll also have to update the projects to flag it as owned by wmcs and such
[15:23:07] otherwise it does not add the `team=wmcs` label
[15:29:11] How does one identify a project's id?
[15:30:35] it's in the project table iirc (manually assigned, internal to the metricsinfra application)
[15:31:29] There it is. Thank you!
[16:20:12] * arturo offline
[16:20:54] hmm. kyverno got upgraded on tools, but helm flagged the install as failed
[16:21:02] tests are passing ok
[16:21:15] and the pods are running the newer version :/
[16:21:42] a redeploy does not seem to clear the flag (and it works ok)
[16:44:55] * dhinus offline
[16:44:57] exit
[16:45:04] LOL :D
[16:45:12] I'm tired
[17:30:13] * dcaro off too
[17:30:17] cya tomorrow
[21:48:35] bd808: when you have a moment... I'm having a fight with Apache config. Horizon is forwarding to CAS properly now, but because horizon is behind envoy (which terminates SSL), CAS thinks that the referring url is http://horizon rather than https://horizon. Which means when it bounces things back to horizon it does so without https...
[21:48:52] I assume there's some way to tell a vhost "you might think your url is http but really it was https"?
[21:50:07] this is with mod_auth_openidc
[21:54:51] An X-Forwarded-Proto header is maybe what you are looking for?
[21:57:14] like "RequestHeader set X-Forwarded-Proto "https"" ?
[21:57:19] that's what I'm trying now, no luck so far
[21:57:48] Which component in the stack is sad about not seeing TLS?
[21:57:54] if you go to labtesthorizon.wikimedia.org you can see the very bad behavior
[21:58:12] I believe it is the browser doing a post to
[21:58:47] not sure I'm making sense...
[21:59:01] It is of course ultimately posted to https. But in the meantime the browser is alarmed.
[22:00:18] In the CAS logs I see 'pac4jRequestedUrl=http://cloudidp-dev.wikimedia.org/oidc/.well-known/oidcAuthorize?'
[22:01:56] hm, now I'm less certain about where that is coming from
[22:09:38] I see "https://openstack.codfw1dev.wikimediacloud.org:25000/v3/auth/OS-FEDERATION/websso/openid?origin=http://labtesthorizon.wikimedia.org/auth/websso/" -- that origin argument is the problem then?
[22:10:08] yes, I think so! That is certainly what it complains about loading a moment later
[22:12:57] My config currently has
[22:13:00] https://www.irccloud.com/pastebin/3RK9mv2Y/
[22:13:10] Which fixed many issues similar to this one, but not this one
[22:36:03] I need to step away but feel free to mess with /etc/apache2/sites-enabled/50-proxy-wsgi-keystone.conf on cloudcontrol2004-dev.codfw.wmnet, it's the only apache in play at the moment. (And also feel free to ignore this silly problem!)
[23:33:31] a.ndrewbogott: I'm running around in circles looking in puppet for a known working OIDC integration. The netbox one unfortunately is in Django rather than external Apache2 auth like you are working on. Something definitely is weird with the various non-TLS "http" urls that show up when trying to use the flow.
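For reference, when an Apache vhost sits behind a TLS-terminating proxy like envoy, the knobs that usually matter look roughly like this. It is a sketch under the assumption that the proxy forwards plain HTTP to the vhost; it is not a confirmed fix for this particular keystone/horizon setup, and the OIDCXForwardedHeaders directive only exists in newer mod_auth_openidc releases:

```apache
<VirtualHost *:80>
    # Build self-referential URLs as https even though this vhost only sees
    # plain HTTP from the proxy (replace with the vhost's real public name).
    ServerName https://service.example.org
    UseCanonicalName On

    # Pass the original scheme on to the WSGI app / IdP (needs mod_headers).
    RequestHeader set X-Forwarded-Proto "https"

    # Newer mod_auth_openidc releases can honour forwarded headers when
    # reconstructing the request URL; check the installed version first.
    OIDCXForwardedHeaders X-Forwarded-Proto
</VirtualHost>
```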
[23:51:54] grrr... the OIDC for superset is all Python code too. (Flask this time instead of Django)
[23:53:08] Yeah, I think even though mod_auth_openidc is the 'obvious' way to do it, no one else does it that way
[23:53:36] * andrewbogott is back but has no new ideas
[23:54:29] you're seeing various http urls, not just that one origin?
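One more avenue worth checking, since the bad `origin=http://...` parameter looks like it is generated on the Horizon (Django) side: Django only trusts a proxy's scheme if it is told to. A sketch of the standard settings, with the assumption that Horizon's local_settings.py is the right place for them in this deployment:

```python
# Trust the X-Forwarded-Proto header set by the TLS-terminating proxy (envoy,
# or the RequestHeader directive discussed above); without this Django builds
# self-referential URLs with the scheme it actually saw, i.e. plain http.
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")

# Build absolute URLs from the Host/X-Forwarded-Host header the proxy passes along.
USE_X_FORWARDED_HOST = True
```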