[01:54:26] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:54:26] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:49:50] Morning. Would someone be able to help us with allocating some IP ranges for the new dse-k8s-codfw cluster, please? T400037
[09:49:51] T400037: Determine dse-k8s-codfw Kubernetes IP ranges - https://phabricator.wikimedia.org/T400037
[09:54:26] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:54:28] Ideally, for IPv4, we would like a `/20` (services) and a `/21` (pods) in here https://netbox.wikimedia.org/ipam/prefixes/379/prefixes/ to match the values for dse-k8s-eqiad.
[09:55:50] But I wouldn’t like to make any changes in netbox without suitable consultation.
[10:09:41] Hey Ben! I think it should be a matter of allocating a /18, right?
[10:09:48] to then be split again
[10:10:20] elukey: Yes, I think so. That’s what ml-serve and aux are doing. I just wanted to check before doing anything.
[10:10:58] yep yep, wikikube has a bigger one but it makes sense
[10:12:13] I think that in general we should try to avoid wasting addresses as much as possible, but you can definitely make a reservation and then ask Cathal or Arzhel for a review. Worst case we’ll delete it and re-create another one :)
[10:12:45] I can review it as well if you want, but compared to them I have less than zero authority :D
[10:12:57] I guess it comes down to what counts as “waste”
[10:13:32] the biggest risk in my view is making allocations that are too small and having to renumber down the road
[10:14:08] Yeah, originally the large (/20) range for service addresses was because we thought that knative-serving might be deployed to these clusters: https://phabricator.wikimedia.org/T310169#7992185
[10:14:10] 10/8 is large; I glanced at this and it seems to make sense, I’ll take a detailed look shortly
[10:14:28] That hasn’t happened yet, but it could still theoretically happen.
[10:14:42] topranks: Many thanks.
[10:19:09] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:20:42] I think nowadays we do /20 and /21 to guarantee future growth without renumbering
[10:21:07] I wouldn’t go with anything less
[10:21:34] Ack, thanks.
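For reference, the /18-plus-split being discussed above works out as follows. This is a quick sketch using Python's `ipaddress` module; the `10.192.64.0/18` supernet is a placeholder, not the actual Netbox allocation, which would be reserved under the prefix container linked in the chat.

```python
import ipaddress

# Hypothetical /18 reservation -- the real supernet would be allocated in Netbox.
supernet = ipaddress.ip_network("10.192.64.0/18")

# Split the /18: the first /20 for service IPs, the next free /21 for pod IPs.
twenties = list(supernet.subnets(new_prefix=20))   # four /20s fit in a /18
services = twenties[0]                             # 10.192.64.0/20
pods = next(twenties[1].subnets(new_prefix=21))    # 10.192.80.0/21

print(services, services.num_addresses)   # 10.192.64.0/20 4096
print(pods, pods.num_addresses)           # 10.192.80.0/21 2048
```

A /18 leaves room for a further /21 (and smaller blocks) on top of the /20 + /21 pair, which is the "future growth without renumbering" point made above.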
[10:26:19] in other news, kartotherian and tegola (maps) are running on the maps-test2* cluster in codfw, all Bookworm-based
[10:26:29] still not getting live traffic, but fingers crossed
[10:39:13] FIRING: [7x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:44:09] FIRING: [7x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:48:02] elukey: am I correct in thinking the default pod IP range per host is a /26? So a /21 pod IP allocation allows for up to 32 hosts in the cluster?
[10:49:09] FIRING: [7x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:59:09] FIRING: [5x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:04:09] RESOLVED: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:08:10] CAS-SSO, Infrastructure-Foundations, Phabricator: Phabricator should use IDP for developer account logins - https://phabricator.wikimedia.org/T377061#11019974 (Aklapper) Tyler mentioned CAS for Spiderpig implementation; maybe parts are reusable? https://gitlab.wikimedia.org/repos/releng/scap/-/blob/m...
[11:28:01] topranks: o/ We moved to /20 and /21 in all clusters except Wikikube to allow for growth; IIRC with a /21 pod IP allocation we are able to spin up ~2000 pods. What do you mean by 32 hosts in the cluster?
[11:28:30] I mean each host announces a fixed /26 subnet to the network, right?
[11:29:08] like even if it only has one pod on it, the host will still use a /26?
[11:29:56] good question, I don’t recall
[11:30:08] you mean calico doing BGP with the ToR or the routers?
[11:30:26] yeah I think that is how it works, so you do get 2048 IPs for pods, but probably the more meaningful scaling factor is that it gives you 32 x /26 networks, so that’s the maximum number of hosts
[11:30:46] elukey: yep, what calico announces to the routers
[11:33:15] okok, this bit wasn’t clear to me
[13:48:44] netops, Infrastructure-Foundations, SRE: BGP: Support receipt of graceful-shutdown community and set local-pref - https://phabricator.wikimedia.org/T399931#11020341 (cmooney) Open→Resolved a: cmooney
[15:45:52] elukey: which ml nodes can I use to look at the nvme uefi issue?
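To make the scaling math above concrete: assuming Calico's default /26 IPAM block size and roughly one block per node (a node can claim extra blocks once its first one fills up, so the block count is an upper bound on node count rather than a hard limit), a /21 pod range yields 32 blocks and 2048 pod addresses. A minimal sketch, again with a placeholder prefix:

```python
import ipaddress

# Assumes Calico's default /26 IPAM block size and roughly one block per node.
pod_range = ipaddress.ip_network("10.192.80.0/21")   # placeholder pod prefix
block_prefix = 26

blocks = 2 ** (block_prefix - pod_range.prefixlen)   # 2**(26 - 21) = 32 /26 blocks
pods_per_block = 2 ** (32 - block_prefix)            # 64 addresses per /26
print(blocks, blocks * pods_per_block)               # 32 2048
```

This matches the figures in the chat: ~2000 pod IPs total, with 32 × /26 blocks being the more meaningful ceiling on cluster size under the one-block-per-host assumption.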
[15:51:12] jhathaway: ml-serve1012 is currently in d-i
[15:51:26] all details here https://phabricator.wikimedia.org/T393948
[15:58:58] thanks
[16:05:05] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:02:12] Wiki Education has been getting a number of reports from new users who don’t receive the newly required confirmation emails from auth.wikimedia.org. This seems to be primarily an issue with university email systems (so using a personal email has been a successful workaround so far), but we’ve had reports from at least three different universities.
[17:03:10] as the fall semester starts, we’re likely to be fielding a lot more emails about new users who can’t confirm their emails, if this is a widespread problem.
[17:03:57] any advice welcome.
[17:10:05] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:39:53] Mail, Infrastructure-Foundations, SRE, SRE-Access-Requests: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11021555 (nisrael) @Aklapper my apologies! I will make a note to myself to do this for future tasks!
[19:12:33] CAS-SSO, cloud-services-team, Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#11021861 (Arendpieter)
[21:58:09] I can’t seem to find `systemd-standalone-tmpfiles` in our Bullseye repo anymore, does anyone know what happened? I’ve been building a docker image that needs the package, ref https://gitlab.wikimedia.org/repos/data-engineering/opensearch/-/blob/main/blubber.yaml?ref_type=heads#L25
[23:33:20] I think it’s related to the deprecation of bullseye-backports, which got removed recently. You can use this base image instead for your opensearch container: https://docker-registry.wikimedia.org/openjdk-17-jre/tags/
[23:38:41] ACK, I was wondering if it had something to do with backports, but the message I found was from a year ago, so I didn’t think it was that
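One way to confirm where `systemd-standalone-tmpfiles` is (or was) published is to scan the Packages indices of the candidate suites. The sketch below checks the upstream Debian mirror using the standard `dists/<suite>/main/binary-amd64/Packages.gz` layout; the mirror URL and suite list are illustrative assumptions, not a statement about the Wikimedia apt repo's layout or about where the package actually lives today.

```python
import gzip
import urllib.error
import urllib.request

MIRROR = "http://deb.debian.org/debian"   # illustrative mirror, not apt.wikimedia.org
PACKAGE = "systemd-standalone-tmpfiles"

for suite in ("bullseye", "bullseye-backports", "bookworm"):
    url = f"{MIRROR}/dists/{suite}/main/binary-amd64/Packages.gz"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            index = gzip.decompress(resp.read()).decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Suites that have been removed or archived (as bullseye-backports was,
        # per the chat above) will typically return 404 here.
        print(f"{suite}: index not available (HTTP {err.code})")
        continue
    found = f"Package: {PACKAGE}\n" in index
    print(f"{suite}: {'present' if found else 'absent'}")
```

If the only suite that ever carried the package has been retired, switching to a base image that already ships the needed tooling (the openjdk-17-jre image suggested above) is the simpler fix.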