[08:51:08] Good morning! I'm back from a long PTO, so I have to catch up with a bunch of emails, but feel free to ping me/redirect me if there's anything you need (might also miss things in email, got ~5k)
[08:51:28] welcome back dcaro!
[08:51:54] I suggest declaring inbox bankruptcy and moving on :D
[08:52:48] yep, that might be a good way of moving forward xd
[09:03:53] hey dcaro welcome back!
[09:04:07] thanks :)
[09:12:01] welcome back!
[09:48:47] does anyone have insight on the pki project? Its certs are expiring and showing up in our alerts, but I'm not familiar with it (looking for info now, but in case anyone already knows xd)
[09:52:46] I suspect it should not be showing up in our dashboard, it's an old project. volans has admin though
[09:54:35] hmm, what's expiring is the puppet certificate for it, not the puppet server of the project xd, the alert subject is a bit confusing
[09:58:45] are those the VMs that someone dist-upgraded from buster instead of re-creating, and that are now getting older than VMs usually do?
[10:00:17] probably, let me check
[10:01:13] one of them fails to run puppet :/, looking
[10:03:26] they have been around since 2021 it seems, and they show as buster image though running bullseye, so yep
[10:13:41] on the toolforge logs side, there's an increase in the amount of storage used for logs starting on thursday, anyone familiar with that? (https://grafana.wikimedia.org/d/7120b794-4638-49f5-bccd-9716efc60f24/wmcs-object-storage-quotas?orgId=1&from=now-7d&to=now&timezone=utc)
[10:19:22] might be https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1136 ?
[10:20:04] hmm, is tracing going to the same bucket as regular logs?
[10:20:05] no, it's volans's fix to make tracing actually trace the events it's supposed to trace
[10:20:29] * volans sorry in meeting, will read backlog in a bit
[10:20:34] ack
[10:21:29] I don't see any more suspicious changes, looking at the cluster/logs
[10:33:35] * volans back
[10:34:35] dcaro: the nfs tracing goes to a different loki that has its own ceph buckets
[10:35:04] see https://horizon.wikimedia.org/project/containers/
[10:35:13] (infra-tracing-*)
[10:36:09] ah, from grafana I see you're monitoring per-project, not per-bucket group
[10:37:52] in reality though the nfs tracing is tracing less data, not more, because we decided to skip tracing tools' usage of their own home
[10:38:48] ack
[10:39:06] but I'm checking the graph
[10:39:25] yep, it might be just user activity, I also see a bump in jobs-api mem usage around that same time
[10:39:27] https://usercontent.irccloud-cdn.com/file/Yy9cmN2G/image.png
[10:39:29] because I did deploy that on Thu afternoon
[10:43:58] hmm.... might be ceph stuff... the total amount of space used does not really go down much, even though the objects got cleaned up
[10:44:01] https://usercontent.irccloud-cdn.com/file/NnYZOvr8/image.png
[10:44:09] (that's in the last 90 days)
[10:44:58] I'll create a new task to put this stuff in
[10:45:08] I can confirm that on the 6th we stored ~25% fewer lines than the 4th (pre vs post change)
[10:59:50] but let me know if I might be the cause ;)
[11:03:00] I don't think so, looking
[11:15:09] * dcaro lunch
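For the expiring puppet certificates mentioned above, a minimal Python sketch using the `cryptography` library to report how long a cert has left; the file path and hostname are assumptions (puppet agent cert locations vary by puppet version and packaging):

```python
import datetime
from pathlib import Path

from cryptography import x509

# Assumed path; puppet agent certs may live under /var/lib/puppet/ssl or
# /etc/puppetlabs/puppet/ssl depending on the packaging. Hostname is made up.
CERT = Path("/var/lib/puppet/ssl/certs/example.pki.eqiad1.wikimedia.cloud.pem")

cert = x509.load_pem_x509_certificate(CERT.read_bytes())
expires = cert.not_valid_after_utc  # cryptography >= 42; older versions: not_valid_after
remaining = expires - datetime.datetime.now(datetime.timezone.utc)
print(f"{cert.subject.rfc4514_string()} expires {expires:%Y-%m-%d} "
      f"({remaining.days} days left)")
```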
[13:00:11] hmm... so it turns out that the infra and tools buckets are under the same user, and share the quota
[13:00:26] https://www.irccloud.com/pastebin/CbgfGl7n/
[13:00:54] https://www.irccloud.com/pastebin/rI9ZKxhD/
[13:02:00] let me see if I can get stats per-bucket, the infra one is now >25G
[13:13:17] nah, we don't have usage logs enabled (potentially uses a lot of storage)
[13:16:03] at least horizon does show you per-bucket usage stats
[13:16:39] tracing chunks is ~25G, tool logs chunks is ~55G
[13:16:44] yep
[13:17:13] I can see the current sizes, but not historical ones (that's what I meant by stats, sorry)
[13:20:17] I could try parsing the mtimes for each file in the buckets xd
[13:22:26] there's a considerable increase in the amount of "entries" in the bucket since the patch volans did
[13:22:31] https://www.irccloud.com/pastebin/zxq5HrW7/
[13:22:58] So I'm leaning towards thinking that, though not the only thing, it's what made the quota usage increase in the last week
[13:26:45] dcaro: the infra-tracing one has not yet reached the retention limit in loki, so it's also normal that it's growing
[13:27:32] what's the expected retention? (so I can increase the quota for now; we might want to split them off to avoid users' quota impacting the infra, and infra impacting users)
[13:28:18] it was decided to keep 60d, as tools might use dumps only once a month or similar and we might want to not miss that
[13:29:07] what was the previous retention?
[13:29:19] 60d is from the start
[13:29:32] but we've not yet reached the 2 months since it has been deployed
[13:29:43] ahh, okok
[13:29:47] but we're close
[13:31:32] hmmm.... the patch should not have impacted the amount of data stored either, right? (maybe I'm misunderstanding it)
[13:31:43] maybe there's a tool that started spamming NFS?
[13:32:13] yes, but in two different ways: on one side we're no longer tracking tools accessing their own tool's home
[13:32:27] on the other we're tracking nfs access from all known symlinks/mountpoints
[13:32:33] while before only the canonical one
[13:32:41] so on one side we track less, on the other more
[13:36:55] ack
[13:37:06] * andrewbogott yawns and waves
[13:37:09] xd, we don't have enough logs on the logging infrastructure
[13:37:13] wb dcaro!
[13:37:16] andrewbogott: \o
[13:37:47] the grafana dashboard for the infra loki does not have enough retention to check the 5th feb usage
[13:38:11] oh, maybe it does! I loaded something
[13:38:52] in theory with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1212186 we should also have metrics in prometheus, but I'm not finding them right now
[13:38:54] I think it might be the .shared folder usage, it spiked several orders of magnitude since that day
[13:39:00] surely looking into the wrong place
[13:41:52] not orders of magnitude, we were already tracking a lot of activity from pagepile (IIRC it's the one compiling rust or go all the time)
[13:42:20] https://usercontent.irccloud-cdn.com/file/fdOrT1Wl/image.png
[13:42:40] select pagepile from the variables
[13:42:45] open the top graph
[13:42:46] select 1 week
[13:45:19] 32452535 lines from 6th 00 to 9th 12; 24521901 from 1st 00 to 4th 12. that's 32% more
[13:45:22] hmm... I see, in the lines ingested the bump is 5000 (on top of 10000), but why didn't it show up in the access before that?
[13:47:02] we were counting them as their own home
[13:47:06] for some reason
[13:47:10] https://grafana.wmcloud.org/goto/y-nUiGvvg?orgId=2
[13:48:52] so there were fixes deployed too?
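A sketch of the per-bucket history idea floated above ("parsing the mtimes for each file in the buckets"): since radosgw exposes an S3-compatible API, listing objects and summing sizes per LastModified day gives a rough growth curve even without usage logs enabled. The endpoint URL, credentials, and bucket name here are assumptions:

```python
from collections import Counter

import boto3  # radosgw speaks the S3 API, so boto3 works against it

# Assumed endpoint/credentials/bucket; substitute the real radosgw URL,
# the EC2-style keys of the loki user, and the actual chunk bucket name.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object.eqiad1.wikimediacloud.org",
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

bytes_per_day: Counter = Counter()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="infra-tracing-chunks"):
    for obj in page.get("Contents", []):
        # Object mtimes stand in for the missing historical per-bucket stats.
        bytes_per_day[obj["LastModified"].date()] += obj["Size"]

for day in sorted(bytes_per_day):
    print(day, f"{bytes_per_day[day] / 1024**3:.2f} GiB")
```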
[13:50:49] anyhow, the change is actually not that big overall
[13:51:36] would be nice to estimate the expected total usage though
[13:51:49] how can we estimate it if it depends on what the tools do?
[13:51:59] we can know the past, not the future :D
[13:52:07] that's why it's an estimation xd
[13:52:13] otherwise it would be just knowing
[13:52:31] one tool can 2x the whole thing
[13:52:41] if you get another pagepile
[13:53:01] can we dedup that somehow? do we need to know each and every time any tool accesses any file?
[13:53:28] do we have a space problem?
[13:53:50] if the quota is reached it might impact users, yes
[13:53:50] it's taking half of what the logs in the other loki are using, and those AFAIK are not even used
[13:53:59] those are used, yes, by users
[13:54:20] toolforge jobs logs fetches them from there (unless that was reverted)
[13:56:20] what's the total storage space in ceph?
[13:57:30] we can do all the optimizations we want, but also we're talking about 25GB...
[13:58:42] on ceph there's 211 TB left until problems arise (we increased it considerably lately it seems :) )
[13:59:32] how much do you think it will need?
[14:00:13] if I can find the prometheus metrics I can estimate from the last 3 days and re-evaluate in a week or so
[14:01:00] ack, thanks, I'll increase the quota accordingly
[14:02:50] taavi: shouldn't loki_write_sent_bytes_total{job="infra-tracing-loki"} work on prometheus.svc.toolforge.org? unless I missed something, your patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1212186 should have added those, no?
[14:05:33] volans: it should, but there is a firewall problem (T407852) which I tried to look into again last week, but couldn't find any clean ways of getting those policies working
[14:05:34] T407852: [infra,logging] prometheus failing to fetch the metrics endpoint - https://phabricator.wikimedia.org/T407852
[14:06:15] doh, somehow I missed that in my inbox, sorry
[14:06:24] that was tricky iirc
[14:10:56] volans: I'll bump the quota to 200G (2x), let me know if that's not enough when you have an estimate
[14:11:22] sure, thanks! I really hope it's more than enough
[14:12:31] it will get rid of the alert for a while at least xd
[14:12:37] we can always just ignore pagepile :D
[14:15:33] hehehehe
[14:37:12] I saved the value of loki_ingester_chunk_stored_bytes_total on all 3 write pods, will check again tomorrow to do a poor man's check if it makes any sense
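Once the metrics are reachable (pending T407852), the "estimate from the last 3 days" described above — recent ingest rate extrapolated to the 60d retention — could look roughly like this. The Prometheus API URL and the metric name are taken from the discussion, but treat both as assumptions:

```python
import requests

# Assumed API endpoint, per the prometheus.svc.toolforge.org host mentioned above.
PROM = "https://prometheus.svc.toolforge.org/api/v1/query"
WINDOW_DAYS = 3
RETENTION_DAYS = 60

# Bytes written to chunk storage across all write pods over the window.
query = f"sum(increase(loki_ingester_chunk_stored_bytes_total[{WINDOW_DAYS}d]))"
resp = requests.get(PROM, params={"query": query}, timeout=30)
resp.raise_for_status()
result = resp.json()["data"]["result"]

ingested = float(result[0]["value"][1])
per_day = ingested / WINDOW_DAYS
# What the bucket should hold once a full retention window of data is cycling.
steady_state = per_day * RETENTION_DAYS
print(f"~{per_day / 1024**3:.1f} GiB/day -> ~{steady_state / 1024**3:.0f} GiB at 60d retention")
```

The same counter is what the "poor man's check" above samples by hand: diffing its value a day apart gives the daily rate without needing the query API at all.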
[15:44:37] I penciled in an 'about' page for the tools-infra team interface (made by volans) -- now there's a red link to the team interface for tools-platform https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Tools_Infrastructure/About
[15:45:02] :D
[15:53:29] I have upgraded the codfw1dev instances of cloudnet and cloudvirt to dnsmasq 2.92 (as mentioned last week, as a test ground for also moving dnsmasq 2.92 to "main" of trixie-wikimedia)
[15:54:00] if someone is familiar with how dnsmasq is used there, please test it or otherwise provide me with steps for how to test it myself
[15:57:14] I believe that dnsmasq is used only for dhcp on new server creation. So our existing 'fullstack' tests that test VM creation and DNS should also test dnsmasq.
[15:57:50] those tests were still working at least up until 5 minutes ago. I'll check after another cycle or two.
[16:01:26] yeah, still working fine.
[16:03:07] moritzm: those dnsmasq processes on cloudnets are managed by neutron, and iirc it's not totally obvious how to restart them, so I'm going to reboot those nodes to make sure we're running what we think we're running
[16:05:09] ok, thanks
[16:06:08] restarting neutron-dhcp-agent.service should be enough
[16:10:34] * andrewbogott swats a mosquito with a sledge hammer
[16:18:35] moritzm: I think we're all good, but I need to run to the doctor so we should check in again tomorrow before upgrading eqiad1
[16:19:57] sounds good, thanks. there's no rush at all
[18:17:23] * dcaro off
[18:17:26] cya tomorrow!
[20:59:18] topranks, if you're around can you give https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238021 a glance?
[21:04:36] andrewbogott: I've no objection in principle I guess, I replied back on task to ask taavi why this is needed though
[21:04:46] I don't really think "because the old ones are" is a good rationale :P
[21:04:50] but if there is one, that is fine too
[21:05:03] fair. I assumed it was to avoid rate limits or similar.
[21:05:59] much better just to whitelist the cloud public IP if that's the goal
[21:06:36] there may be something about attribution to specific cloudvps, but tbh that is no different than connecting to anything through the NAT, we ideally need a way to log what connected to what
[21:06:58] I assume gerrit is still reachable for now? this isn't urgent this evening?
[21:08:05] topranks: it appears the initial reason was https://phabricator.wikimedia.org/T335197
[21:09:22] it's not mentioned in that task that I can see
[21:10:47] * andrewbogott following git back further and further...
[21:12:00] ok... that IP first started getting tracked in https://gerrit.wikimedia.org/r/c/operations/puppet/+/675556
[21:12:43] but there's nothing gerrit-specific in the associated task
[21:13:17] yeah that's the cloudgw initial config itself
[21:13:19] yes, because that list is not a new concept with cloudgw, it was ported from a custom neutron hack to there
[21:13:39] taavi: ok thanks, yeah, I thought it probably had some ancestor like that
[21:14:11] the original list came from T267779, which also doesn't detail why exactly some things were included
[21:14:11] T267779: CloudVPS: detail list of dmz_cidr optional NAT addresses to avoid reaching everything in production with internal private VM addresses - https://phabricator.wikimedia.org/T267779
[21:14:27] do you know why we have gerrit on the list?
[21:14:36] (before that list, everything in the wikimedia public space was excluded)
[21:14:38] I know for the wikis the reason is attribution of edits etc
[21:14:47] ok
[21:15:36] the original list pre-dates my time, so I can only guess that the idea back then was to include ~everything that things talked to, with the idea of reducing them as we go
[21:16:07] having direct IP addresses in gerrit logs has helped with troubleshooting traffic sources in the past. but with NAT logging and v6 adoption, I'd be up for trying to retire it here
[21:16:44] oh, that had slipped my mind - you have NAT logging now?
[21:16:45] that's great
[21:17:08] yeah, since sometime last year iirc
[21:17:10] do we want to try to retire it at the same time as the gerrit move?
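A quick way to see which side of the NAT a Cloud VPS instance lands on is to ask an external service which source address it sees. This only demonstrates the default NATted path — the dmz_cidr exemption is per-destination, so while gerrit is on the list, gerrit itself would see the instance's internal address, which you would have to confirm server-side (e.g. in gerrit's logs, as mentioned above). checkip.amazonaws.com is just one public echo service, used here as an assumption:

```python
import socket

import requests

# The instance's own address on the Cloud VPS network (a private range).
local_ip = socket.gethostbyname(socket.getfqdn())

# What an arbitrary external destination sees as our source address; for a
# destination NOT on the dmz_cidr exemption list, this is the cloudgw NAT IP.
seen_ip = requests.get("https://checkip.amazonaws.com", timeout=10).text.strip()

print(f"instance address:            {local_ip}")
print(f"address seen by the far end: {seen_ip}")
if seen_ip != local_ip:
    print("traffic to this destination is NATted at the cloudgw")
```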
[21:17:17] yeah, I've no major objection; in general I think the direction of movement should be for fewer things in wikimedia space being connected from the private VM ranges, but if we keep this it's not a big deal
[21:17:54] but for now I think it'd be good to maybe trial it without the exception and see how things go
[21:18:25] ok, so you're team 'see if it breaks during the move'?
[21:18:27] that's ok with me
[21:18:45] (mutante, I dragged you in here because earlier topranks was asking 'why now?')
[21:19:28] because it's the status quo that the gerrit IPs are in there, and tomorrow they will change
[21:19:40] I mean, why change it tomorrow
[21:19:54] (also, note that the list in Puppet needs to match the allowlist in homer ACLs, otherwise traffic is going to get dropped entirely)
[21:20:06] let it go through the NAT I say
[21:20:26] I think that's fine with me, until we find a reason against.
[21:21:20] andrewbogott: the real answer is "because scrapers could take down gerrit" and nothing else is holding us back now
[21:21:40] what I wonder is what the "it" in "it breaks" really is though
[21:21:48] jenkins?
[21:22:12] I would assume it would be the CI
[21:22:14] ok -- so shall I make a new patch that adds nothing and instead removes the old entries for gerrit?
[21:22:56] my understanding is that jenkins agents talk to some zuul component on the contint boxes instead of talking to gerrit directly, although I can definitely be wrong there
[21:23:11] andrewbogott: what does it do?
[21:23:19] for the things that could break: libup, codesearch at least
[21:23:26] we don't know, that's why we're proposing ripping it out rather than updating :)
[21:23:58] well, what is bad about having it in there?
[21:24:02] there was https://phabricator.wikimedia.org/T335197 which I think looked originally like it was the dmz but turned out not to be.
[21:24:24] mutante: having this exact conversation again in 2028 is what's bad about it
[21:24:43] mutante: so we are on a multi-year journey to not allow "privileged" access to Cloud VPS instances to anything on the wikimedia network
[21:24:55] ultimately we have people we don't know running things on there
[21:25:15] we don't necessarily need to take this moment to tighten up this particular rule
[21:25:40] but I can't think of any real reason why it is needed
[21:25:57] without it there, traffic to gerrit goes out through the cloudgw and gets NAT'd to a public IP
[21:26:03] which will work fine
[21:26:30] now it sounds like we know that codesearch will break
[21:28:32] topranks: you say "privileged" access, but looking at that ticket for codesearch it sounds like that is "any" access?
[21:28:52] is there a difference between those?
[21:28:54] T335197 ?
[21:28:55] T335197: codesearch is getting Failed to connect to gerrit-replica.wikimedia.org port 443: Connection timed out - https://phabricator.wikimedia.org/T335197
[21:28:58] "fatal: unable to access"
[21:29:04] yea
[21:29:08] I can't for the life of me see how that is in any way related to our discussion
[21:29:52] oh I see the patch
[21:29:56] there is a revert in there for the same file we are talking about
[21:30:19] T335197 happened because the list was updated in puppet without updating the matching cr firewall filters at the same time
[21:30:46] not because some hosts were (not) included in the list at all
[21:31:07] ah yeah that makes sense
[21:31:10] thanks taavi
[21:31:21] two scenarios work:
[21:31:31] 1) not natted and explicitly permitted through on the CR ACL
[21:31:34] 2) natted
[21:32:28] so if we change something to not be natted, but the CR ACL doesn't have it (which seems to have happened in the above task), it'll break
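The two working scenarios spelled out above boil down to a simple rule: reachability from a VM to a production destination needs either NAT or an explicit CR ACL allow. A minimal sketch of that logic (names are illustrative, not from any real config):

```python
def vm_can_reach(natted: bool, acl_allows_vm_range: bool) -> bool:
    """Reachability from a Cloud VPS instance to a wikimedia-prod destination.

    1) not natted: the private VM range must be explicitly permitted
       through on the CR ACL, or traffic is dropped entirely.
    2) natted: the cloudgw public IP is already allowed to everything.
    """
    return natted or acl_allows_vm_range


# T335197-style breakage: the puppet list made the traffic not-natted,
# but the matching CR firewall filter was not updated at the same time.
assert vm_can_reach(natted=False, acl_allows_vm_range=False) is False
# The two working scenarios from the discussion:
assert vm_can_reach(natted=True, acl_allows_vm_range=False) is True
assert vm_can_reach(natted=False, acl_allows_vm_range=True) is True
```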
[21:34:38] ok, so -- let's make a decision. Rip out the old dmz entry, or add the new ones? Sounds like mostly y'all vote 'rip out', is that right?
[21:37:29] +1 from me. sorry to be a pain, but I actually think it makes things simpler and easier to manage over time (fewer T335197s)
[21:37:35] T335197: codesearch is getting Failed to connect to gerrit-replica.wikimedia.org port 443: Connection timed out - https://phabricator.wikimedia.org/T335197
[21:38:13] wait, +1 on the proposed patch or +1 on ripping out?
[21:39:15] topranks: ^ ?
[21:39:51] I think t.opranks was in favor of NATting
[21:40:25] first time for everything :D
[21:40:36] but yes
[21:40:42] ok! ty
[21:40:55] can I bash that? :D
[21:42:58] so.. if you want to remove gerrit IPs from that list, ok, I mean.. one thing we don't have to worry about in the future. but it leaves the following options:
[21:43:18] a) remove gerrit IPs now and somehow test unknown things not breaking within the next few hours
[21:43:41] b) see if it breaks tomorrow right after the switch-over and introduce extra uncertainty
[21:43:46] I'm going to remove them now because that will be /slightly/ more revealing than letting it ride along with the cdn migration
[21:44:06] c) postpone the gerrit switch because of this
[21:44:18] definitely no need for c :)
[21:46:07] topranks, taavi, new proposal https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238036
[21:52:46] andrewbogott: that seems to be working at first glance
[21:54:26] topranks: so something should happen right now in homer to match the change in puppet?
[21:55:01] mutante: no, the other way around
[21:55:16] if we merged the original patch we'd need to adjust the homer ACL to allow Cloud VPS IPs -> new gerrit-lb IPs
[21:55:45] but now the Cloud VPS IPs will be natted to an internet IP, which is already allowed to everything on that ACL
[21:56:15] I will prep a patch to remove the exception from the homer acl too, but we can look at it tomorrow
[21:56:47] I wanted to suggest that we avoid a situation where, again, puppet is different from the ACLs
[21:59:05] the patch is applied on cloudgw nodes and I'm pretty sure codesearch is still happy.
[21:59:25] So maybe there's a leftover/stray entry on homer for that rule that I just removed...
[21:59:38] but otherwise I think we're good.
[22:02:55] yeah it looks fine
[22:03:16] thx topranks!
[22:03:19] we have removed the potential for the discrepancy Daniel mentions
[22:03:25] and taavi
[22:03:26] np! though I'm sure you're sorry you asked :P
[22:03:32] and whoever is at the dinner table waiting for you
[22:03:59] ok, thank you all
[22:07:36] * andrewbogott has to feed the cat and go do a family tech support thing and then will come back and stare nervously at codesearch.
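For the nervous staring at codesearch, a small smoke test for the T335197 symptom ("Failed to connect to gerrit-replica.wikimedia.org port 443: Connection timed out"), runnable from a Cloud VPS instance after the dmz entry removal. A sketch only — the host list is an assumption of what's worth watching, taken from the hosts named in the discussion:

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for host in ("gerrit.wikimedia.org", "gerrit-replica.wikimedia.org"):
    status = "ok" if can_reach(host) else "TIMEOUT/unreachable (T335197-style)"
    print(f"{host}:443 {status}")
```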