[00:52:13] I am working on some stuff for the offsite and realized that https://k8s-status.toolforge.org/ is not on https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Ownership. Should I WP:BOLD'ly add it?
[00:53:16] https://wikitech.wikimedia.org/wiki/Tool:Fourohfour isn't on there either
[01:19:28] bd808: yes please!
[03:11:46] For anyone who is curious: here is a quick report I ran about membership of cloud-vps projects. https://docs.google.com/spreadsheets/d/1L0ZjH1FUH5XJK-bgs_rp6nPZLGymb_m2WtdP0E5eMgo/edit?usp=sharing I wouldn't put those stats on a slide without closer analysis though!
[10:42:17] arturo: thank you for all the work on the incident doc and follow-up items!
[10:42:29] you are welcome!
[12:49:14] please carefully review this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1098498 -- it is one of the causes of the outage the other day
[12:53:07] is T380833 on our radar?
[12:53:08] T380833: [harbor] some artifacts and projects seems to have gone missing - https://phabricator.wikimedia.org/T380833
[12:59:06] that's not one of the causes of the outage
[12:59:22] it was surfaced during it, but it's unrelated afaik
[12:59:45] oh, wait, you mean the patch, not the task?
[12:59:52] I'm confused xd
[13:00:28] Raymond_Ndibe: started looking into the task, that is unrelated to the outage (afaik), looking into the other patch (that is related to the outage)
[13:03:44] yeah, the radar question was related to harbor. I just linked the ticket to the outage, because the description says so, but I can unlink them if they are not really related
[13:07:43] it's ok, we noticed it during the outage and it prevented some tools from restarting correctly, so it seems reasonable to link the tasks (even if it was not the cause)
[13:07:46] * dcaro lunch
[13:10:06] patch merged and deployed (without an outage!)
[13:23:26] yay!
[13:48:14] what was this recent spike in failed deployments? https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&from=now-1h&to=now&var-cluster_datasource=prometheus-tools (looks like just a blip but I'm curious)
[13:53:48] I have no idea!
[13:57:57] * arturo food
[13:59:59] looking at some of the runbooks, there seem to be a few outdated links, e.g. on https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsApiUpMetricUnknown: the links to the Karma UI (is that something we still use?) and both the tools/toolsbeta prometheus links
[14:00:31] should those be links to grafana?
[14:03:25] no, prometheus is what's behind grafana; it points there because prometheus is the source of truth for those metrics -- from prometheus you can see the scrapes (whether they failed) and the alerts that are defined
[14:03:46] that alert is about the metric not being there, so you have to check why prometheus did not add it
[14:04:25] ok, both of those links are broken though. what are the current ones?
[14:04:44] oh, one of them should be https://tools-prometheus.wmflabs.org/tools
[14:05:32] hmm, shouldn't wmcloud.org be the right domain?
[14:07:34] I'll add that one as it's the new domain we should be using in the proxies
[14:07:45] and the karma UI link should be this? https://alerts.wikimedia.org/
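For the runbook/Prometheus discussion above (checking why a metric such as the one behind JobsApiUpMetricUnknown disappeared), a quick way to confirm whether the tools Prometheus still has a series is to hit its HTTP query API directly. This is only a sketch: the base URL is the /tools path mentioned in the chat, but the metric selector below is a placeholder and would need to be replaced with the real job name.

```python
# Minimal sketch: ask the tools Prometheus whether a metric exists at all.
# The /tools path prefix comes from the chat above; the selector is a
# hypothetical placeholder -- swap in the metric the alert actually refers to.
import requests

PROMETHEUS = "https://prometheus.svc.toolforge.org/tools"
QUERY = 'up{job=~".*jobs-api.*"}'  # assumption: adjust to the real job label

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
results = resp.json()["data"]["result"]

if not results:
    print(f"No series matched {QUERY!r} -- check why Prometheus stopped scraping it")
else:
    for series in results:
        print(series["metric"], series["value"])
```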
[14:07:55] wait, this works: https://prometheus.svc.toolforge.org
[14:08:08] yep, that's the one we use (the prod one)
[14:10:10] the svc one leads me to the apache2 debian default page
[14:10:54] they all do, unless you add the extra /tools; updated the pages
[14:11:02] (envvars, jobs and builds api)
[14:11:24] we could fix that in the apache config
[14:12:45] aha, we were editing at the same time
[14:13:12] sorry xd
[14:13:24] feel free to write on top of my changes
[14:13:33] I might have :)
[14:14:34] np
[14:15:14] we need a bot that alerts when links are broken on wikitech :))
[14:15:56] I added an 'exploration' section to the incident wiki page. I plan on adding more details there on the causality of the incident (starting from the 3 main issues and doing a "this was caused by" chain, noting whether each link is strong or a guess); feel free to add/change/etc. with your notes
[14:16:18] it kinda rings a bell
[14:16:20] (the bot)
[14:21:46] * dhinus paged: Cloudvirt node cloudvirt1061 is down
[14:23:34] arturo: this is probably related to the maintenance mentioned in T380673
[14:23:35] T380673: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673
[14:24:26] I will add a downtime with "sre.hosts.downtime --days 14"
[14:25:51] I think this is the first situation where the new "kernel error" alert has prevented a wider outage
[14:26:56] that server had a kernel error yesterday, it triggered the alert and it was manually depooled
[14:37:12] there's a link about the missing harbor image thing in the backscroll but no response as far as I can see... is anyone investigating that? (Or, alternatively, can someone explain to me why it isn't extremely serious?)
[14:37:17] (also, good morning!)
[14:40:42] andrewbogott: Raymond_Ndibe started looking into it, it's not a generalized issue (it affects <10 tools), and you can always rebuild your image (worst case scenario)
[14:41:06] (I think I replied :/)
[14:43:22] dcaro: oh, you did, there was a : in your response so I scrolled past it as not directed to me :)
[14:43:26] <10 tools is reassuring
[14:59:24] dhinus: thanks for T380673. Yes, a good catch by the detector
[14:59:25] T380673: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673
[14:59:50] I used a cookbook to drain the HV, so maybe I did not use the right one, the one that should include the downtime
[15:10:29] I think you used the right cookbook, I got the same alert when I rebooted a different one on Monday. Something is a bit wrong with the timing in the cookbook I think
[15:34:31] the drain cookbook did not alert, but today DCops rebooted it again because we told them it was depooled, and that caused the alert
[15:44:09] new DNS failures today: T380991
[15:44:10] T380991: Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org (2024-11-27) - https://phabricator.wikimedia.org/T380991
[16:07:18] Fresh resolver issues in the integration project -- T380991
[16:07:19] T380991: Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org (2024-11-27) - https://phabricator.wikimedia.org/T380991
[16:09:14] 100% guess, but it feels reasonable to me that CI does a lot of DNS lookups and people pay attention to failed jobs, so this is less likely to be integration-project-specific and more likely that resolver failures are noticed there and reported.
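On the "we need a bot that alerts when links are broken on wikitech" idea above: a minimal sketch of what such a checker could look like, assuming the runbook URLs are fed in by hand. A real bot would pull page lists from the MediaWiki API and post alerts somewhere instead of printing; everything below is illustrative.

```python
# Hypothetical sketch of a broken-link checker for a handful of runbook links.
# The URL list, timeout, and plain print() reporting are all placeholders.
import requests

LINKS_TO_CHECK = [
    "https://tools-prometheus.wmflabs.org/tools",
    "https://prometheus.svc.toolforge.org/tools",
    "https://alerts.wikimedia.org/",
]

def check(url):
    """Return an error description if the link looks broken, else None."""
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
    except requests.RequestException as exc:
        return f"request failed: {exc}"
    if resp.status_code >= 400:
        return f"HTTP {resp.status_code}"
    return None

for url in LINKS_TO_CHECK:
    problem = check(url)
    if problem:
        print(f"BROKEN {url}: {problem}")  # a real bot would alert instead of printing
```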
[16:23:38] aahh, it's the same ticket dhinus shared, yep, I agree it's probably not project-specific
[16:28:28] maybe the recursors can be offloaded with a local cache?
[16:28:41] I have long lost track of how DNS caching is managed
[16:29:23] I would really like to find the root cause of the recursor issues, it seems they are getting more frequent, and even if we work around them, they will bite us later
[16:29:43] +1
[16:29:45] it's possible it's a more general network issue, and the recursor issues are the symptom
[16:30:23] there are no clear or obvious indications in the logs or in the metrics that the DNS recursors are malfunctioning
[16:30:37] at least I don't see any
[16:30:46] maybe the requests are not even getting to the recursors?
[16:31:58] but the servers get 2K reqs/s, why would some arbitrary requests not get there?
[16:32:11] and I assume clients would retry anyway
[16:32:32] no idea...
[16:32:44] so, assuming a client retries 3 times before reporting a timeout, why would 3 requests in a row not get to the server?
[16:38:13] The recursors could move to VMs and we could have a lot more of them, if it turns out to be a capacity issue. I'd like to rule out networking issues first though.
[16:39:43] T305834 would be an easy way to reduce the recursor query volume by a large percentage
[16:39:44] T305834: Cloud VPS: drop wmflabs names from profile::resolving::domain_search - https://phabricator.wikimedia.org/T305834
[16:40:37] I think there's still a handful of VMs with wmflabs domains but we're close to being rid of them
[16:40:47] * andrewbogott starts with 'watch -e dig gerrit.wikimedia.org'
[16:40:57] xd
[16:41:01] there's one that's shut down that we could delete anytime
[16:41:18] (and a bunch of related patches to remove support for those that I'd like reviews for)
[16:41:32] how old are wmflabs VMs?
[16:42:04] more of a statement than a question: how old are wmflabs VMs! :-P
[16:42:15] the last .wmflabs VMs date from just after the buster release
[16:44:05] taavi: do you already have a command handy for spitting out all existing .wmflabs domains? If not I'll invent one.
[16:44:19] os record list 114f1333-c2c1-44d3-beb4-ebed1a91742b --sudo-project-id noauth-project
[16:44:39] (that's everything except .svc.eqiad.wmflabs)
[16:44:52] thx
[16:45:24] deleting deployment-cumin is T380678
[16:45:25] T380678: Re-create deployment-cumin - https://phabricator.wikimedia.org/T380678
[16:47:02] basically the only problem here is ensuring the 'tools-db' and 'tools-redis' short names keep working in toolforge
[16:47:43] deployment-cumin has been shut off since Nov. 24, 2024, 12:47 p.m., that seems promising (maybe you did it?)
[16:47:57] yes :-)
[16:48:01] great!
[16:50:57] taavi: do you have a gut feel for how many tools might fail hard if we break resolution on those ancient tools-db and tools-redis bare hostnames? I know I have been an advocate of dragging around legacy aliases pretty much forever, but I can also see that sometimes breaking things for a bit is the better option.
[16:55:22] could we potentially grep tools nfs and actually find uses?
[16:55:47] probably too many :( https://codesearch.wmcloud.org/search/?q=tools-%28db%7Credis%29%28%24%7C%5B%5E%5C.%5D%29&files=&excludeFiles=&repos= is showing a non-zero number of hits, and I suspect there's a far greater number in older tools that don't use the git hosts that codesearch indexes
[17:01:42] ugh, yeah
[17:04:17] andrewbogott: `watch -e dig` would most probably not catch it :/
[17:04:34] hashar: have a thought about something on the cli that might?
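On the "something on the cli that might" catch these intermittent failures: a sketch of a tighter probe than `watch -e dig`, which only fires every 2 seconds and then waits dig's default 5-second timeout. The log path, the 100 ms interval, and the 1-second timeout below are arbitrary choices, not anything from the chat.

```python
# Sketch: hammer the resolver with short-timeout dig queries and log any
# failure with a timestamp, so intermittent timeouts leave a trace.
import subprocess
import time
from datetime import datetime, timezone

NAME = "gerrit.wikimedia.org"
LOGFILE = "/tmp/dns-probe-failures.log"  # arbitrary path

while True:
    result = subprocess.run(
        ["dig", "+time=1", "+tries=1", "+short", NAME],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        stamp = datetime.now(timezone.utc).isoformat()
        line = f"{stamp} rc={result.returncode} out={result.stdout.strip()!r}\n"
        with open(LOGFILE, "a") as fh:
            fh.write(line)
        print(line, end="")
    time.sleep(0.1)
```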
[17:04:40] I mean
[17:04:49] watch defaults to running the command every two seconds
[17:04:55] btw, it did just catch it! Maybe
[17:04:59] https://www.irccloud.com/pastebin/JO40gywa/
[17:05:02] OHHHHHHHHHHHHHHHHHHHhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
[17:05:09] ahah
[17:05:17] andrewbogott: I added a note to the task, our k8s api pods are also seeing the issue
[17:05:20] * hashar files a lottery ticket
[17:05:23] (in tools)
[17:05:34] andrewbogott: I stand corrected! :b
[17:06:56] https://phabricator.wikimedia.org/P69462
[17:07:00] that is what I came up with last time
[17:07:06] I truly can't explain why 3 timeouts in a row. There are no packet drops reported anywhere, no?
[17:08:08] that is 8 threads doing gethostbyname() with a 250ms sleep, and it assumes that it would throw an exception on error
[17:08:48] hashar: thanks, that might be quicker
[17:10:11] I don't think it caught anything though. Then maybe python 3.9.2 socket.gethostbyname() retries by itself/has a long timeout or whatever
[17:11:52] /* Lock to allow python interpreter to continue, but only allow one
[17:11:52] thread to be in gethostbyname or getaddrinfo */
[17:11:52] #if defined(USE_GETHOSTBYNAME_LOCK) || defined(USE_GETADDRINFO_LOCK)
[17:11:52] static PyThread_type_lock netdb_lock;
[17:11:52] #endif
[17:12:06] socket.gethostbyname should end up using the system's resolver cache, likely sssd in Cloud VPS
[17:12:08] mouahah there is a global lock... so my threads are all waiting on each other
[17:12:20] (maybe)
[17:14:45] please review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1098571
[17:15:23] andrewbogott: That paste catching a failure with `watch dig` reminded me of this quip from Moriel about debugging a mediawiki-vagrant problem that was vexing her -- https://bash.toolforge.org/quip/AVteW6igQMK9DA-FLEiM :)
[17:21:05] as long as it's happening at all, I'm happy that it's happening a lot
[17:21:49] taavi: you already added a hiera override for toolforge?
[17:22:16] not needed, the "%{::wmcs_project}.eqiad.wmflabs" entry (not touched in that patch) keeps the tools-db/redis names working there for now
[17:22:25] feel free to quip my claim of "that is not going to work" only to be proven wrong a minute later :b
[17:22:38] taavi: oh sure, ok
[17:23:32] hashar: :) that was the best kind of being wrong
[17:24:01] hashar: right as you were saying that I was thinking '2 seconds is way too slow, I need to shorten that' and then it failed
[17:25:24] OR
[17:25:38] I just had a random thought about how amazing it would be to be able to search all of the toolforge tool log output for things like these failures. #someday
[17:25:39] so yeah, lots of luck :b
[17:26:02] I assume dig does a direct query over the network, so at least that shows there was a timeout
[17:26:19] that can probably be pasted to the task for later reference
[17:29:03] I am kind of wondering what constitutes a timeout when dig should be using UDP for the lookups per its man page
[17:29:28] timeout waiting for a response?*
[17:29:37] yeah, the watch output says ' SERVER: 172.20.255.1#53(172.20.255.1) (UDP)'
[17:29:50] I guess a "I sent a packet but never got a matching datagram back"
[17:30:15] I'd have the same guess yes (could be that the packet was lost, or ignored)
[17:30:48] which can be the recursor never replying
[17:30:52] or one of the packets being lsot
[17:30:53] lost
[17:30:54] there are a lot of things between dig and that recursor :/
[17:31:07] "This option sets the timeout for a query to T seconds. The default timeout is 5 seconds. An attempt to set T to less than 1 is silently set to 1."
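Given the CPython snippet above showing the interpreter-level netdb lock (so the 8 gethostbyname() threads in P69462 end up taking turns rather than probing in parallel), one way to get genuinely overlapping probes is to run each lookup in its own dig process from a thread pool. A sketch only, as a variant of the previous one; the worker count and interval merely mirror the numbers mentioned for the original paste.

```python
# Sketch: overlapping DNS probes that avoid socket.gethostbyname(), so the
# interpreter-level netdb lock quoted above cannot serialize them.
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

NAME = "gerrit.wikimedia.org"
WORKERS = 8        # mirrors the 8 threads described for P69462
INTERVAL = 0.25    # mirrors the 250ms sleep

def probe(worker_id):
    while True:
        result = subprocess.run(
            ["dig", "+time=1", "+tries=1", "+short", NAME],
            capture_output=True, text=True,
        )
        if result.returncode != 0 or not result.stdout.strip():
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} worker={worker_id} rc={result.returncode}")
        time.sleep(INTERVAL)

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for i in range(WORKERS):
        pool.submit(probe, i)
```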
[17:31:14] And I think I just got lucky the first time, trying again with a .1s interval and no failure yet.
[17:32:36] ah, there it goes
[17:32:49] * hashar awards the barn star of reproducibility
[17:35:05] probably unrelated, but the recursor log just threw out a storm of
[17:35:06] cloudservices1005 pdns-recursor[122165]: msg="Ignoring empty (qdcount == 0) query on server socket!" subsystem="in" level="0" prio="Error" tid="3" ts="1732728865.038" proto="udp" remote="172.16.6.46:33794"
[17:35:36] that remote is integration-agent-docker-1040.integration.eqiad1.wikimedia.cloud
[17:35:42] yeah
[17:36:07] it didn't cause my watch to fail though
[17:38:23] that might be me
[17:39:00] yeah sorry, I tried emitting a bunch of empty udp packets to check whether there were packet loss or latency spikes
[17:39:03] that was dumb
[17:39:28] np!
[17:39:31] Now I see
[17:39:32] cloudservices1005 pdns-recursor[122165]: msg="Sending SERVFAIL during resolve" error="Too much time waiting for www.t-ds.info|AAAA, timeouts: 5, throttles: 1, queries: 6, 7513msec" subsystem="syncres" level="0" prio="Notice" tid="1" ts="1732729156.698" ecs="" mtid="13153531" proto="udp" qname="www.t-ds.info" qtype="AAAA" remote="172.16.6.122:46225"
[17:39:46] oh, that's someone trying to resolve a fake domain, normal
[17:43:26] yep, there's a bunch of those in the logs, but I did not find any for gerrit (for example) last time I checked
[17:43:34] (a couple of days ago)
[17:44:03] there were some errors about some response from upstream DNS having the wrong size/format or something (it was a bit cryptic)
[17:45:25] taavi/arturo: can you tell me more about how 172.20.255.1 balances between the two backends, and if there's a way for me to track when/if it flips from one to the other?
[17:45:49] gtg, good luck
[17:45:52] * dcaro off
[17:46:06] basically the one in the same rack as the active cloudgw node gets all the traffic
[17:47:26] so it only flips if cloudgw also flips
[17:47:33] yeah
[17:54:59] so if one goes down, it will not flip unless the cloudgw flips?
[17:54:59] * andrewbogott waiting to see if it still fails after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1098571
[17:56:22] sorry for hijacking the conversation for something (hopefully) easier: do you know why horizon is not showing me the floating ip assignments for "project-proxy", but the CLI is?
[17:56:43] the "Action" buttons are also not visible, they work in project "tools", but not in "project-proxy"
[18:03:55] dhinus: are you a project member or just an observer?
[18:04:09] hashar: this is a long-shot but any change the dns issues date all the way back to https://gerrit.wikimedia.org/r/c/operations/puppet/+/956419 ?
[18:04:20] andrewbogott: I've just added myself as a member, but it didn't make a difference. Maybe I need to log out though.
[18:04:31] just switch to a different project and back
[18:05:33] *any chance*
[18:06:25] andrewbogott: slightly better, I can see the buttons now. I cannot see the value for one of them though, maybe because it's not matching any instance?
[18:07:02] dhinus: what panel are you looking at?
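To answer the "did not find any for gerrit" kind of question above a bit more systematically, a small scanner over the pdns-recursor journal can count SERVFAIL lines per qname. This is a sketch that assumes the structured key="value" log format visible in the pasted lines and that `journalctl -u pdns-recursor` is readable on the cloudservices host; both are assumptions to verify.

```python
# Sketch: count pdns-recursor "Sending SERVFAIL during resolve" lines per qname.
# Assumes the key="value" log format seen in the pasted lines and readable
# journald logs for the pdns-recursor unit on the host.
import re
import subprocess
from collections import Counter

SERVFAIL_RE = re.compile(r'msg="Sending SERVFAIL during resolve".*?qname="([^"]+)"')

output = subprocess.run(
    ["journalctl", "-u", "pdns-recursor", "--since", "-2h", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

counts = Counter(m.group(1) for m in SERVFAIL_RE.finditer(output))

for qname, count in counts.most_common(20):
    print(f"{count:5d}  {qname}")
```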
[18:07:07] https://horizon.wikimedia.org/project/floating_ips/
[18:07:12] project: "project-proxy"
[18:07:29] the CLI tells me both are assigned: wmcs-openstack floating ip list --project project-proxy
[18:08:07] ok, I see one mapped and one unmapped in the ui
[18:08:25] yes, if it's mapped to a VM that doesn't exist anymore it probably doesn't know what to show you there
[18:08:50] the fun part is I can use that floating IP, i.e. I can ping it or curl it
[18:09:12] unless I'm missing something, it's the main floating IP for the web proxy
[18:09:14] one of them is mapped to a neutron VIP, one to an instance
[18:09:28] taavi: ah-ha, where is that mapping?
[18:09:56] the CLI command you pasted shows it
[18:10:04] so that's probably a smallish horizon bug, it should show the IP even if it can't attribute it to anything...
[18:10:41] I mean, how do I configure the "neutron VIP"? I was just trying to find out which proxy is active
[18:11:05] and the CLI tells me it maps to 172.16.5.140, where do I go next? :)
[18:11:07] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Keepalived
[18:11:13] thanks!
[18:12:42] that's the wikitech page I was missing. I'll add a link from https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy
[18:25:08] andrewbogott: the first report I was aware of is https://phabricator.wikimedia.org/T374830 which is from September
[18:25:19] and I don't think I heard of it before that
[18:26:16] maybe it is rate limiting :)
[18:28:29] andrewbogott: I raised T381021
[18:28:30] T381021: [horizon] Floating IP pointing to Neutron VIP is not displayed - https://phabricator.wikimedia.org/T381021
[18:40:21] taavi: I added some info to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy but I'm a complete noob with keepalived, feel free to suggest better ways
[18:40:34] I assume there are some bits in puppet that might be worth linking but I haven't looked them up yet
[18:43:21] hashar: ok, so that's a whole year later than the switch to bgp
[18:43:27] dhinus: ok!
[18:43:57] maybe the network part can be tested by having a daemon listening on a UDP port that would reply to incoming packets
[18:44:19] then from a WMCS instance send packets to the cloudservices machine and see whether there is some packet loss
[18:44:31] * dhinus logging off
[18:44:36] but I imagine that would show up somewhere in graphs
[18:45:26] hashar: my watch test is no longer failing, which suggests that it may be just a load issue. Can you please let me know if/when you see that failure again?
[18:45:47] (on T380991 if you remember)
[18:45:48] T380991: Various CI jobs failing with: Could not resolve host: gerrit.wikimedia.org (2024-11-27) - https://phabricator.wikimedia.org/T380991
[18:45:57] I don't see them, but devs would ping the task or file one when it happens :)
[18:46:50] https://grafana.wikimedia.org/d/000000044/pdns-recursor-stats?orgId=1&from=now-6h&to=now&viewPanel=3 shows a significant drop in load after that one patch was merged :-)
[18:47:51] also two days ago :/ weird
[18:52:56] andrewbogott: https://phabricator.wikimedia.org/P71234
[18:53:14] those are the Quibble build logs that match 'Could not resolve host'
[18:53:19] over the last 3 / 4 days
[18:54:00] maybe there is a rate limiter
[18:54:05] packet throttling per project
[18:54:12] shouldn't be
[18:54:18] if that's from just now then it looks like things are better?
[18:54:42] last one at 17:09
[18:55:03] 90+ minutes ago, right?
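A sketch of the UDP test hashar suggested above: a trivial echo daemon for the server side, plus a client that counts lost packets and reports worst-case latency from a Cloud VPS instance. The port, packet count, and 1-second timeout are arbitrary choices, and late replies are not matched to individual sends.

```python
# Sketch of the suggested UDP loss test: run "udp-probe.py server" on the
# server side and "udp-probe.py <host>" from an instance. Port and counts
# are arbitrary placeholders.
import socket
import sys
import time

PORT = 53531  # arbitrary unprivileged port

def server():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", PORT))
    while True:
        data, addr = sock.recvfrom(512)
        sock.sendto(data, addr)  # echo back whatever arrived

def client(host, count=1000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    lost = 0
    latencies = []
    for i in range(count):
        start = time.monotonic()
        sock.sendto(str(i).encode(), (host, PORT))
        try:
            sock.recvfrom(512)  # sketch only: late replies are not matched to sends
            latencies.append(time.monotonic() - start)
        except socket.timeout:
            lost += 1
    if latencies:
        print(f"sent={count} lost={lost} max_latency_ms={max(latencies) * 1000:.1f}")
    else:
        print(f"sent={count} lost={lost} (no replies at all)")

if __name__ == "__main__":
    if sys.argv[1:] == ["server"]:
        server()
    else:
        client(sys.argv[1])
```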
[18:55:05] so yeah that is "solved"
[18:55:07] yeah
[18:55:08] UTC
[18:55:12] (sorry, timezones are a pain)
[18:55:42] last time that got "solved" by switching the service from one cloudservices host to another
[18:55:57] and I think the DRAC / kernel have been updated
[18:56:03] but the root cause was not found
[18:56:19] which honestly is fine to me. It is not like those kinds of issues are easy to find
[18:56:56] oh yeah, I'm not saying it's fixed for good, just thinking it might be a load issue rather than network flaking (which is what I was thinking until an hour ago)
[18:57:12] yup
[18:57:51] taavi: what are your thoughts about selectively removing "%{::wmcs_project}.eqiad.wmflabs" from search (e.g. everywhere but tools, or alternately just in integration + deployment-prep)?
[18:57:57] Of course tools is probably 90% of the load
[19:15:02] the only thing I notice is packets being dropped on cloudgw1002 on enp101s0f0np0
[19:15:07] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=cloudgw1002&var-datasource=thanos&var-cluster=wmcs&from=now-6h&to=now&viewPanel=11
[19:17:02] which started on sunday
[19:17:05] so maybe that correlates
[19:17:14] aka something is off in the network / cloudgw
[19:17:19] causing the udp packets to be dropped
[19:18:10] the packets emitted by the instances never reach the server
[19:18:13] or the response is never sent
[19:18:19] you're right, that doesn't look great
[19:18:32] the client side eventually stops waiting and flags a timeout
[19:18:44] doesn't explain why the retry would not get it
[19:20:58] ah https://phabricator.wikimedia.org/T374830#10206549
[19:21:14] so the cloudgw server had some network card issue
[19:21:27] traffic got moved to cloudgw1001
[19:21:38] and things got resolved
[19:21:47] so I am inclined to state that is the same trouble happening again
[19:22:03] the task was https://phabricator.wikimedia.org/T376589
[19:22:37] topranks: still around?
[19:22:42] pretty sure arturo is not
[19:26:15] I'm here yeah
[19:26:22] what's up?
[19:26:25] packet drops
[19:26:28] * topranks reading scrollback
[19:27:49] It doesn't look as serious to me as T376589 but still not great
[19:27:49] T376589: cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589
[19:28:09] so yeah, last time there was: [Mon Oct 7 07:17:27 2024] NETDEV WATCHDOG: enp101s0f0np0 (bnxt_en): transmit queue 4 timed out
[19:28:16] Arturo rebooted the server
[19:28:25] and had the idrac updated
[19:28:29] so maybe it is a kernel bug
[19:30:00] and when the trace contains `asm_exc_invalid_op`, that sounds terrible :b
[19:33:21] there are some receive errors showing on that NIC but they aren't incrementing at the moment
[19:33:25] https://www.irccloud.com/pastebin/JQpgimCl/
[19:36:16] sry, 1002 is the active one right now
[19:44:33] there are some small increments there
[19:46:06] tbh not seeing any smoking gun as such
[19:46:28] topranks: is there anything in dmesg?
[19:46:45] the system may just be approaching limits in terms of packet processing, not sure how much optimisation has been done on it for high throughput
[19:46:53] let me see
[19:47:07] topranks: do you mean nothing to be concerned about, or just nothing to worry about? I was only worrying because https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=cloudgw1002&var-datasource=thanos&var-cluster=wmcs&from=now-7d&to=now&viewPanel=11 shows new behavior as of a few days ago
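For watching whether the enp101s0f0np0 drop counters are still incrementing (which topranks was checking by hand above), a sketch that polls the kernel's per-interface statistics and prints deltas. The interface name comes from the discussion; the sysfs path is standard Linux but worth double-checking on the host, and the 10-second interval is arbitrary.

```python
# Sketch: poll /sys/class/net/<iface>/statistics and print counter deltas,
# to see whether rx/tx drops and errors are still incrementing.
import time
from pathlib import Path

IFACE = "enp101s0f0np0"
COUNTERS = ["rx_dropped", "tx_dropped", "rx_errors", "tx_errors"]
STATS_DIR = Path("/sys/class/net") / IFACE / "statistics"

def read_counters():
    return {name: int((STATS_DIR / name).read_text()) for name in COUNTERS}

previous = read_counters()
while True:
    time.sleep(10)
    current = read_counters()
    deltas = {name: current[name] - previous[name] for name in COUNTERS}
    if any(deltas.values()):
        print(time.strftime("%H:%M:%S"), deltas)
    previous = current
```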
[19:47:15] yeah there are these
[19:47:16] [3354093.871467] bnxt_en 0000:65:00.0 enp101s0f0np0: Received firmware debug notification, data1: 0xdd03, data2: 0x0
[19:49:53] earlier in October Arturo found: NETDEV WATCHDOG: enp101s0f0np0 (bnxt_en): transmit queue 4 timed out
[19:50:49] yeah, so I'd need to look in more detail but that could just be the cpu being unable to keep up with the arrival rate
[19:51:11] *however* it does look worse in recent days, and the throughput looks relatively the same as it was prior to that
[19:51:41] even though the CPU usage looks flat?
[19:57:22] well that's kind of what I mean, it's dropping packets now but throughput seems much the same, as does CPU
[19:59:50] the only thing I see is the softirq metric https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=cloudgw1002&var-datasource=thanos&var-cluster=wmcs&from=now-7d&to=now&viewPanel=3
[20:00:45] which was at 20/25% this morning between 9:00 UTC and noon UTC
[20:00:50] with spikes here and there
[20:01:08] but if all those come from the network card and are bound to a single CPU
[20:01:12] and the machine has multiple CPUs
[20:01:27] then the 20%/25% would be a single CPU saturating on those soft IRQ requests
[20:01:33] something like that
[20:03:16] but that does not align with the build failures encountered by CI https://phabricator.wikimedia.org/P71234
[20:03:17] anyway
[20:03:38] it is 9pm, it is about time to have dinner :]
[20:05:10] yeah I gotta log off also
[20:05:21] usage definitely seems to be growing looking at the past few weeks
[20:05:45] the ingress packets should be distributed to multiple cores, based on a hash of the packet and the nic driver setup
[20:05:54] I wish we had the details of those softirqs
[20:06:03] or some graph of the network card busyness/queues etc
[20:06:31] thanks topranks !
[20:06:32] but as you say, whether or not things are optimally tuned for the number of cores and which CPU is handling each rx queue is the thing
[20:06:40] yeah
[20:06:41] there are also NUMA considerations at this kind of bandwidth
[20:06:45] https://usercontent.irccloud-cdn.com/file/ISDY3Gzp/image.png
[20:07:00] I remember in 2017 when b.black explained to me a challenging issue with IRQ usage and the network
[20:07:04] I messed with the graph to get a better sense, but we have more spikes than in previous weeks
[20:07:07] on some LVS machine that was saturating
[20:07:46] yeah, softirq will max out due to packet processing
[20:08:07] and NUMA is the bus between the CPU cores?
[20:08:15] exactly yeah
[20:08:23] the NIC is connected to one NUMA node
[20:08:25] ahah
[20:08:34] if a CPU on the other one reads/writes a packet it has to cross the "numa bridge"
[20:08:42] which is fine at lower IO rates
[20:08:45] so that course Brandon gave me some years ago paid off in the future!
[20:08:51] but as it rises that's an internal bottleneck
[20:08:52] haha yeah
[20:09:03] I am so glad we have experts at the foundation
[20:09:08] and you're fully correct the irq's may only look small, but if it's maxing one cpu we could be in trouble
[20:09:13] at most orgs, people would just spin up a larger AWS instance
[20:09:16] as in.... the message you posted mentioned queue 4
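On the "I wish we had the details of those softirqs" point above: the kernel exposes per-CPU softirq counts in /proc/softirqs, so a quick way to see whether NET_RX work is piling up on one core is to diff that file over a few seconds. A sketch only; the procfs layout is standard but worth verifying on the host, and the 5-second sample window is arbitrary.

```python
# Sketch: sample /proc/softirqs twice and show how NET_RX work is spread
# across CPUs, to check whether one core is soaking up all the packet softirqs.
import time

def read_net_rx():
    with open("/proc/softirqs") as fh:
        lines = fh.read().splitlines()
    cpus = lines[0].split()                  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        if fields[0] == "NET_RX:":
            return dict(zip(cpus, map(int, fields[1:])))
    raise RuntimeError("NET_RX row not found in /proc/softirqs")

before = read_net_rx()
time.sleep(5)
after = read_net_rx()

for cpu in before:
    delta = after[cpu] - before[cpu]
    print(f"{cpu}: {delta} NET_RX softirqs in 5s")
```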
[20:09:19] haha yeah
[20:09:25] this was very busy this morning
[20:09:27] https://usercontent.irccloud-cdn.com/file/illvm1wi/image.png
[20:09:35] (or worse, add a cronjob to reboot every day)
[20:09:56] hahaha
[20:10:08] I also note if I look at 'htop' the conntrack is using a lot of CPU
[20:10:25] which dashboard is rendering those?
[20:10:44] I was messing with the host dashboard but I don't want to save it
[20:10:49] ah ok
[20:11:11] i basically multiplied by 8 to see the bits/sec and changed a few view settings
[20:11:27] is conntrack involved for UDP packets as well?
[20:11:27] topranks, hashar, thanks for looking, but also you should go eat dinner if there's no immediate cause for alarm :)
[20:11:37] yeah true
[20:11:39] you are wise
[20:11:54] yeah it is - and in some ways worse for udp, as they just sit there until the box hasn't seen a packet for X mins (configurable)
[20:12:31]
[20:13:54] https://fr.wikipedia.org/wiki/Le_Cri#/media/Fichier:Edvard_Munch,_1893,_The_Scream,_oil,_tempera_and_pastel_on_cardboard,_91_x_73_cm,_National_Gallery_of_Norway.jpg
[20:14:27] lol
[20:14:41] total conntracks through the box are relatively steady, despite the increased throughput
[20:15:11] I'll have a chat with arturo in the morning; one thing for certain, we are at traffic levels where considering some of that tuning is probably needed
[20:15:28] though I can't say for sure what the cause is
[20:15:45] to be continued
[20:15:54] meanwhile, I am going to have my crepes ( https://fr.wikipedia.org/wiki/Galette_de_sarrasin_(Haute-Bretagne)#/media/Fichier:Galette_de_sarrasin_compl%C3%A8te_bretonne.jpg )
[20:16:01] it is about time. Thank you andrewbogott and topranks !
[20:16:10] enjoy!
[20:17:17] * andrewbogott never totally convinced by buckwheat crêpes
[20:19:45] * andrewbogott should probably learn to like buckwheat in preparation for the coming famine
[21:39:57] Hiyo. I'm wondering if we need to have a Tech News (newsletter) entry for the WMCS Network Problems incident? (especially if there are any lingering after-effects that tool-users might be experiencing, or follow-up actions needed from tool-maintainers, beyond what you've gotten from the wikitech-l/cloud-announce emails.)
[21:39:57] If you don't think an entry is needed, that's fine, just let me know.
[21:39:57] If you do think an entry is needed, please help me by drafting some wording, either here or directly in https://meta.wikimedia.org/wiki/Tech/News/2024/49 -- It could be purely informational (e.g. "Last week, some Toolforge tools were unavailable for a few hours because of DNS resolution issues. The errors have been resolved and investigation is ongoing.") or could include follow-up/feedback requests ("If you notice any DNS-related issues, please report them in [[phab:T380844|the Phabricator task]]"). Thanks!
[21:39:58] T380844: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844
[21:44:17] quiddity: there aren't lingering after-effects as far as we know, so I don't think an entry is needed.
[21:44:44] Sounds good, thanks for confirming :)
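As a footnote to the conntrack/UDP exchange above (around 20:10-20:12): the table usage and the UDP timeouts topranks mentioned are visible through procfs, so a quick check could look like the sketch below. The sysctl names are the usual nf_conntrack ones, but they should be verified on the cloudgw host before relying on the numbers.

```python
# Sketch: report conntrack table usage and the UDP timeouts that keep idle
# UDP flows in the table. Paths are the usual nf_conntrack procfs entries;
# verify they exist on the host before trusting the output.
from pathlib import Path

def read_int(path):
    return int(Path(path).read_text().strip())

count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
maximum = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
udp_timeout = read_int("/proc/sys/net/netfilter/nf_conntrack_udp_timeout")
udp_stream_timeout = read_int("/proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream")

print(f"conntrack entries: {count}/{maximum} ({100 * count / maximum:.1f}% full)")
print(f"UDP entry timeout: {udp_timeout}s (stream: {udp_stream_timeout}s)")
```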