[07:55:48] hi, last night I flagged a task as unbreak now https://phabricator.wikimedia.org/T374830 [07:56:27] cause apparently DNS requests fail transiently and there are some cases for which it happens on multiple hosts at the same time :D That leads me to believe there is an issue with the WMCS DNS recursor :-] [07:57:10] (I also don't know why the local instance doesn't serve a stale response from cache in that case, but I guess that is how DNS works: once an entry has expired, it is expired) [07:58:08] 👀 though I have a meeting in 5 min [07:58:15] yeah no worries [07:58:23] that has been happening for at least 15 days :] [07:58:39] maybe UBN might not be the severity then xd [07:59:01] is there a "super high, but not urgent"? [07:59:07] I only marked it ubn yesterday cause at least two people asked about it and to make sure it is not forgotten [07:59:08] yeah [07:59:14] let me lower the prio [08:00:30] it is high now :) [08:01:22] thanks! [08:06:42] I can take a look [08:07:09] I don't know anything about the dns system anymore :/ [08:08:10] it seems to happen on any host, it is transient, and yesterday there was a case of two requests failing at the same time from two different hosts [08:24:23] hashar: at which time did it happen yesterday? I would like to correlate with the SAL [08:25:37] arturo: there were issues during the EU evening deploy window [08:25:58] RhinosF1: do you have a UTC time frame for that? [08:26:10] arturo: 20:00–21:00 UTC [08:26:47] I can get a bit tighter window if needed from -operations logs [08:26:48] thanks [08:27:21] <_joe_> I don't think it's very probable the issue is anything but networking in openstack tbh [08:27:43] <_joe_> I left a suggestion to try to switch name resolution to TCP on a few hosts to see if failures are rarer in that case [08:29:44] cscott pasted a bunch of builds that failed in a row at around 20:13 UTC yesterday [08:29:53] https://phabricator.wikimedia.org/T374830#10198805 [08:30:42] and earlier yesterday Aharoni mentioned some builds failing https://phabricator.wikimedia.org/T374830#10197529 [08:31:14] two of them had DNS resolution issues at 17:56:23 from two different hosts, which led me to conclude that the recursor has some problem [08:31:19] _joe_: in your hunch, what kind of networking problem would that be? simple packet loss? [08:31:20] or well, the network is flappy :D [08:40:13] <_joe_> arturo: that's the simplest explanation, yes [08:40:38] <_joe_> I would not expect it to be a firewall [08:41:19] <_joe_> or maybe it's a dns recursor issue and we notice it with CI because that's the thing that makes most dns queries [08:41:40] another point is that the host lookup always happens when doing the git clones [08:41:47] which is done with up to 8 threads in parallel [08:42:25] then I guess the rest of the builds barely do dns lookups or they retry on errors maybe
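(A minimal sketch of how the transient failures could be reproduced from an affected VM, independent of CI; the recursor address is a placeholder for whatever /etc/resolv.conf points at, and gerrit.wikimedia.org is just an example of a name the git clones would resolve:)

    while true; do
      # a query that times out or SERVFAILs will not print "status: NOERROR"
      dig +time=1 +tries=1 @<recursor-ip> gerrit.wikimedia.org | grep -q NOERROR \
        || echo "lookup failed at $(date -u +%FT%T)"
      sleep 1
    done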
[08:47:07] turns out, our DNS resolver has an anycast address. The anycast healthchecker uses a query to www.wikimedia.org to verify its health [08:47:33] there is not a single health check failure in the logs, as far back as the logs go [08:47:48] the check is done every second [08:48:03] the query always returns in about 100ms [08:48:12] example: [08:48:15] https://www.irccloud.com/pastebin/wflwEoju/ [08:49:24] I also reviewed how many recursive DNS queries we are serving in the time frame mentioned, and it is about 2k req/s, no spikes or anything weird on the graph [08:49:26] https://usercontent.irccloud-cdn.com/file/qLnWKyBB/image.png [08:50:04] all other edge network metrics are just fine [08:50:42] we have been reshuffling ceph data lately, I wonder if switch port saturation (a real thing) could result in packet loss [08:52:09] cloudservices1005 is in D5 [08:54:00] topranks: could you help me understand this graph? https://librenms.wikimedia.org/graphs/to=1727945400/device=185/type=device_bits/from=1727859000/ is the switch saturated between 19:00 and 20:00? [08:54:28] ceph finished reshuffling two days ago (it was not rebalancing tonight) [08:56:08] I don't see any suspicious spike on the drops either [08:56:11] https://usercontent.irccloud-cdn.com/file/SfH5Hagv/image.png [08:57:10] (that would get to thousands/s when it starts dropping) [08:57:27] is discard the same as drops? [08:58:04] I think so yes [08:58:45] I'm also scanning gnmi stats on grafana [08:58:45] https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/network-interface-throughput-gnmi?orgId=1&refresh=2h&var-site=eqiad%20prometheus%2Fops&var-device=cloudsw1-d5-eqiad&var-interface=xe-0%2F0%2F41&from=now-2d&to=now [08:59:31] gnmi stats are so much nicer [08:59:38] https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/network-interface-throughput-gnmi?orgId=1&refresh=2h&var-site=eqiad%20prometheus%2Fops&var-device=cloudsw1-d5-eqiad&var-interface=xe-0%2F0%2F41&var-interface=et-0%2F0%2F52&var-interface=et-0%2F0%2F53&var-interface=et-0%2F0%2F48&var-interface=et-0%2F0%2F49&from=now-2d&to=now [08:59:47] I don't see anything obvious by looking at those graphs [09:00:11] which in my mind confirms the data from the anycast healthchecker [09:02:43] hmm... I don't seem to find the discards for regular interfaces using gnmi :/ [09:17:44] I pasted the timestamps of all builds having the "could not lookup host" error: https://phabricator.wikimedia.org/T374830#10198891 [09:36:04] that is awesome, thanks [10:04:57] * dcaro lunch [10:10:56] fb u [10:11:53] eh... sorry [10:11:56] * topranks looking [10:12:18] oh wow, we've got gnmi for cloudsw??? [10:12:51] luca fixed the cfssl error, awesome :) [10:13:51] hmm there are little gaps I see, might need to look at the config [10:14:39] arturo: the "total traffic" graph isn't really that useful to look at [10:14:48] ack [10:14:54] the switch is certainly not saturating if the total traffic through it is 25G [10:15:07] *however*, if that is coming from one or two interfaces then maybe it is [10:15:11] we need to look at the interfaces [10:16:14] https://grafana-rw.wikimedia.org/d/5p97dAASz/network-interface-queue-and-error-stats?orgId=1&var-site=eqiad%20prometheus%2Fops&var-device=cloudsw1-d5-eqiad&var-interface=All [10:16:19] ^^ this one is good [10:16:47] top panel shows there have been no tail drops on it in the past 24h which is good [10:16:57] ack [10:18:01] there are quite a few "red drops" however, on the links from it to E4/F4 [10:18:12] what does red drop mean?
[10:18:22] https://grafana-rw.wikimedia.org/d/5p97dAASz/network-interface-queue-and-error-stats?orgId=1&var-site=eqiad%20prometheus%2Fops&var-device=cloudsw1-d5-eqiad&var-interface=et-0%2F0%2F52&var-interface=et-0%2F0%2F53 [10:18:37] * arturo reads https://en.wikipedia.org/wiki/Random_early_detection [10:18:43] ha, was just about to link that [10:18:57] it basically means that you're getting to 90% or more of utilization on the link [10:19:09] the switch starts dropping _some_ packets before it maxes out [10:19:45] this - assuming the traffic is TCP - is intended to alert the TCP congestion control algo on the end hosts that there is congestion, and cause them to back off the send rate prior to us hitting total saturation and a complete stop [10:20:52] but yeah - basically at some points in time the usage on those links was exceeding capacity [10:21:24] it's worst on the link to E4, 0.00148% drop rate in total [10:21:27] so fairly small but still [10:24:53] it's fairly consistent, happening all the time really [10:27:18] does the original question relate to the dns timeouts?? [10:27:43] I think those drops are unlikely to be the cause of that, assuming a relatively sane dns client (which would retry) anyway [10:28:14] the chance of two subsequent dns requests from one host being in the 0.00148% of drops seems unlikely [10:28:39] I see [11:15:48] hashar: could you please try setting `profile::resolving::timeout: 5` in horizon project puppet, to see if that results in anything different? [11:17:41] I am in an interview currently ;) [11:22:49] something is very wrong if a dns query is taking 5 seconds to get an answer [11:23:19] I know, but we are currently doing 1, which is 5 times less than the default [11:23:24] a short timeout + retry is better if that is possible [11:23:26] 1 is fine [11:23:33] if there is no answer in 1.... there will never be one [11:23:38] but I hear you, it might be worth a shot [11:52:42] _joe_'s suggestion of using tcp should help if the issue is the dropping of packets, no? [11:52:50] (tcp would just retry itself) [11:55:09] arturo: dcaro: feel free to apply any DNS setting you feel is relevant to the issue [11:55:31] I am unlikely to be able to follow this afternoon, I have too many video calls unfortunately :/ [11:56:15] using TCP sounds worth checking as well, that needs some Puppet change to adjust the current resolv.conf.erb to support that [11:59:43] * dcaro in meetings too [12:10:57] <_joe_> dcaro: yes that was the hypothesis I wanted to verify or falsify [12:29:19] I also found a message complaining about a "Could not resolve host" issue on September 16th at https://phabricator.wikimedia.org/T365116#10150953 [12:29:45] and https://phabricator.wikimedia.org/T374830 was filed the same day [12:32:47] hashar: that one is unrelated [12:32:54] deployment-imagescaler03.deployment-prep.eqiad.wmflabs does not exist as a FQDN [12:33:48] * f0cb21ad - Delete data for deployment-prep deployment-imagescaler03.deployment-prep.eqiad1.wikimedia.cloud (1 year, 7 months ago) | [12:33:57] I should have checked. Sorry! :D [12:40:49] np
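(For reference, a sketch of what the two knobs discussed above - the per-query timeout and forcing TCP - would look like at the resolv.conf level, assuming glibc's stub resolver; the nameserver address is a placeholder, attempts:3 is illustrative, and the real file is generated by Puppet from resolv.conf.erb:)

    # /etc/resolv.conf -- illustrative values only
    search deployment-prep.eqiad1.wikimedia.cloud
    nameserver <recursor-anycast-ip>
    # timeout:1 is roughly what the profile::resolving::timeout hiera key controls
    options timeout:1 attempts:3
    # force lookups over TCP, per _joe_'s suggestion (the option the
    # resolv.conf.erb change would need to add)
    options use-vc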
[12:57:59] topranks: do you know off hand what is the best way to add an additional IPv6 to a server interface via puppet? this is for cloudgw [12:58:12] `interface::ip` seems to be only for IPv4? [12:59:39] nevermind, seems to be valid for IPv6 too [14:25:45] toolsbeta upgrade got borked :/, looking [15:18:58] arturo: if you're still looking at the dns issues, they're happening now [15:18:59] rsync: getaddrinfo: deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud 873: Temporary failure in name resolution [15:19:35] RhinosF1: I'm in a meeting [15:19:39] try dig +trace [15:19:58] arturo: from CI so unfortunately can't [15:20:12] RhinosF1: can't you try from the base VM? [15:20:22] arturo: don't have access to that [15:20:33] https://integration.wikimedia.org/ci/job/beta-scap-sync-world/174764/ is the job [15:20:50] have a VM name? [15:21:08] arturo: Took 2 min 42 sec on deployment-deploy04 [15:21:33] 15:17:51 sudo -u mwdeploy -n -- /usr/bin/rsync -l deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud::common/wikiversions*.{json,php} /srv/mediawiki (ran as mwdeploy@deployment-parsoid14.deployment-prep.eqiad1.wikimedia.cloud) returned [10]: rsync: getaddrinfo: deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud 873: Temporary failure in name resolution [15:21:36] is the full error [15:21:50] so deployment-deploy04 trying to resolve parsoid14 [15:21:52] yeah, but where was that command executed? [15:22:06] on deployment-deploy04 arturo [15:22:17] trying to resolve a number of deployment-prep VMs [15:27:36] ok [15:31:28] I checked a bunch of things, but I don't see anything weird [15:36:07] dcaro: is this ready to merge? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/545 [15:36:33] yep, done [15:38:06] btw. the issue with the securityContext is because the new tekton adds the annotation `pod-security.kubernetes.io/enforce=restricted` to the namespace itself, forcing the policy [15:38:18] (in case anyone was interested xd) [15:47:17] RhinosF1: I have not detected anything wrong, I'm sorry [15:54:24] No [15:54:26] Np [16:05:38] I think I have fixed toolsbeta ... it's not nice :/ [16:06:28] I had to manually patch some stuff to skip the webhooks while upgrading, then enable them, and finish deploying the rest manually [16:06:41] (manually as in, using helm directly) [16:18:22] Raymond_Ndibe: blancadesal fun with helm https://etherpad.wikimedia.org/p/2024_tekton_upgrade , the helm chart tweaking/inspecting might be useful [16:18:50] Hello dcaro, still around? I'm getting ERROR: S3 error 403 (QuotaExceeded) while trying to upload a tar file as small as 4MB to the s3 bucket harborstorage on toolsbeta horizon [16:19:13] ooh thanks dcaro, taking a look [16:19:41] hmmm, that seems like an actual quota issue [16:19:42] any idea how to debug QuotaExceeded errors like this? I don't think I have enough permissions for that [16:19:58] let me have a look, should be on the openstack side I think [16:22:49] it says it's using 8G [16:22:52] https://usercontent.irccloud-cdn.com/file/QVcecf3K/image.png [16:24:06] yeah, I think that one is a different thing [16:24:56] for some reason deleting the stuff you uploaded to the storage doesn't completely clean it up. you'll notice there is nothing in it rn but it's still occupying 8G [16:25:33] anyways that shouldn't be an issue unless 8G is the default storage quota. How do I check that? increase that?
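(A sketch of how the radosgw side of this could be inspected from a cloudcontrol node, using the standard upstream radosgw-admin commands; the uid is a placeholder, the actual account owning the toolsbeta bucket would need to be looked up first:)

    # per-bucket object count and size as radosgw sees them
    sudo radosgw-admin bucket stats --bucket=harborstorage
    # per-user usage and any quota attached to the owning account
    sudo radosgw-admin user stats --uid=<owner-uid> --sync-stats
    sudo radosgw-admin user info --uid=<owner-uid> | jq '.user_quota, .bucket_quota'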
[16:25:38] it might be that it does not clean up correctly [16:26:14] as in, you tell radosgw to delete stuff, but maybe openstack does not notice, or similar [16:26:24] looking [16:27:32] I got back to doing the harbor storage thing again and it feels like I'm debugging it with one hand tied behind my back because I don't have all the access I need to do things on horizon. I think the issue is more on the openstack s3 side but I have limited visibility into what's actually going on or the logs [16:27:51] that's the limit yes [16:28:08] https://www.irccloud.com/pastebin/4LCSHGft/ [16:28:26] it should free the space :/ [16:28:33] how are you deleting the stuff? [16:28:53] ok that makes sense. Because I was able to upload stuff initially [16:28:54] btw. what is it that you can't do? [16:29:05] can you sudo on cloudcontrol? [16:29:33] I was deleting it through horizon. also tried through s3cmd [16:30:58] yes I can sudo on cloudcontrol [16:31:06] then you have full openstack cli access :) [16:31:14] (it's more than the UI) [16:31:14] xd [16:31:28] (though harder to use...) [16:31:40] I can't do some advanced things on horizon [16:32:09] ooh, do we have any decent documentation? [16:32:12] it might be that nobody can, what can't you do? (from the cli you can add yourself to any groups/permissions also if needed) [16:32:39] there's some, but the cli changes and sometimes the parameters change, what's possible to do changes too [16:33:05] if you look in wikitech, the admin pages for openstack services usually show commands to run [16:33:23] (there's no full help on the cli, it's more help on the services, and the cli commands are spread around) [16:33:31] almost everything under `Identity` [16:34:12] I will check wikitech again, though it wasn't that much help the first time [16:34:39] I'm mostly interested in viewing the logs of the operations I'm making [16:35:19] the cli will not help there, you can try journalctl and such, though it's spread through the control nodes, cloudlb nodes, etc., that's why we have the logstash stuff [16:35:46] if it's of any consolation, the identity->users page ends up in a timeout for me [16:38:22] yep, rados sees that toolsbeta is using its quota [16:38:25] https://www.irccloud.com/pastebin/fBsAokng/ [16:40:40] It's ok. I'll continue looking. need to find a more reliable way of seeing what is happening under the hood. I will play with it on cloudcontrol a bit more and see if I have any breakthrough [16:43:24] dcaro: I'm also getting a weird error on the lima-kilo registry-admission functional test. It's trying to make a request to `https://toolforge-external-load-balancer:6443` and that fails with `no such host`. Are you aware of this? [16:44:10] no, that does not ring any bells [16:44:35] oh, maybe, it's the reachable-from-outside test? [16:51:08] I doubt that, unless there is a special way to reach it that I don't know of [16:53:04] iirc to be able to reach the tool webservices from within lima-kilo there was some monkeying around with hosts [16:53:05] looking [16:53:37] should not be using that one [16:53:47] https://www.irccloud.com/pastebin/iOOY1eg2/ [17:06:00] Raymond_Ndibe: I suspect that the garbage collection process in radosgw, which should be cleaning up deleted objects, is not working as expected [17:06:17] can you open a task for it?
I'll look into it next week [17:07:38] I see objects that should have been deleted a month ago [17:07:50] https://www.irccloud.com/pastebin/AvrMueMv/ [17:11:19] I manually triggered it, but it seems it did not help [17:11:31] https://www.irccloud.com/pastebin/P1cu5LBW/ [17:12:49] something is weird yep [17:12:55] https://www.irccloud.com/pastebin/Bnu67kWc/ [17:14:14] gtg, if you open a task and subscribe me I'll give it a look on monday
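(For the task: the garbage-collection backlog can be checked with the standard radosgw-admin commands, presumably similar to what was run in the pastes above; --include-all also covers entries whose expiry time has not been reached yet:)

    # list objects still queued for garbage collection
    sudo radosgw-admin gc list --include-all | head -50
    # force a GC pass over everything in the queue
    sudo radosgw-admin gc process --include-all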