[08:28:56] Raymond_Ndibe: what is the current status of the deploy pipeline?
[08:58:19] is there a task tracking the deploy pipeline issue?
[10:23:47] afaik no
[12:55:49] it's already fixed
[13:06:49] thanks for the update
[13:57:29] Raymond_Ndibe there seem to be some issues still
[13:57:54] https://www.irccloud.com/pastebin/4mjsZ8Ni/
[13:59:10] I think you just need to retry it or something. first attempt should update the branch if it's not already up to date
[13:59:56] a second run fails in the same way
[14:00:18] `toolforge-deploy/utils/run_functional_tests.sh: invalid option -- 'b'`
[14:00:30] this is on toolsbeta?
[14:00:33] yes
[14:00:38] looking
[14:03:50] seems to be a bug. the branch is checked out but is pointing to a different commit https://www.irccloud.com/pastebin/3z7cWEVQ/
[14:04:31] will update the branch from toolsbeta so you can deploy, then get rid of the bug
[14:04:54] thanks
[14:06:45] should work now on toolsbeta. I might need to do the same on tools too
[14:10:53] topranks: what do you suggest as next steps for T374830 if we want to focus on the dns issues? I can certainly reproduce the dns failure from within the cloud, although it's intermittent so it often takes a while to see.
[14:10:54] T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[14:17:46] andrewbogott: I'm not sure tbh. Look for patterns I guess
[14:17:58] like where can we reproduce it from, when does it start happening, etc.
[14:18:18] try to narrow down the factors
[14:18:53] the long ping response times are definitely not good, but I think the DNS thing must be something else, as I doubt the query timeout for those clients is less than 1 second
[14:19:06] Do you not think the 800ms ping times matter, or do you think that's an unrelated but still real issue? (Remember that in my initial tests the slow pings corresponded to the dns failures)
[14:19:10] some pcaps of what happens when we see it would help (as ever)
[14:19:22] it's definitely "real"
[14:19:43] and likely an issue, though perhaps there is a benign explanation for it (like something is just not prioritising ping and that's why)
[14:20:07] I don't think 800ms is gonna make a dns query time out though (despite that being a shocking rtt), so likely something else is causing the timeouts
[14:22:15] I don't think I know how to do the pcap. Is that a cli thing or something that happens on the switch?
[14:35:48] I'm about to depool some ceph mons. I don't expect noticeable effects but please let me know if anyone sees ceph issues.
[14:44:53] pcap is just a tcpdump of the traffic
[14:45:16] sudo tcpdump -i <interface> -s0 -w /tmp/filename.pcap
[14:45:38] gotta watch you're not dumping 60GB of traffic though - I hear some idiot did that recently :P
[14:45:42] so a filter might be an idea
[14:45:49] sudo tcpdump -i <interface> -s0 -w /tmp/filename.pcap port 53
[14:47:05] topranks: ok, I will try, although it'll be a big dump if the issue only happens once/hour :)
[14:47:21] dhinus: do you know if openstack host aggregates are managed by tofu?
[14:47:39] Raymond_Ndibe: toolsbeta worked now, tools still not
[14:48:47] andrewbogott: so the problem is we get 1 timed-out dns query every hour?
[14:49:12] it's not predictable, it just sometimes takes an hour to see. Just now it happened 5 minutes after I started the test.
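
A rough sketch of the kind of filtered capture suggested above, combined with a query loop to catch the intermittent failure without dumping gigabytes of traffic. The interface name, output paths, and the choice of gerrit.wikimedia.org as the test lookup are assumptions for illustration, not part of the original commands:

    #!/usr/bin/env bash
    # Sketch only: ring-buffered DNS capture plus a lookup loop, so the
    # pcap stays small even if the failure takes an hour to show up.
    IFACE=ens3              # assumption: adjust to the VM's real interface
    OUTDIR=/tmp/dnscap      # placeholder output directory
    mkdir -p "$OUTDIR"

    # Capture only port 53 traffic, rotating through 10 files of ~100MB each.
    sudo tcpdump -i "$IFACE" -s0 -w "$OUTDIR/dns.pcap" -C 100 -W 10 port 53 &
    TCPDUMP_PID=$!
    trap 'sudo kill "$TCPDUMP_PID"' EXIT

    # Log the timestamp of every failed or timed-out lookup so it can be
    # matched against the capture afterwards.
    while true; do
        if ! dig +time=2 +tries=1 gerrit.wikimedia.org >/dev/null; then
            echo "$(date -u +%FT%TZ) lookup failed" | tee -a "$OUTDIR/failures.log"
        fi
        sleep 1
    done
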
[14:49:15] or at least that's the extent to which we've been able to see it
[14:49:16] yeah
[14:49:22] ok
[15:16:14] blancadesal: I'm aware of that. Currently testing a fix
[15:16:23] ok!
[15:18:45] the problem was with `git reset --hard FETCH_HEAD`. turns out `git fetch --all` doesn't update `FETCH_HEAD`. solution is to use `git reset --hard origin/<branch>` instead. Wasn't obvious before because we were only testing on the main branch
[15:19:21] Are there any tickets open involving an openstack upgrade to dalmatian?
[15:22:57] blancadesal: should be fixed now
[15:27:42] Raymond_Ndibe: getting this on tools now:
[15:27:48] https://www.irccloud.com/pastebin/bwPDsyvz/
[15:28:42] sometimes you might need to rerun it
[15:51:22] andrewbogott: re: openstack aggregates, they are not in tofu at the moment. There is a ticket
[16:08:29] thanks arturo
[17:51:43] andrewbogott: FYI I'm no longer draining the cloudvirt today, postponed to tomorrow!
[17:51:51] ok!
[18:08:46] tf-infra-test failed with a new error: lookup puppet-enc.cloudinfra.wmcloud.org on 172.20.255.1:53 i/o timeout
[18:09:07] I guess yet another occurrence of the DNS issue? never seen that before in tf-infra-test
[18:17:36] added a comment to T374830
[18:17:37] T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[18:45:17] dhinus, if still here, any idea why the tools-legacy setup keeps failing? I rebooted it and it worked for a while but seems locked up again.
[18:45:28] I can investigate, just checking in case you already have
[18:45:37] andrewbogott: not checked yet, no
[18:45:44] ok!
[18:56:24] wow this server sure is busy!
[18:56:54] * dhinus offline
[18:59:59] it's been a long time since I looked at how much traffic is still coming into toolserver.org, and the answer is: lots
[19:07:08] andrewbogott: last I looked, toolserver.org was mostly old things requesting tiles over http
[19:07:44] at some point i was thinking about hsts preloading that domain and then disabling plaintext http access
[19:08:08] (also, that server handles tools.wmflabs.org too, which is a different story)
[19:08:16] it seems pretty varied right now but eyeballs are not great at producing statistics
[19:09:31] well, no, it's mostly pages under ~geohack ~kolossos and ~para
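
For reference, a minimal sketch of the branch-update fix described in the 15:18:45 message, resetting to the remote-tracking ref rather than FETCH_HEAD. The script name, argument handling, and remote name `origin` are illustrative assumptions, not the actual toolforge-deploy code:

    #!/usr/bin/env bash
    # Sketch of the fix discussed above; not the real toolforge-deploy script.
    set -euo pipefail

    BRANCH="${1:?usage: update_branch.sh <branch>}"

    git fetch origin "$BRANCH"
    git checkout "$BRANCH"
    # Per the discussion above, FETCH_HEAD was not being refreshed as the
    # script expected, so reset to the remote-tracking ref instead:
    git reset --hard "origin/$BRANCH"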