[08:48:20] arturo: disable puppet and set quiet=no
[08:48:23] that is a cool hack :-]
[08:48:53] let's see how that goes
[08:49:20] I will see whether I can make Jenkins notify/email out whenever `Could not resolve host` is found in a build output
[09:00:31] arturo: I found one at 08:30 or 30 minutes ago
[09:01:02] hashar: let's wait for the next one, I don't think the logs were enabled back then
[09:01:03] 08:30:44 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/EventLogging/': Could not resolve host: gerrit.wikimedia.org'
[09:01:03] from https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium/44258/console
[09:01:10] yeah I think it was too short
[09:02:22] I went to grep through the console log files from the last 60 minutes using:
[09:02:22] ssh contint.wikimedia.org find /srv/jenkins/builds -maxdepth 3 -name log -cmin -60 -exec grep "'Could not resolve host'" {} \+
[11:31:55] dhinus: are you around for a couple of tofu-infra reviews?
[11:32:04] oh, you are at a conference, no?
[11:32:24] nevermind then
[12:31:24] arturo: we had some more dns failures at 10:38:47 11:07:18 11:07:24 11:07:09 11:07:14 11:07:35 :)
[12:35:16] hashar: ok, let me see
[12:36:14] hashar: which name? gerrit again?
[12:36:27] for the timestamps above yes, for gerrit.wikimedia.org
[12:36:49] there were a couple originating from deployment-prep which were making requests to commons.wikimedia.org
[12:36:59] do you have an origin IP address?
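[Editor's note: as a rough illustration of the find/grep one-liner above, the same scan could look like this in Python. This is a sketch only: the `<job>/<build number>/log` layout is an assumption about the Jenkins builds directory, and the age check here uses mtime where the chat command uses find's `-cmin` (ctime).]

```python
import time
from pathlib import Path

def scan(root, max_age_minutes=60, pattern="Could not resolve host"):
    """Return {log path: [matching lines]} for build logs changed recently.

    Roughly mirrors:
      find /srv/jenkins/builds -maxdepth 3 -name log -cmin -60 \
           -exec grep 'Could not resolve host' {} +
    """
    cutoff = time.time() - max_age_minutes * 60
    hits = {}
    # Assumed layout: builds live as <job>/<build number>/log, hence */*/log.
    for path in Path(root).glob("*/*/log"):
        if path.stat().st_mtime < cutoff:  # approximates -cmin -60 via mtime
            continue
        matches = [line for line in
                   path.read_text(errors="replace").splitlines()
                   if pattern in line]
        if matches:
            hits[str(path)] = matches
    return hits
```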
[12:37:05] ah
[12:37:17] one of the VMs having errors
[12:37:30] the timestamps above are all from different VMs
[12:38:02] ok, at least one would be enough for one, so I can grep the logs
[12:38:09] for now*
[12:38:21] from deployment-prep, the requests to commons.wikimedia.org failed at 12:36:26 and 13:05:46 and originate from deployment-mediawiki14
[12:38:47] ok, that's 172.16.3.170
[12:44:45] arturo: I'm at the conference yes, I can review on Monday :)
[12:46:36] hashar: unfortunately the pdns logs are too much for journalctl, I don't know if we are missing messages
[12:46:39] Oct 04 12:36:10 cloudservices1005 systemd-journald[2014739]: Suppressed 60916 messages from pdns-recursor.service
[12:46:39] Oct 04 12:36:40 cloudservices1005 systemd-journald[2014739]: Suppressed 65894 messages from pdns-recursor.service
[12:46:42] dhinus: ack
[12:49:22] arturo: bummer :D
[12:50:30] then maybe that is because they were all duplicate messages of whatever message is above?
[12:50:44] hopefully!
[12:54:30] hashar: I'm not finding anything at first sight
[12:54:52] I noticed the question is never "Question answered from packet cache" for commons.wikimedia.org
[12:54:58] I don't know why
[12:55:24] aborrero@cloudservices1005:~ $ sudo journalctl -u pdns-recursor.service | grep "packet cache" | grep qname=\"commons.wikimedia.org\" | wc -l
[12:55:24] 0
[12:55:24] aborrero@cloudservices1005:~ $ sudo journalctl -u pdns-recursor.service | grep "packet cache" | grep qname=\"gerrit.wikimedia.org\" | wc -l
[12:55:24] 25173
[12:55:56] because of the number of log entries, each command like that takes 2m to run
[12:56:07] so walking through this is a bit slow
[12:56:47] yeah I guess it is way too verbose if it logs every single query that it receives
[12:57:30] on the task I have mentioned `log-common-errors` which might yield less. Hard to know what it would log though
[13:01:11] that is already enabled
[13:01:31] and disable quiet since that seems to log every single query?
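[Editor's note: the journalctl | grep | wc -l pipelines above rescan the whole journal once per qname, which is why each run takes ~2 minutes. A single pass could count packet-cache answers per qname and total the journald "Suppressed N messages" lines at the same time. This is a hypothetical sketch; the line shapes it matches come only from the snippets pasted in the chat, not from the full pdns-recursor log format.]

```python
import re

SUPPRESSED = re.compile(r"Suppressed (\d+) messages from pdns-recursor")
QNAME = re.compile(r'qname="([^"]+)"')

def summarize(lines):
    """One pass over journal lines: (packet-cache counts per qname,
    total messages journald reports as suppressed)."""
    counts, suppressed = {}, 0
    for line in lines:
        m = SUPPRESSED.search(line)
        if m:
            # journald rate limiting dropped entries here; count the gap.
            suppressed += int(m.group(1))
            continue
        if "packet cache" in line:
            m = QNAME.search(line)
            if m:
                counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts, suppressed
```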
[13:01:52] it is hard to know though, since we don't know what the actual root cause is or what should be logged :D
[13:01:55] then we would be back to the initial setup
[13:02:33] my alternative is to spam dns queries while capturing packets in the hope the issue eventually surfaces
[14:22:54] I started writing this yesterday to run inside a container in a loop xd
[14:22:58] https://www.irccloud.com/pastebin/BzBHgQHG/
[14:23:15] (untested and unfinished)
[14:26:35] can you paste it to the task? :)
[14:26:41] or somewhere in a git repo maybe
[14:27:30] my stupid one was https://phabricator.wikimedia.org/P69462
[14:28:07] spam queries over 8 threads with 250ms intervals
[14:29:05] yours looks greater with the tcpdump capture built in
[14:31:03] I am off
[15:41:20] * arturo offline
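[Editor's note: neither paste (the irccloud pastebin nor P69462) is reproduced here. As a made-up sketch of the approach described at 14:28:07, "spam queries over 8 threads with 250ms intervals", the loop below hammers a name from several threads and records each failure with a timestamp, so a tcpdump capture running alongside can be matched against the failing lookups. The function and parameter names are invented; the resolver is injectable so the loop can be tested offline.]

```python
import socket
import threading
import time

def spam(name, resolve=socket.getaddrinfo, threads=8, interval=0.25,
         duration=5.0):
    """Resolve `name` repeatedly from N threads for `duration` seconds;
    return a list of (timestamp, error message) for every failure."""
    failures = []
    lock = threading.Lock()
    deadline = time.time() + duration

    def worker():
        while time.time() < deadline:
            try:
                resolve(name, 443)
            except OSError as exc:  # socket.gaierror is a subclass
                with lock:
                    failures.append((time.time(), str(exc)))
            time.sleep(interval)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return failures
```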