[06:25:46] I've created this for the DNS issue: https://phabricator.wikimedia.org/T428541
[07:09:58] !incidents
[07:09:58] 8064 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[07:10:01] !ack
[07:10:02] 8064 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[07:10:12] * Emperor woken up by the p.age will be awake soon hopefully
[07:11:41] +1
[07:13:07] marostegui: I think I have a fix for the DNS issue, working on it now
[07:13:16] cc: federico3
[07:13:17] elukey: <3
[07:19:39] marostegui: can you retry authdns update?
[07:20:16] elukey: thx, indeed that should be the fix https://netbox.wikimedia.org/extras/changelog/280023/
[07:20:54] and this the cause: https://netbox.wikimedia.org/extras/changelog/280014/
[07:21:23] elukey: trying
[07:21:34] elukey: I'm wondering if we should add a Netbox validator to ensure a DNS name is unique per address family (no more than 1 v4 and 1 v6, or if there are some edge cases
[07:21:56] XioNoX: it would be nice yes, these are things that may impact multiple people :(
[07:22:33] federico3: your change looks mergeable now
[07:23:03] elukey: worked like a charm
[07:23:28] elukey: thanks
[07:39:49] elukey: https://phabricator.wikimedia.org/P93941
[07:40:42] you can ignore the anycast one, dunno if the `blubberoid` is a typo or on purpose https://netbox.wikimedia.org/search/?q=blubberoid.svc.eqiad.wmnet
[07:41:55] effie: ^
[07:43:08] oh you monster
[07:43:28] fixed reth0-1134.pfw1-eqiad.frack.eqiad.wmnet and ehternet-1-48.asw1-23-ulsfo.ulsfo.wmnet
[07:44:06] but I think we're good to add a validator on dns name, for the anycast ones we can just ignore if the IPs are similar (cc topranks)
[07:44:52] effie: see https://netbox.wikimedia.org/search/?q=blubberoid.svc.eqiad.wmnet both eqiad and codfw IPs have a .eqiad. fqdn
[07:45:24] XioNoX: I was checking before answering
[07:45:39] blubberoid has been retired since a long time ago
[07:45:52] no rush, just being polite by giving more context than just a ^ :)
[07:47:01] it just slipped through the cracks I reckon
[07:48:06] XioNoX: nice!
[07:48:56] XioNoX: filed a task, cheers
[07:50:39] effie: thx, can you share the task ID?
[07:54:00] I subscribed you
[07:55:01] XioNoX: +1 for the validator. The Anycast IP objects should have a role=anycast so we can ignore based on that, and/or only alert if two dns_names point at different IPs
[07:57:49] topranks, elukey: opened https://phabricator.wikimedia.org/T428546 feel free to grab it :)
[08:01:55] XioNoX: is that a mild nerd snipe attempt? :D
[08:03:16] Just a great opportunity!
[08:03:31] hahahaha
[08:03:37] only 1 left, hurry before it's too late
[08:03:41] I’m happy to take a look
[09:26:20] elukey is this normal?
[09:26:22] [09:25:57] marostegui@cumin1003:~$ host 10.67.28.73
[09:26:22] Host 73.28.67.10.in-addr.arpa not found: 2(SERVFAIL)
[09:27:55] marostegui: lovely, never seen it before. Does it happen only with that IP, or more?
[09:28:22] elukey: I was debugging something and found that it, I just tried with another host and it does look fine
[09:31:13] marostegui: but how did you come up with the IP? Is it appearing in some log etc..? Seems a k8s one in theory, so I am wondering if IPs belonging to dead pods may end up with servfail and not nxdomain for $reason
[09:31:29] elukey: yeah, it was from a show processlist in a database
[09:31:37] And I wanted to check which host it was
[09:31:39] To ssh there etc
[09:32:39] I tried other stuff like 10.67.28.71 etc.. and they end up in servfail, so I am almost sure the IP you found was related to a pod that got killed
[09:33:34] elukey: but that IP is actively doing things eh
[09:35:45] marostegui: oh ok I didn't get this bit before
[09:36:15] mmmm maybe it is related to coredns on the target k8s cluster
[09:37:00] netbox suggests that it is a dse k8s pod https://netbox.wikimedia.org/ipam/prefixes/538/
[09:37:17] elukey: The thing is a bit tricky, because there's an IP doing stuff but I cannot access it as it doesn't resolve a name
[09:37:43] marostegui: yeah I know, lemme try to find more info about the IP.
[09:37:54] elukey: <3
[09:43:52] wikidatawiki-sql-xml-wikidatawiki-dump-remaining-full-ev2anki running on dse-k8s-worker1014.eqiad.wmnet
[09:43:57] marostegui: --^
[09:44:19] elukey: oh thanks
[09:44:27] elukey: in any case, shouldn't that IP resolve to something?
[09:45:07] yeah for sure, I have no idea why it doesn't do it
[09:45:26] but I don't have the specific about how it is implemented for other clusters, maybe something is missing for k8s dse
[09:45:33] topranks: any idea?
[09:46:07] TL;DR 10.67.28.73 is a pod that runs on k8s dse eqiad and reverse lookup fails with servfail
[09:49:09] * topranks looking
[09:49:18] SERVFAIL is odd though
[09:49:35] If you guys want me to create a task, let me know
[09:50:17] I get an answer, but I think it reveals what might be the issue, two hostnames returned
[09:50:20] https://www.irccloud.com/pastebin/IalIrHgo/
[09:50:39] I bet it's Arzhel trying to nerd-snipe us even more into making that Netbox validator :P
[09:51:04] wait sorry that's the NS entries.... this is a k8s IP
[09:52:50] so yeah the issue is the control nodes for that k8s cluster don't see themselves as auth for the range
[09:52:57] let me double check the delegation is still valid
[09:58:41] elukey: the control plane nodes for the dse cluster we have in the dns are still correct it seems:
[09:58:42] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/kubernetes.yaml#873
[09:59:57] in fact they do return answers for other IPs in the same range
[10:00:01] cmooney@cumin1003:~$ dig -4 +noall +answer -x 10.67.28.1 @dse-k8s-ctrl1001.eqiad.wmnet
[10:00:01] 1.28.67.10.in-addr.arpa. 5 IN PTR 10-67-28-1.postgresql-growthbook-next-r.growthbook-next.svc.cluster.local.
[10:00:01] 1.28.67.10.in-addr.arpa. 5 IN PTR 10-67-28-1.postgresql-growthbook-next-ro.growthbook-next.svc.cluster.local.
[10:01:18] So long story short I don't know why they aren't returning anything for the other IP
[10:01:30] Some k8s / core dns magic is missing I assume but beyond me sorry :(
[10:10:07] * elukey nods
[10:21:14] thanks for checking <3
[10:27:08] topranks: is it possible that the reverse delegation config in the dns repo is missing https://netbox.wikimedia.org/ipam/prefixes/538/ ? I see only the old /24 pod subnets for dse, they now have the new /21
[10:27:53] no ok there are multiple /24s okok
[10:29:10] yeah it has to be done on a per /21 basis
[10:29:21] but even then the IP I checked above is in the same /24 so should be ok
[10:29:34] I don't know why you get SERVFAIL though, that is still puzzling me, I didn't get that
[10:30:04] * marostegui another successful nerdsnipped
[10:30:21] expertly done :D
[10:36:08] marostegui: jokes aside, let's open a task
[10:36:24] elukey: Ok, doing it
[10:38:54] elukey topranks https://phabricator.wikimedia.org/T428573
[13:17:32] codfw A4 maintenance is finished
[13:17:44] moritzm: you can repool the ganeti hosts anytime now
[13:18:24] ack, will do that in a bit
[14:58:55] XioNoX: for the netbox patch, I just merge and wait?
[14:59:03] sure
[14:59:05] excellent
[15:11:19] XioNoX: nice doing business with you, will buy again, A+++
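
Editor's note on the validator idea from 07:21 and 07:55 (T428546): the rule discussed is roughly "a DNS name may appear on at most one IPv4 and one IPv6 address, ignoring objects with role=anycast". Below is a minimal standalone sketch of that check in plain Python. It does not use the Netbox API, and `IPRecord` and `find_dns_name_conflicts` are hypothetical names made up for illustration; how the actual patch implements the rule is not shown in this log.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_interface


@dataclass(frozen=True)
class IPRecord:
    """Hypothetical stand-in for a Netbox IP address object (not the real model)."""
    address: str    # address with prefix length, e.g. "10.0.0.1/24"
    dns_name: str   # may be empty when no DNS name is set
    role: str = ""  # "anycast" for Anycast IP objects, as suggested in the chat


def find_dns_name_conflicts(records):
    """Return {(dns_name, ip_version): [addresses]} for names used on more than
    one distinct address within the same address family.

    Records with role "anycast" are skipped, and the same address repeated
    under one name is not counted as a conflict.
    """
    seen = defaultdict(set)
    for rec in records:
        if not rec.dns_name or rec.role == "anycast":
            continue
        ip = ip_interface(rec.address).ip
        seen[(rec.dns_name.lower(), ip.version)].add(str(ip))
    return {key: sorted(addrs) for key, addrs in seen.items() if len(addrs) > 1}
```

With this shape, one v4 plus one v6 address on the same name passes, while two different v4 addresses on the same name are reported together so both offending objects can be shown in the error.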
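
Editor's note on the reverse lookup from 09:26: `host 10.67.28.73` queries the PTR name `73.28.67.10.in-addr.arpa`, and the 10:29 remark that 10.67.28.1 and 10.67.28.73 sit in the same /24 can be checked offline. A small sketch using only the Python standard library follows; the helper names are hypothetical and the code performs no DNS queries, so it says nothing about why the server answered SERVFAIL.

```python
from ipaddress import ip_address, ip_network


def ptr_name(ip: str) -> str:
    """Return the reverse-DNS (PTR) query name for an IP address."""
    return ip_address(ip).reverse_pointer


def same_slash24(ip_a: str, ip_b: str) -> bool:
    """True if both IPv4 addresses fall inside the same /24 network."""
    net = ip_network(f"{ip_a}/24", strict=False)  # strict=False masks host bits
    return ip_address(ip_b) in net


print(ptr_name("10.67.28.73"))
print(same_slash24("10.67.28.1", "10.67.28.73"))
```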