[03:24:05] * dhinus paged: ProjectProxyMainProxyDown project-proxy (proxy-04 main-nginx-https page wmcs)
[03:29:11] a similar alert triggered last thursday for the other proxy (proxy-03) and auto-resolved after 5 minutes. this time it's still firing after 15 mins
[03:29:31] checking the proxy-* vm state in horizon
[03:29:55] both proxy-03 and proxy-04 are shown as "running"
[03:31:11] I can ssh to "proxy-04"
[03:32:15] unit "nginx.service" is failed, among a few others
[03:35:46] nginx: [emerg] invalid number of arguments in "resolver" directive in /etc/nginx/sites-enabled/catalyst-qte-wmcloud-org:19
[03:37:13] restarting the unit fails with the same error
[03:40:03] I created an incident doc at https://docs.google.com/document/d/1VBLbLgg5dt_A_g_jHwxgyRMw9amINri4DPe6jc6bshc/edit?tab=t.0#heading=h.m4uq0l1dvtgq
[03:42:47] I am the IC for now
[03:48:07] the similar alert from last thursday seems unrelated: the nginx logs for proxy-03 don't include any errors on Thursday
[03:54:05] the error is caused by a missing IP in /etc/nginx/sites-enabled/catalyst-qte-wmcloud-org:19
[03:54:31] I compared that file between proxy-03 and proxy-04: in proxy-03 the line is "resolver 172.20.255.1;", in proxy-04 it is "resolver ;"
[03:55:07] forcing a puppet run fails with "getaddrinfo: Temporary failure in name resolution"
[03:55:39] this sounds like another instance of T379927
[03:55:40] T379927: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927
[03:56:24] I edited /etc/resolv.conf manually and restarted puppet
[03:57:01] that fixed the issue with puppet AND the issue with nginx
[04:12:25] the alert is no longer firing in alertmanager, but it was not auto-resolved in victorops. I'm resolving it manually
[04:13:58] I am resolving the incident.
[04:14:08] I have reopened T379927 and added a comment there.
[04:14:09] T379927: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927
[04:15:26] back to sleep, see you later :)
[08:52:54] thanks for handling the page!
[09:36:08] heads up, I'm about to deploy some new network settings into eqiad1 related to VXLAN and IPv6. No impact expected, so please report if you see anything weird
[10:19:26] arturo: cool, let me know when you are done
[10:19:42] the faulty link from E4 to D5 we had last week has been clean since the optic change in E4 on Friday
[10:19:43] https://phabricator.wikimedia.org/T380503#10352059
[10:19:44] topranks: thanks
[10:19:59] my feeling is to put traffic back on it, but I'll wait till you're done
[10:51:01] there seems to be a network outage in eqiad1
[10:51:10] most likely related to the IPv6 changes, I'll try to undo them
[10:51:21] that might be why toolsbeta-harbor stopped answering xd (/me was sshing)
[10:51:35] let me know if you need me to do any tests or anything
[10:51:47] ...and why functional tests are failing on toolsbeta :)
[11:16:04] yeah we didn't get anything similar
[11:16:34] the networks are all separate /21s so in theory it all should be fine, the worst I'd have expected is the *new* networks not working right for some reason
[11:18:14] tricky to know how to move forward with it right now
[11:18:49] arturo, dcaro: anyway - I will go ahead and "re-pool" that link from D5 to E4 unless anyone objects?
[11:19:01] topranks: ok for me
[11:19:02] topranks: sounds good to me
[11:19:10] ok cool, I'll monitor of course
[11:19:16] thanks!
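
(For reference, the shape of the fix above: Puppet had rendered the nginx vhost while the host's own DNS was broken, leaving an empty resolver argument, so nginx refused to start until /etc/resolv.conf was repaired and Puppet re-ran. A minimal sketch, assuming the 172.20.255.1 resolver seen on proxy-03 is also the right nameserver for this VM, and using a plain puppet agent run in place of whatever wrapper was actually invoked on the host:)

    # Symptom: config validation fails with the same [emerg] as the unit
    sudo nginx -t
    #   nginx: [emerg] invalid number of arguments in "resolver" directive
    #   in /etc/nginx/sites-enabled/catalyst-qte-wmcloud-org:19
    #
    # Line 19 on proxy-04 (broken) vs proxy-03 (healthy):
    #   resolver ;
    #   resolver 172.20.255.1;

    # Root cause (T379927): Puppet stripped the "nameserver" line from
    # /etc/resolv.conf, so the puppet run itself failed name resolution
    # and the vhost template was rendered with an empty IP. Restore it
    # by hand first (IP assumed from the healthy proxy):
    echo "nameserver 172.20.255.1" | sudo tee -a /etc/resolv.conf

    # Re-run puppet so it rewrites the vhost with a real resolver IP,
    # then validate and restart nginx
    sudo puppet agent --test
    sudo nginx -t && sudo systemctl restart nginx.service
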
[11:19:25] yeah, I don't know how to move forward at the moment, I'll need to think for a bit
[11:21:30] * dcaro lunch
[11:24:55] link looks ok still, 100kpps on it and no errors so far
[11:25:08] I'll keep an eye on it over the day and close the task if it remains clean
[12:22:40] heads up, I'm migrating the designate zone db.svc.eqiad.wmflabs to the cloudinfra project T380491
[12:22:40] T380491: Migrate "db.svc.eqiad.wmflabs." DNS zone to cloudinfra project - https://phabricator.wikimedia.org/T380491
[12:23:04] this will simplify the toolsdb DNS change I have to perform as part of the toolsdb upgrade
[12:23:41] ack
[12:27:20] the zone transfer was successful
[12:56:09] dhinus: there are pending plan changes in tofu-infra, you may have forgotten to run tofu apply
[12:56:23] I can do it myself, if that's OK
[12:56:28] sure thanks
[12:56:41] ok
[13:44:16] arturo: updating the DNS for toolsdb resulted in designate not resolving the hostname anymore
[13:44:26] maybe because I swapped a CNAME for an A record
[13:44:31] :-(
[13:44:36] horizon looks good, but dig does not resolve from the bastion
[13:44:53] how are you calling dig?
[13:44:58] is there any way to restart/refresh designate?
[13:45:04] dig tools.db.svc.wikimedia.cloud
[13:45:20] yes, with the restart cookbook
[13:45:24] great
[13:45:28] let me try that
[13:45:55] I doubt it will be that, but a restart should not cause any harm
[13:47:53] running the cookbook
[13:48:14] sudo cookbook wmcs.openstack.restart_openstack --designate --cluster-name eqiad1
[13:48:17] on my laptop I can resolve it
[13:48:19] arturo@nostromo:~ $ dig +short tools.db.svc.wikimedia.cloud
[13:48:19] 172.16.0.168
[13:48:28] interesting
[13:48:42] mmm
[13:48:44] true from my laptop as well
[13:48:46] I know what's happening
[13:48:51] the FQDN changed zone, right?
[13:49:02] well, I have a hunch
[13:49:03] no
[13:49:08] it only changed from CNAME to A
[13:49:10] same zone
[13:49:19] mmm ok
[13:50:09] the cookbook restart did not fix it
[13:50:25] dig shows SERVFAIL
[13:50:34] which likely means an invalid DNS config somehow
[13:51:09] Nov 25 13:51:02 cloudservices1005 pdns-recursor[2773294]: msg="Sending SERVFAIL during resolve" error="got a CNAME referral (from cache) that causes a loop" subsystem="syncres" level="0" prio="Notice" tid="3" ts="1732542662.251" ecs="" mtid="102916031" proto="udp" qname="tools.db.svc.wikimedia.cloud" qtype="A" remote="185.15.56.63:39724"
[13:51:13] dhinus: ^^^
[13:52:21] we are deleting the old CNAME
[13:52:26] let's see if it improves things
[13:53:20] maybe a restart of the recursor will clear the cache and that's all we need
[13:53:23] want me to do that?
[13:53:53] yes please
[13:54:01] ok, doing
[13:54:17] yes, now it works
[13:54:17] done on cloudservices1005
[13:54:25] 🎉
[13:54:32] thanks arturo
[13:55:36] happy to help
[13:56:30] (also done on cloudservices1006 for completeness)
[14:07:16] topranks: there seem to be some netbox dns changes that are not being accepted by the cookbook because of a missing PTR zone for one of the new IPv6 subnets, T380746
[14:07:16] T380746: dns: add PTR support for 2a02:ec80:a000:: - https://phabricator.wikimedia.org/T380746
[14:10:16] hahaha
[14:10:34] sorry, I shouldn't laugh... my bad dude, I shouldn't have let you walk into that one
[14:10:43] I forgot when I said add the IPs that it'd happen
[14:11:22] that's ok :-)
[14:11:47] let me get a patch ready
[14:11:55] thanks!
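
(A sketch of the debugging sequence above, for anyone hitting the same CNAME-to-A migration symptom: the authoritative side (designate) was fine, but pdns-recursor still had the old CNAME cached, and the cached referral looped against the new A record, hence the SERVFAIL. The systemd unit name below is an assumption; the log only says the recursor was restarted on cloudservices1005/1006:)

    # From the bastion: SERVFAIL while the loop is cached
    dig tools.db.svc.wikimedia.cloud
    #   ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL

    # Restarting designate alone does not help, because the problem is
    # in the recursor cache, not in the authoritative data:
    sudo cookbook wmcs.openstack.restart_openstack --designate --cluster-name eqiad1

    # After deleting the stale CNAME in designate, flush the recursor
    # cache by restarting it on both recursors (unit name assumed):
    sudo systemctl restart pdns-recursor.service   # cloudservices1005, then 1006

    # Verify
    dig +short tools.db.svc.wikimedia.cloud
    #   172.16.0.168
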
[14:12:02] I need to step out for a bit, be back later
[14:12:13] ok, I'll get it fixed up
[15:21:50] arturo: just fyi, I removed those dns names for now in netbox, CI confused me and it was blocking the magru work, I'll re-add them and fix it once the magru stuff is done
[15:23:50] topranks: ok, thanks, no problem
[15:37:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097390 quick review
[15:57:25] taavi: I think I remember some discussion, but I don't remember if anyone tried to follow up on dropping .labsdb or even tried to figure out what tools might be impacted. I have an unsupported hunch that a number of very old tools/bots/scripts would fail, but they should also be pretty easy to fix.
[15:58:20] For those unaware of the ancient history here, we deprecated .labsdb as part of the 2017 Wiki Replicas rebuild. https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/
[16:33:19] oh wow, I didn't even know about "tools.labsdb" :) should we add ".labsdb" to the list of legacy domains at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS ?
[16:36:03] dhinus: +1
[16:58:42] arturo: added
[16:59:14] I'm not sure where the records for .labsdb are stored, I don't see them in openstack
[17:00:26] thanks
[17:00:28] * arturo offline
[17:05:17] dhinus: modules/profile/files/openstack/base/pdns/recursor/labsdb.zone in ops/puppet.git
[17:28:34] has anyone ever seen a server install hang when doing mkfs?
[17:28:44] https://www.irccloud.com/pastebin/Vlg8IZtY/
[17:28:58] the mkfs stays in 'D' state (uninterruptible sleep)
[18:04:58] * dcaro off
[18:17:15] dhinus: do we still support the list-of-hostnames option with cumin on cloudcumin hosts? e.g. "cumin F{"./bustedresolv.txt"} hostname"?
[18:17:21] either we don't, or my syntax is wrong
[18:53:35] andrewbogott: good question, I have never tried that syntax, does it return an error or an empty list?
[18:54:14] if an empty list, it could be because it's matching on prod hosts instead of cloudvps hosts
[18:57:23] can you try adding "--backend openstack"? if that doesn't fix it, I would open a Phab task as it could be a bug we didn't discover when setting up cloudcumins
[18:58:26] F{} is a separate "backend" and maybe it's not enabled
[19:19:51] dhinus: T380789, not urgent.
[19:19:52] T380789: Revive the HostFile backend on cloudcuminXXXX - https://phabricator.wikimedia.org/T380789
[20:23:09] Are codfw1dev projects managed in tofu atm?
[21:08:54] Rook: yes
[21:09:07] Neat, where is that?
[21:09:28] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/OpenTofu docs https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/ repo
[21:10:21] Thanks!
[21:10:55] np
[21:13:16] perhaps Admin/tofu-infra would be a better name for the docs page
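
(On the cumin question from 18:17: per T380789, the F{} "HostFile" query grammar is a separate cumin backend that apparently wasn't enabled on the cloudcumin hosts. A hedged sketch of the two invocations discussed in the thread, with ./bustedresolv.txt standing in for any one-hostname-per-line file; whether F{} can be combined with --backend is an assumption, not confirmed in the log:)

    # HostFile backend: run "hostname" on every host listed in the file,
    # one FQDN per line
    sudo cumin 'F{./bustedresolv.txt}' 'hostname'

    # If the names match prod hosts instead of Cloud VPS ones, try
    # forcing the openstack backend as suggested in the thread:
    sudo cumin --backend openstack 'F{./bustedresolv.txt}' 'hostname'
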