[08:48:53] volans: i tried to behave this time and use the decom cookbook, but it betrayed me: https://phabricator.wikimedia.org/P13789
[09:05:49] kormat: that's called vengeance :-P
[09:05:51] * volans looking
[09:07:20] volans: i manually ran sre.dns.netbox, and it completed successfully this time
[09:07:50] kormat: I'll look more in depth on the why, but for now given that the VM was deleted from ganeti and since then removed from netbox automatically, running the sre.dns.netbox cookbook is enough to fix things
[09:09:17] ok :)
[09:13:30] kormat: for completeness, if it was a physical host, a re-run of the decom should have sufficed, but I need to have a deeper look to understand what was wrong in this case, might be a race
[10:14:05] akosiaris: removed an obsolete manual step from the ganeti docs: https://wikitech.wikimedia.org/w/index.php?title=Ganeti&type=revision&diff=1893787&oldid=1884161
[10:14:59] kormat: that's not yet automated
[10:15:24] volans: huh. the link it contained was a 404, and netbox contains the interfaces of the vm i just created
[10:15:26] there's a TODO to do that automatically in the ganeti->netbox sync, but not yet done
[10:15:48] https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/
[10:16:49] ah. so the link was just wrong. `.` vs `/`
[10:16:52] the diff is that your iface is still called ##PRIMARY##
[10:17:08] instead of the correct name
[10:19:03] i tried a dry-run, it failed: https://netbox.wikimedia.org/extras/scripts/results/33747/
[10:19:23] hmm, that part I am not familiar with at all, better call volans.
[10:19:30] oh he already showed up :P
[10:19:32] kormat: interesting, thanks
[10:19:39] looking/fixing
[10:19:50] you're my bug-hunter this morning apparently :D
[10:20:01] can i resign, pls?
[10:22:32] that's most likely due to the recent netbox 2.9 upgrade, looking in a minute
[10:30:20] yeah, I'd say let's remove `({dif.type})` from https://github.com/wikimedia/operations-software-netbox-extras/blob/master/customscripts/interface_automation.py#L502
[10:33:30] agree
[10:34:05] but is it only that?
[10:34:15] line 536 has type=iface_fmt too
[10:34:31] kormat will tell us very soon anyway if there are other bugs
[10:35:20] lol
[10:35:47] yeah I think it's broken too
[10:35:49] let me do a proper fix
[10:36:20] volans: 533 might be a VMInterface instead of Interface?
[10:36:57] before it was just interface
[10:37:02] and we were setting InterfaceTypeChoices.TYPE_VIRTUAL
[10:37:06] if VM
[10:37:23] apparently they changed everything and it was not caught in the tests on netbox-next, let me add it to the task
[10:40:48] and doesn't seem to be explicitly mentioned in the changelog at first grep
[10:41:32] was in next page, my bad :)
[10:42:18] "Although this change is largely transparent to the end user..."
[10:42:57] "...however any custom code which references or manipulates virtual machine interfaces will need to be updated accordingly."
[10:45:32] kormat: for now leave it as is, I've updated the task with the related info, it doesn't change anything functional, it's just for data consistency and cleanliness
[10:45:38] sorry for the trouble
[10:47:00] ok, grand :)
[11:04:06] <_joe_> kormat: don't worry, you'll find more stuff that doesn't work soon
[14:22:09] volans: after all that, i put the replacement vm in the wrong zone (eqiad.wmnet, instead of wikimedia.org), so now i need to tear it down and start again. 🤬
[14:24:45] kormat: that's called karmat :-P
[14:25:42] oww
[14:25:59] <_joe_> 🤦
[14:26:08] volans doing a kormat joke, the week starts well
[14:26:11] <_joe_> that was for riccardo's "joke"
[14:26:25] 🥀
[14:32:05] volans: good news! makevm broke in a new way: https://phabricator.wikimedia.org/P13800
[14:33:13] kormat: did sre.dns.netbox run?
[14:33:55] kormat: The given name (dborch1001.wikimedia.org) does not resolve: Name or service not known
[14:33:56] i think so. there's definitely a bunch of netbox stuff
[14:35:10] kormat: did you try to resolve dborch1001.wikimedia.org before?
[14:35:32] my guess is that the local resolvers have cached the negative answer
[14:35:33] not manually, but yes
[14:35:37] sigh.
[14:35:38] we can clear them
[14:35:59] because dig doesn't resolve
[14:36:06] but dig @nsX.w.o does
[14:36:13] kormat: https://wikitech.wikimedia.org/wiki/DNS#How_to_Remove_a_record_from_the_DNS_resolver_caches
[14:36:38] the output should tell you if it was negative cached
[14:36:49] is that something the netbox cookbook could do in future?
[14:37:20] nothing forbids adding it
[14:37:55] dns1002 had 2 negative records, if i'm reading this correctly.
[14:38:22] now.. before you re-run the makevm cookbook
[14:38:28] we need to do a small cleanup on netbox
[14:38:43] (yes this could be included in the createvm cookbook, IIRC there is a task)
[14:38:57] T272068
[14:38:58] T272068: Prevent re-using network ports when provisioning hosts in Netbox - https://phabricator.wikimedia.org/T272068
[14:39:32] kinda, not exactly but similar
[14:39:58] kormat: basically delete the 2 addresses that come from https://netbox.wikimedia.org/ipam/ip-addresses/?q=dborch1001
[14:40:35] volans: and then re-remove the negative caches i guess
[14:40:41] why?
[14:40:55] no, then just re-run the makevm
[14:40:55] volans: because ferm is trying to resolve them on a few hundred db machines
[14:41:10] no, stay with me
[14:41:22] you just delete from netbox and *not* run the sre.dns.netbox cookbook
[14:41:28] but just re-run the makevm one
[14:41:39] that will 99.9% reassign the same IPs to dborch1001
[14:41:49] so the sre.dns.netbox run will be a noop
[14:42:18] and the dns will stay happy
[14:42:29] 0.1%, how lucky do you feel?
[14:42:42] 🤠
[14:43:32] the races here are:
[14:43:44] - someone assigns an IP before you rerun
[14:44:05] - someone runs the dns.netbox cookbook before you so that the dns records get deleted and the negative caches repopulated
[14:44:18] netbox picks the first available IP, it's not random
[14:45:20] - i run out of the will to continue
[23:36:16] effie: what do you see as next steps for "scap restarts php". I would assume we need to disable automatic opcache revalidations in php.ini, right? do we want to do that at the same time, or wait? After that, I'm guessing the cronjobs will no longer be needed either since there'd be no revalidations happening anymore. Getting alerts/telemetry is still useful ofc, I don't know if those are currently separate from the restart action or if
[23:36:16] the restart happens from the same cron that informs nagios alerts.
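
Note on the 14:41-14:44 exchange: the "99.9% reassign the same IPs" claim rests on the statement that netbox picks the first available IP rather than a random one. The sketch below is a toy model of that behaviour, not Netbox's actual allocation code. The function name `first_available_ip`, the prefix and the addresses are all hypothetical (documentation range 192.0.2.0/29). It shows that freeing an address and allocating again yields the same address, unless someone else allocates in between, which is the first race listed above.

```python
import ipaddress


def first_available_ip(prefix, allocated):
    """Return the lowest host address in `prefix` that is not in `allocated`.

    Toy model of "first available" allocation; hypothetical helper,
    not part of Netbox or of any cookbook.
    """
    network = ipaddress.ip_network(prefix)
    taken = {ipaddress.ip_address(addr) for addr in allocated}
    for host in network.hosts():
        if host not in taken:
            return host
    return None  # prefix exhausted


# Hypothetical data: documentation prefix, three addresses in use.
PREFIX = "192.0.2.0/29"
allocated = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}

# The VM held 192.0.2.3; deleting its addresses frees that slot.
allocated.discard("192.0.2.3")

# Re-running the allocation picks the same address again, so DNS stays unchanged.
reassigned = first_available_ip(PREFIX, allocated)

# Race: someone else takes that address before the re-run.
allocated.add(str(reassigned))
after_race = first_available_ip(PREFIX, allocated)

print(reassigned, after_race)
```

With deterministic allocation the only way to get a different address is for the freed slot to be taken in the window between the delete and the re-run, which is why the window is kept short and sre.dns.netbox is deliberately not run in between.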