[07:59:30] hello folks
[07:59:50] I am going to disable puppet on install1003 to manually set
[07:59:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/654192/5/modules/install_server/files/tftpboot/buster-installer/pxelinux.cfg/ttyS1-115200
[08:00:05] and test a pxe rescue for sretest1001, to see if it works
[08:03:13] sounds good
[08:41:07] moritzm: qq - can I restart atftpd on install1003? I suspect that new config files are not picked up
[08:46:59] sure anytime
[09:05:31] very weird, the new label is not picked up when I test it
[09:05:39] boot: rescue leads to "not found"
[09:09:37] having a look
[09:09:58] <_joe_> that part of the stack is absolute black sorcery to me, I can't really help with that
[09:11:01] there are some tftpboot config files on apt1001, lemme test if they are picked up for $weird_reasons
[09:11:50] yep now it works
[09:11:57] moritzm: --^
[09:12:06] yeah, I was about to say that
[09:12:24] is it expected?
[09:12:25] these are sourced from apt1001, install* only does DHCP
[09:13:08] ok then it definitely confuses me, atftpd is on install1003, together with /srv/tftpboot configs
[09:15:24] maybe, I need to look up the finer details, some things changed with the split of the install servers for pop sites
[09:15:32] <_joe_> I'm sure there is documentation somewhere...
[09:15:54] <_joe_> you're supposed to laugh, that was a joke (albeit a sour one)
[09:16:55] it was a joe-ke
[09:17:29] so the tftpboot files are puppetized on both apt1001 and install1003, I just tested it, so we are good consistency-wise
[09:17:40] ok
[09:17:50] aaand the rescue label works!
[09:18:13] it is not offering a fancy menu etc.. but that can be done later on if we need it
[09:18:40] going to update the code review
[09:19:44] IIRC rescue mode runs through the initial steps of the installer (so detect disks etc) and then drops to a shell, if you get that, all is fine
[09:23:21] yep yep I saw the menu for rescue, and tried to select some options etc.. (it starts asking for the language and other stuff so it is very different from the d-i regular install, easy to spot)
[09:26:06] sounds good. we can tweak this and create a custom preseed_url for the rescue mode
[09:29:54] yep definitely
[10:12:49] dcaro moritzm ok to push your puppet changes?
[10:13:20] yeah
[10:15:59] dcaro: yours too?
[10:16:48] marostegui: yep, you can go
[10:16:58] ok, done
[10:18:39] thanks!
[12:33:26] moritzm: do you happen to know for sure if profile::mail::smarthost is only used inside WMCS?
[12:33:36] puppetdb seems to agree
[12:35:46] git grep too
[12:35:50] yeah, that seems to be specifically created for wmcs and it's unused in prod
[12:35:57] ack
[12:35:58] https://phabricator.wikimedia.org/T41785 is the original task
[12:35:59] thanks!
[12:51:14] Added some info about PXE rescue to https://wikitech.wikimedia.org/wiki/Debian_installer_rescue_mode#Option_2%3A_pxe-bootable_rescue_image
[13:38:15] "urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='integration.wikimedia.org', port=443): Read timed out. (read timeout=10)". hopefully temporary
[13:43:21] sukhe: see you broke our CI :D
[13:45:59] elukey: as is tradition!
[15:58:59] came across this the other day, not played with it yet but thought some here may find it curious https://omar.website/tabfs/
[16:29:02] herron, arturo and I have made more progress with the mail exchanges and are back to the point of wondering if it's now correct or not.
[16:29:12] We can send ourselves emails from cloud VMs without it sounding alarm bells… are there more things we should test?
[16:30:16] this is why you should always define victory conditions before going to war :(
[16:31:35] yeah I'd say if mail is flowing while using the new certificate, and the queues and logs look clear it's in good shape
[16:31:57] is there monitoring of the client system mail queues in that environment? that would be a good indicator
[16:33:17] herron if I run `openssl s_client -connect localhost:25 --starttls smtp` I don't see the new cert
[16:33:39] example:
[16:33:40] https://www.irccloud.com/pastebin/jagAhVcS/
[16:34:12] herron: I don't know about monitoring; I've literally never thought about these mail exchanges before last night
[16:35:04] we have https://grafana-labs.wikimedia.org/d/HcDsu-WGk/toolforge-email-dashboard?orgId=1 andrewbogott
[16:35:40] arturo: that is consistent with the contents of mx-out01:/etc/acmecerts/mx/live/rsa-2048.chained.crt
[16:35:58] herron: so I think the problem is acme-chief is not generating the cert
[16:36:26] herron: andrewbogott: see https://phabricator.wikimedia.org/T260834#6722402
[16:36:47] acme chief cannot generate the cert because it cannot update the DNS zone
[16:37:21] hm, so something with the designate integration is broken?
[16:37:23] hmph
[16:39:43] 'Unexpected return code spawning DNS zone updater: 1' is less information than I was hoping for
[16:42:53] yeah the debuggability there is less than ideal. probably those errors should be louder in some way too, so they don't just go by without notice
[16:43:05] https://www.irccloud.com/pastebin/foi3Jb8t/
[16:43:09] that's a little better...
[16:45:31] this is going to turn out to be because the zone for the cert isn't owned by cloudinfra I bet
[16:45:59] although if that's the case it shouldn't have worked with ::integrated either I think?
[16:47:54] I think both wmcloud.org and wmflabs.org are owned by the cloudinfra tenant
[16:48:12] or should be, per our own policy?
[16:49:05] and wikimedia.cloud I guess
[16:49:15] yes
[16:49:40] the logic in that script is hard to follow, but I'm wondering if it has issues with too-deep level of subdomain in the name or something
[16:49:43] I think I need to create mx-out.wmcloud.org in cloudinfra
[16:49:45] working on that now
[16:50:01] yeah maybe that
[16:50:27] although it's probably not properly a 'zone', but maybe if there's no records at that name or below at all, it trips up the zone[0] bit
[16:50:34] err potential_zones[0]
[16:50:44] mmm makes sense, the record is something like `_acme-challenge.mx-out.wmcloud.org`
[16:57:06] I added mx-out.wmflabs.org and mx-out.wmflabs.org and mx-out01.wmflabs.org and mx-out02.wmflabs.org
[16:57:14] and now it's at least not erroring out in that step :)
[17:01:26] arturo, "openssl s_client -connect localhost:25 --starttls smtp" output looks right to me now, do you agree?
[17:01:49] wonderful :-)
[17:02:03] I agree
[17:02:47] cool, now I'm going to finally close that bug. Thank you arturo, herron, et al
[17:03:08] +1, awesome!
[17:03:24] odd that a snakeoil cert is issued in that case, but glad it's working now
[17:04:23] yeah, a noisier failure would probably be better?
[17:05:58] well, not sure, at least it still kept serving emails
[17:07:48] true
[18:23:58] herron o/ - after https://gerrit.wikimedia.org/r/c/operations/homer/public/+/654469 all the analytics nodes should be able to ship syslog to kafka logging, so traffic will probably increase.. shouldn't be a problem but if you see something weird blame me :D
[18:24:47] elukey: haha, great thx for the heads up will keep an eye
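
Note on the 16:49-16:50 exchange about `potential_zones[0]`: the sketch below is not the acme-chief or designate script discussed above. It is a minimal illustration, under assumed names, of how a zone lookup for a challenge record such as `_acme-challenge.mx-out.wmcloud.org` can walk up the parent labels, and why indexing the first candidate without a check breaks when no candidate exists. `candidate_zones`, `find_zone` and the `known_zones` set are all hypothetical.

```python
def candidate_zones(record_name):
    """Return the parent domains of record_name, most specific first.

    Hypothetical helper: the full record name and the bare TLD are skipped.
    """
    labels = record_name.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(1, len(labels) - 1)]


def find_zone(record_name, known_zones):
    """Pick the most specific known zone that could hold record_name.

    Returns None when nothing matches. Indexing potential_zones[0]
    unconditionally would raise IndexError in that case, which is the kind
    of quiet failure speculated about in the log.
    """
    potential_zones = [z for z in candidate_zones(record_name) if z in known_zones]
    return potential_zones[0] if potential_zones else None
```

Returning `None` (or raising a descriptive error) gives the caller something clearer to log than a bare non-zero return code; whether the real script behaves this way is not established by the log.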