[04:02:17] bblack: I can assist installing cp3x hosts if you need it [04:03:34] yeah I was gonna bug you and/or ema about it [04:03:54] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545691/ has the data defs [04:04:14] only cp3055-60 (6 hosts, 3 from each of text and upload) exist physically yet. in theory those 6 are installable now. [04:04:47] there's the whole usual thing to manage about icinga alerts and ipsec timing and enabling the cache::nodes entry at the right time (commented out in the patch), etc... to get them basically installed up into a depooled state [04:05:20] I think we're trying to reach that state today if we can, so we can try to pool them in and depool some of the old ones and decom them (and then hopefully get another batch or two installed soon and repeat) [04:05:53] I've puppeted the installer hosts, in theory they have all the dns/dhcp data ready for it [04:06:33] ook [04:07:31] lvs situation is similar. Only lvs3006 (the new upload lvs) is physically installed yet. Will need arzhel support to set up the router side and bring it in as another secondary first to validate it before switching it into the primary role, etc. [04:08:17] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545696/ was my draft attempt to get those set up in puppet (as non-primaries initially) [04:09:30] I guess the stuff about ipsec BS only applies to text now, not upload, so only half of them :) [04:10:01] yeah, upload is ipsec free \o/ [04:10:18] anyways, if you have it in you to attack some of this, go for it. or bug ema when he's on to work on it. [04:10:36] we need to start pooling stuff in as it's available as best we can, so they can make progress on depool/decom of legacy hosts too [04:12:23] it will necessarily be a bit more aggressive than we usually go on the replacement process of swapping in new nodes and pulling out old, in terms of chash'd cache contents getting effectively-invalidated by the churn [04:12:38] as long as we don't saturate transport refilling the cache, should be ok :) [04:14:09] I can wait on ema to pool/depool stuff, but I'll try to get them up & running (on a depooled state) ASAP [04:14:18] thanks! [04:14:22] np [04:23:18] so cp3055 won't boot via PXE [04:23:32] PXE-E61: Media test failure, check cable [04:23:32] PXE-M0F: Exiting Broadcom PXE ROM. [04:23:32] Booting from Hard drive C: [04:24:15] hmm let's check the NIC BIOS.... [04:32:26] awesome [04:32:40] it's also possible I need to enable the interfaces on the switch or something dumb like that, looking [04:32:48] (did set the vlan stuff and descriptions) [04:34:12] so according to librenms it went down ~10 minutes ago: https://librenms.wikimedia.org/device/device=178/tab=port/port=19252/view=events/ [04:34:59] on the switch side, of the 9 hosts that are set up, only cp3055 has linkdown [04:35:03] I think that the Boot protocol set to None instead of PXE is the culprit [04:35:04] so the rest may work [04:35:09] ok [04:35:17] yeah... but I think that's because I got cp3055 on the broadcom BIOS right now [04:35:18] :) [04:36:16] Last flapped : 2019-10-24 12:45:14 UTC (00:10:46 ago) [04:36:40] I'm guessing the timezone is wrong on the switch heh [04:36:55] uh... 
kinda a:) [04:36:58] well the time in general [04:41:43] yeah none of the NTP associations are working from the switch either [04:41:50] minor details, X can sort it out later :) [04:42:08] I see linkup for cp3055 on the switch side now [04:42:41] no luck though [04:42:53] the NIC trying to boot is "Booting from BRCM MBA Slot 0400 v20.14.0" [04:43:03] same error as before: "PXE-E61: Media test failure, check cable" [04:43:10] hmmm [04:43:42] try the next host and see if it's going to be systemic or just that one? [04:43:53] yep [04:44:00] I'll hit cp3056 [04:44:14] maybe something simple like onboard eth vs cards and which one it's trying to use, etc [04:44:16] but 3055 is the only one marked on the phab task as ready for us to install [04:44:27] yeah... it could be as simple as that [04:44:57] I think all 6 are in the same state, pp probably just didn't want to check a billion checkboxes [04:45:48] I think usually in these boxes, some bios setting disables the onboard ethernet so that the add-in 10G card can be the "primary" [04:45:55] maybe not done on one or all of them, yet [04:45:59] BTW, librenms still shows xe-5/0/15 (cp3055) as down [04:46:17] last event: 2019-10-24 04:21:36 xe-5/0/15 ifOperStatus: up -> down [04:46:58] I got the feeling that 3056 won't boot either... [04:47:08] 2019-10-24 04:46:36 xe-5/0/16 ifOperStatus: up -> down --> cause of this [04:47:36] heh [04:47:46] so the link only goes offline when you try to PXE? :) [04:47:50] apparently [04:48:18] yeah, same issue on 3056 [04:48:24] exact same message [04:48:25] Booting from BRCM MBA Slot 0400 v20.14.0 [04:48:27] yeah 3056 port says: Last flapped : 2019-10-24 13:07:36 UTC (00:00:42 ago) [04:48:31] PXE-E61: Media test failure, check cable [04:48:31] PXE-M0F: Exiting Broadcom PXE ROM. [04:48:36] but it's back up now [04:49:04] weird [04:49:08] let me go poke around in the bios console stuff, maybe something will ring a bell [04:49:11] I'll hit 3057 [04:49:39] sure [04:53:58] bblack: do we have somewhere the MAC of the main NIC for the cp305x boxes? [04:54:25] you mean the onboard 1G we don't use? [04:54:25] cause I'm seeing 4 Broadcom ports on the BIOS... only one has the PXE boot enabled [04:54:50] I think we did dual-port 10G cards [04:54:53] nope.. the one that's actually configured and linked [04:55:04] so probably it's onboard 2x1G and card 2x10G, and the 2x1G should be disabled but aren't [04:55:07] to check that's the one with PXE enabled [04:55:24] the install_server stuff has the 10G macs, supposedly [04:55:29] oh right [04:55:31] the DHCP [04:55:31] I got them from broadcom ctrl+S on the consoles though [04:55:31] sorry [04:55:41] which one? [04:55:49] cause we got two Ctrl+S BIOS in these boxes [04:55:56] I took the first one, that's usually the first port [04:56:01] right [04:56:06] oh wait [04:56:08] the first one first port is the one with PXE enabled [04:56:16] I meant the first of the two macaddrs shown in ctrl+S [04:56:26] I also took the first ctrl+S prompt, but you're saying there's two? [04:56:31] yes [04:56:43] sec I have 3057 console going now, will figure it out [04:56:49] let me try to get a capture for you [04:57:44] ok yeah [04:58:04] so the very first Ctrl+S prompt.... 
that's a dual 1G onboard, I can tell because it identifies as BCM5720 [04:58:11] (which is a dual 1G chip) [04:58:14] oh [04:58:20] then PXE is wrongly configured at least on cp3055 [04:58:34] on all of them, because I put in the DHCP install_server data from those 1G screens :) [04:58:41] it's trying to boot from the first 1G box [04:58:44] s/box/port [04:58:49] but first, let me figure out the whole "disable the onboard" mess [04:58:55] then we can get the right macaddrs after that [04:59:32] is the mac address and setting MBA Configuration --> Boot Protocol to PXE on the right port [05:00:50] yeah F2 Bios -> Integrated Devices -> Onboard NIC1/NIC2 was enabled, set now to disabled on cp3057, let's see what that changes here [05:02:18] I'm hoping it removes the extra ctrl+S firmware thing entirely [05:02:20] so that won't boot cause the other broadcom NIC has PXE disabled [05:02:26] yeah but we can fix that [05:02:28] yup [05:02:38] I'm just trying to find the right set of steps to repro on all of them [05:03:35] there we go [05:03:50] so from the present state of all of these hosts, it's: [05:04:06] F2 Bios -> Integrated Devices -> Onboard NIC1/NIC2 -> Set to "Disable (OS)" [05:04:09] save + reboot [05:04:27] now take the very first Ctrl+S prompt, and it now shows the dual 10G card instead, with new macaddrs, where we need to set up PXE... [05:04:35] hmmm right [05:04:44] on that Ctrl+S prompt what is reporting Link Status? [05:04:49] cause on cp3055 is reporting disconnected [05:04:55] and I'm still getting the same PXE error [05:05:04] donno yet [05:05:10] even when now it's reporting to try to boot from the 10G NIC: "Booting from BRCM MBA Slot 3B00 v214.0.218.0" [05:05:26] note the 3B00 VS the 0400 I reported before [05:07:08] it's also possible pp plugged the onboards into the switch rather than the 10Gs [05:07:12] still digging [05:07:31] hmm the switch reports the mac address on the other side of the port? [05:07:37] not really [05:07:45] unless you're sending traffic, then you can kinda see [05:08:08] but we can check link speed :) [05:08:21] so the 10G BIOS reports the link status [05:08:29] yeah I see that [05:08:35] the switch says says 10G speed though [05:10:09] anyways, disabling the onboard 1G in bios is certainly *a* step that needs taking on all of these [05:10:11] right... [05:10:15] makes me wonder if any of the bios settings were done yet [05:10:26] at least on cp3055 the BIOS reports link on the second port of the 10G NIC [05:12:10] we need them two switch the cable physically I'm afraid [05:12:27] oh it does? [05:12:36] on mine I didn't see link on either, from the ctrl+s info [05:12:52] F2 --> Device Settings [05:12:56] I could see it there [05:12:57] but it makes sense with how the switch looks [05:13:11] I've disabled the Embedded NIC on cp3055 [05:13:24] going for the same on cp3056 and check the linked port there as well [05:13:51] if only someone had invented a way for the physical ports and all logical names in bios and linux to be aligned so that these mistakes never happen. [05:14:00] enpsf03isa0i3maoz0z0 [05:14:02] ahahahah [05:14:10] <3 gotta love those predictable names [05:14:19] we have the same issues every fricking time [05:14:21] *sigh* [05:15:08] in addition to cp3055-60, there's also lvs3006, ganeti3002, and dns3002 [05:15:53] (total 9 machines that are powered up and in the same rack together. 
they're probably all in the same approximate state, and therefore all have the embedded NICs turned on which need disabling, and I recorded the wrong (1G) macaddr for them all in install_server dhcp settings. [05:16:04] and then probably they all need a cable move too once EU gets back onsite [05:16:10] ack [05:16:28] I'm gonna go get some sleep :) [05:16:43] I disabled onboard on cp3057, and turned on PXE on the first 10G port [05:16:49] cool [05:16:53] I'll add new boxes on the phab task [05:17:04] and tick them as I go from server to server [05:17:25] there's rack/setup/install tasks for each of the node types [05:17:50] all can be found under the meta-task https://phabricator.wikimedia.org/T235805 [05:20:38] same thing in cp3056... link on the second port of the 10G NIC :) [06:12:13] all (available) servers done, MACs replaced on puppet and waiting for dcops to switch the ethernet calbes [06:12:15] *cables [07:35:06] ema: I guess that ats-backend needs some tuning for the new cp hosts on esams [07:35:41] right now it's complaining about sda3 [07:35:47] Oct 24 07:35:00 cp3055 traffic_manager[33827]: [Oct 24 07:35:00.688] {0x2b4de5175180} WARNING: unable to open '/dev/sda3': No such file or directory [07:35:47] Oct 24 07:35:00 cp3055 traffic_manager[33827]: [Oct 24 07:35:00.688] {0x2b4de5175180} WARNING: could not initialize storage "/dev/sda3" [file not found] [07:38:23] ema: https://gerrit.wikimedia.org/r/#/c/545706/ something like this? [07:50:30] now trafficserver is happier on cp3055 [07:59:29] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Gilles) It is indeed unusual for this to apply to specific pages of a small PDF, even moreso fo... [08:00:53] ema: could you check https://gerrit.wikimedia.org/r/c/operations/puppet/+/545711/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/545712/ ? thanks! [08:05:36] vgutierrez: yeah, storage config on cp3055 looks good [08:05:59] yey, I've replicated that for the upload hosts [08:06:06] and added the varnish storage parts for the text ones [08:06:36] also please take a look to https://gerrit.wikimedia.org/r/c/operations/puppet/+/545691 [08:06:45] that's from bbl.ack [08:07:12] I think it's sane, but you're more familiar with that [08:09:53] we could deploy the new text hosts as text_ats actually [08:10:26] why bother installing varnish on them just to reimage in a few days? [08:11:29] hmmm [08:11:32] up to you [08:11:47] but those should get prod traffic today [08:12:14] from what Brandon said before [08:13:21] right, let's not rush things then. +1 [08:13:27] merging [08:15:51] we got an error during the debian installation on cp3055, I'm imaging dns3002 to see if it's related to the cp3055 nvme driver or hw related somehow [08:16:23] oh, but 3055 seems to be alive and kicking? [08:18:05] yes [08:18:11] I acked the error [08:18:16] and it continued the installation [08:18:22] oh I see [08:18:30] what was it? [08:19:17] it crashed on the late_command.sh execution [08:19:58] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['dns3002.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reima... [08:20:44] let 's see if dns3002 now boots via PXE after fixing the FQDN... 
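A quick sanity check for the ats-be storage fix discussed above, once the patch is merged and puppet has run. This is only a sketch: the raw device path (/dev/nvme0n1) and the systemd unit name (trafficserver) are assumptions and may differ on the actual hosts; the warning string is the one quoted at 07:35.

    # Confirm ats-be now points at the raw NVMe device instead of the stale
    # /dev/sda3 entry carried over from the older hosts, and that the startup
    # warning is gone (unit name is an assumption).
    cat /etc/trafficserver/storage.config            # expect something like: /dev/nvme0n1 volume=1
    sudo journalctl -u trafficserver --since '15 min ago' \
        | grep -i 'could not initialize storage' || echo 'storage initialized cleanly'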
[08:21:09] XioNoX: so.. the lvs boxes.. the IPs are already on the DNS [08:21:25] 3006 is 10.20.0.16 && 2620:0:862:102:10:20:0:16 [08:21:45] .15 and .17 for 3005 and 3007 respectively [08:21:57] vgutierrez: cool, added 3006 [08:22:00] adding the other two [08:22:09] to the router side of bgp [08:22:17] so when you setup pybal it should come up [08:22:22] awesome [08:22:51] yey... dns3002 is booting now [08:22:52] cool [08:23:06] let me trigger lvs3006 as well [08:23:36] hmmm [08:25:13] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Gilles) It seems like the ghostscript command used by Thumbor outputs some errors to stdout tha... [08:25:43] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/545696 looks good? [08:25:53] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) [08:26:38] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) a:03Gilles [08:27:08] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) I will try looking at this in my spare time, but can't promise a... [08:28:37] vgutierrez: it does [08:28:42] cool [08:28:46] we need to fix the icinga config error [08:28:52] cause icinga is not adding new hosts [08:29:07] i.e cp3055 is not showing there [08:30:40] xionox already merged the change.. [08:30:41] XioNoX pushed a fix [08:30:44] yup [08:30:51] should recover with the next puppet run on icinga1001 [08:31:14] I'm triggering one right now [08:31:25] ack [08:31:29] I manually run it after merging too [08:31:54] but it's something more complex I think, like puppet needs to run on the host first [08:31:56] then on icinga [08:32:01] or something like that [08:32:58] hmmm the puppet run on icinga was almost a NOOP right now [08:33:03] nothing related to icinga itself [08:33:04] can I help? tl;dr of th ebacklog? [08:33:26] icinga config is broken apparently [08:33:47] Error: Could not find any hostgroup matching 'asw2-esams.mgmt.esams.wmnet' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 7534) [08:35:18] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Elitre) In the meantime, you have all my appreciation. 
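The hostgroup error above can be reproduced by hand on the Icinga host while waiting for the next puppet run; a minimal sketch, assuming the stock Icinga 1.x layout quoted in the error message.

    # Run the same config verification the daemon does on reload, then look at
    # the offending hostgroup reference in the puppet-generated objects file.
    sudo icinga -v /etc/icinga/icinga.cfg
    grep -n 'asw2-esams' /etc/icinga/objects/puppet_hosts.cfg | head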
[08:36:36] vgutierrez, XioNoX: the hostgroup is asw2-esams, not FQDN [08:37:00] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3002.wikimedia.org'] ` Of which those **FAILED**: ` ['dns3002.wikimedia.org'] ` [08:37:56] lovely :) [08:38:23] the puppet error on dns3002 is the typical Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Package[ntp] is already declared at (file: /etc/puppet/modules/standard/manifests/ntp/timesyncd.pp, line: 7); cannot redeclare (file: [08:38:23] /etc/puppet/modules/ntp/manifests/daemon.pp, line: 3) (file: /etc/puppet/modules/ntp/manifests/daemon.pp, line: 3, column: 5) (file: /etc/puppet/modules/profile/manifests/ntp.pp, line: 81) on node dns3002.wikimedia.org [08:38:46] * vgutierrez trying to remember how to fix that... [08:39:11] use require_package ? [08:39:34] :) [08:40:28] I'd say dns3002 is missing from ntp_peers hiera structure [08:40:29] who's fixing icinga? [08:41:25] vgutierrez: I've a fix in mind for that, let me bring you an example [08:41:34] uh? [08:41:41] it's a "config" missing issue [08:41:45] not a puppet code issue itself [08:42:14] isn't that gathered dynamically via puppetdb? [08:42:26] with query_nodes() [08:43:19] ah, no, it's harcoded :( [08:43:25] it should be dymanic IMHO :D [08:44:12] I'd say the culprit is https://gerrit.wikimedia.org/r/#/c/545744/ [08:45:33] yeah sure [08:46:07] btw if you tail the cumin logs during the reimage (path at the top of the output) you can see the puppet run and fix things before the timeout triggers [08:46:38] a race against the machine... [08:46:40] ;P [08:48:37] lol [08:49:37] ema: https://gerrit.wikimedia.org/r/#/c/545752/ still applies to ATS? /cc moritzm [08:49:58] asking cause ATS use the raw device instead of a filesystem [08:51:04] if not, we need to alternative change the partman recipe for the new esams caches (as they'll fail with the current late-command handling for cp hosts) [08:51:26] well... for text still applies [08:51:33] at least for a few days/weeks [08:52:33] vgutierrez: we don't need to partition the disk, no [08:53:08] but maybe it's worth it to revisit it after we migrate everything to ats-be [08:56:46] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['lvs3006.esams.wmnet'] ` The log can be found in `/var/log/wm... [08:57:52] volans: so.. what needs to be fixed on icinga regarding asw2-esams? [08:57:56] XioNoX, vgutierrez, moritzm: I've an errand to run, icinga config is till broken, that means no new hosts added, and related downtime, etc... 
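For the ntp duplicate-declaration error above, the working theory is that dns3002 is simply missing from the hard-coded ntp_peers hiera structure; a hedged way to check, assuming a local checkout of operations/puppet (the hieradata path is an assumption, ntp_peers is the key name mentioned above).

    # From the puppet repo: is the new host referenced in the ntp-related hieradata?
    git grep -n 'dns3002' -- hieradata/ | grep -i ntp
    git grep -n 'ntp_peers' -- hieradata/ | head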
[08:58:02] was just writing :) [08:58:22] cause I do need icinga [08:59:11] it's set as parents [08:59:16] not sure where in the code though [08:59:27] for cp3055 [08:59:32] hostgroups cache_upload_esams,asw2-esams.mgmt.esams.wmnet [08:59:35] parents asw2-esams.mgmt.esams.wmnet [09:00:37] this is the generated config [09:01:26] modules/monitoring/manifests/host.pp: $real_parents = $facts['lldp_parent'] [09:02:51] vgutierrez: [09:02:51] cp3055 0 ~$ sudo facter -p lldp_parent [09:02:51] asw2-esams.mgmt.esams.wmnet [09:03:06] ack [09:03:14] cp2024 0 ~$ sudo facter -p lldp_parent [09:03:14] asw-d-codfw [09:03:18] why I don't know :) [09:03:20] but that's the culprit [09:03:31] uh [09:04:14] sorry, gotta go afk for an errand for a bit [09:04:26] compare it also with https://puppetboard.wikimedia.org/fact/lldp_parent [09:05:03] cp3055 and dns3002 came up with the FQDN [09:05:36] XioNoX: that could be related to the LLDP config on asw2? [09:05:44] if needed we can create also teh hostgroup with the FQDN in hieradata/common/monitoring.yaml if we're migrating to it, but would be better to understand why it's different [09:05:54] XioNoX: iding itself as the FQDN instead of the base hostname? [09:05:59] possible [09:06:19] vgutierrez: can you run a lldpctl from an host on the old and new switch stack? [09:06:58] old one: SysName: asw-esams [09:07:17] new one: SysName: asw2-esams.mgmt.esams.wmnet [09:07:30] interesting [09:07:41] * volans errand, bbiab [09:08:18] so the LLDP config is the same on both sides [09:08:28] different junos version? [09:09:14] 14.x VS 18.x [09:09:18] (lldp told me) [09:09:33] it's possible yeah [09:10:08] let me know if there is yet anothe knob to tweak [09:10:14] er, let me check* :) [09:10:24] ack :) [09:10:28] vgutierrez: I've merged the late-command patch and ran puppet on install*, cp3055 should probably be reimaged so that it matches the 3036 and later? [09:10:46] 3056? right <3 [09:11:15] vgutierrez: fyi I think akosiaris implemented that LLDP to icinga parent feature [09:11:39] in case nothing can be done on the switch side and it needs to be fixed on the puppet side [09:12:07] so as volans mentioned we could add the FQDN as the group name on icinga [09:12:28] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3055.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... [09:12:29] moritzm: I'm triggering a reimage now [09:12:59] hmmm maybe I was too fast [09:13:10] moritzm: I need to trigger a puppet run on install1002, right? [09:13:24] https://lists.gt.net/nsp/juniper/66466 [09:14:00] I ran puppet on install* via Cumin already [09:14:11] so if you don't see anything changed in puppet output, that's fine :-) [09:14:54] oh ok [09:14:54] :) [09:15:02] vgutierrez: so from that thread, it's now the new behavior for junos to do it that way [09:15:02] thx [09:15:08] ack [09:15:18] so let's change the group name to the FQDN then? [09:15:30] wfm [09:15:46] I can also remove the domain name on the switch side [09:16:18] but it would be not standard on our side (vs. 
all other devices) [09:17:15] dunno what are the implications of the change in your side TBH [09:18:49] none as far as I know other than config differences from our standards [09:19:45] if the implcation on the puppet side are more than a variable change let's do it on the switch side, otherwise on the switch [09:20:14] bah I can't even type normal sentences :) [09:20:22] but you understand me [09:22:58] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3006.esams.wmnet'] ` and were **ALL** successful. [09:25:18] XioNoX: https://gerrit.wikimedia.org/r/#/c/545782/ [09:25:27] moritzm: are you ok with the approach? :) [09:27:22] akosiaris: ^^ [09:29:29] it seems fine to me, slowly our network gear will converge to 18.x and we need to start at some point [09:29:46] btw, cp3055 reimaged succesfully this time, thx <3 [09:30:34] vgutierrez: hm I'm also wondering if https://github.com/wikimedia/puppet/blob/847cc7412b66886b4992b3cadd2db30d9c95afd0/modules/netops/manifests/monitoring.pp#L116 should change too [09:30:47] cool :-) [09:30:57] otherwise how does it "know" that this is the parent host [09:38:59] hmm ack [09:39:02] I'll fix that [09:43:37] XioNoX: change updated [09:46:41] XioNoX: Oct 24 09:45:52 lvs3006 pybal[11502]: [bgp.BGPFactory@0x7f66d7bde488] INFO: BGP session established for ASN 64600 peer 91.198.174.244 [09:46:48] XioNoX: Oct 24 09:45:52 lvs3006 pybal[11502]: [bgp.BGPFactory@0x7f66d7c275f0] INFO: BGP session established for ASN 64600 peer 91.198.174.245 [09:46:48] nice! [09:46:53] yup [09:47:09] dns_rec monitored services are struggling though [09:58:50] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3055.esams.wmnet'] ` and were **ALL** successful. [10:02:21] Oct 24 09:59:51 lvs3006 pybal[11502]: [dns_rec_53 IdleConnection] WARN: maerlant.wikimedia.org (enabled/partially up/pooled): Connection to 91.198.174.122:53 failed. [10:02:38] ^^ XioNoX, that's ferm related or router related? [10:03:49] iptables seems fine [10:03:56] 42273 2771K ACCEPT udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 [10:04:51] vgutierrez: what's the issue? [10:05:06] lvs3006 to 91.198.174.122 ? [10:05:09] yep [10:05:27] and lvs3006 to 91.198.174.106 [10:05:30] (nescio) [10:05:35] should I be able to ssh to lvs3006? [10:05:39] yes [10:05:43] I'm in via ssh right now [10:05:45] I can't ssh [10:05:48] uh? [10:05:53] me neither [10:06:25] ping -4 does not work [10:06:28] IPv6 does [10:06:37] bast3002:~$ nc -zv lvs3006.esams.wmnet 22 stalls [10:06:41] I've just opened another session to lvs3006.esams.wmnet [10:06:44] via bast5001 [10:07:01] vgutierrez: IPv4 is broken (10.20.0.16) [10:07:54] hmmm [10:08:43] * volans back, sorry took longer than expected [10:08:59] so, the other option is to do it in facter [10:09:06] and remove the FQDN part [10:09:19] 64 bytes from lvs3006.esams.wmnet (10.20.0.16): icmp_seq=1 ttl=59 time=318 ms [10:09:28] so... it's broken from bast3002 [10:09:45] it's already a custom fact [10:09:56] vgutierrez: IPv4 is broken from maerlant as well [10:10:03] err [10:10:08] it's a issue only within esams [10:10:28] wtf? 
:) [10:11:04] bast1002 can ping lvs3006 on ipv4 and ipv6 as well [10:11:10] but bast3002 can't [10:12:28] lvs3006 can ping cp3030 just fine [10:12:51] I can send a patch for the lldp stuff [10:13:17] volans: there is a patch already with the opposite approach [10:13:34] changing the hostgroup in hiera? [10:13:37] yes [10:13:56] volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/545782 [10:14:04] ok, that works too, *but* note that all lldp facters use that value [10:14:25] lldp, lldp_neighbors, lldp_parent [10:14:36] so depending where they are used in puppet code [10:14:45] we might have other weird behaviours and breakage [10:15:33] from a quick git grep it seems used only for icinga [10:15:38] but wanted to mention it [10:16:04] hmmm lvs3006 it's on a weird state [10:16:07] let me reboot it [10:16:32] cause our lvs boxes have netfilter banned [10:16:34] right? [10:16:58] correct [10:17:06] spare --> lvs transition [10:17:55] 10Traffic, 10Operations, 10observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10fgiunchedi) I've updated the frontend-traffic dashboard to include global availability correctly, and got rid of the summed value [10:18:49] meh, we really need a role(in_setup_no_ferm) or so [10:20:42] yeah [10:20:42] so.. it isn't iptables related [10:22:28] icinga config back OK [10:23:29] akosiaris: that's the best doc so far - https://lists.gt.net/nsp/juniper/66466 [10:24:58] akosiaris: and that's the PR (problem report) behind auth https://usercontent.irccloud-cdn.com/file/JgVsIblv/Screenshot_2019-10-24%20PR1383295%20-%20Juniper%20Networks%20PR%20Search.png [10:25:22] I replied in the CR :) [10:28:12] me too! [10:29:15] XioNoX: re lvs3006 [10:29:18] vgutierrez@lvs3006:~$ ping -4 bast3002.wikimedia.org [10:29:18] PING bast3002.wikimedia.org (91.198.174.113) 56(84) bytes of data. [10:29:18] From vl100-enp5s0f0.lvs3006.esams.wmnet (91.198.174.16) icmp_seq=1 Destination Host Unreachable [10:30:21] routing issue on lvs3006 ? [10:30:30] XioNoX: thanks! [10:30:55] vgutierrez: seems like the packets don't know how to leave the machine [10:31:04] so someone actually wanted that [10:31:06] vgutierrez: check the routes and if they are associated to the proper interfaces? [10:32:05] cause XioNoX so... 91.198.174.16 it's on enp175s0f0.100 at lvs3006 [10:32:59] they wanted to add the FQDN instead of the hostname. And ok, juniper decided to implement, fine. But nope [10:33:00] While backward compatibility is a priority, these were PRs and [10:33:00] intentional fixes to move toward standard behavior, and I'm told [10:33:00] they were in the release notes as customer visible changes. [10:33:05] ffs [10:33:30] but I cannot see how's that any different from any other lvs box [10:33:37] my my. Anyway, I guess not high priority [10:34:28] XioNoX: the route table is exactly the same as the one in lvs3002 [10:34:29] sigh [10:34:33] XioNoX: I like the useless table in "Resolved in" in the PR you posted. [10:34:45] the second column named "junos" and always having an x [10:36:14] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201... 
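A compact reproduction of the mismatch being debated above, comparing the custom fact with what each switch stack advertises (the commands and values below are the ones already quoted in this log).

    # On a host behind an old stack vs. one behind asw2-esams:
    sudo facter -p lldp_parent     # cp2024 -> asw-d-codfw, cp3055 -> asw2-esams.mgmt.esams.wmnet
    sudo lldpctl | grep SysName    # old esams stack: asw-esams; newer JunOS advertises the FQDN,
                                   # hence the icinga hostgroup/parent mismatch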
[10:36:14] hmmm [10:36:24] 10:35:40.181713 ARP, Request who-has 91.198.174.113 tell 91.198.174.16, length 28 [10:37:17] akosiaris: yeah not much we can do anyway, saying "it's in the release notes" is just BS, they have thousands of pages of release notes [10:37:38] vgutierrez: only v4 doesn't work? [10:37:40] XioNoX: could be a vlan issue on the switch side? [10:37:46] yep [10:37:47] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [10:38:48] vlan issue should be all or nothing [10:38:51] but yeah looking [10:38:52] what doesn't work is IPv4 on the public vlan [10:39:01] cause I can ping cp3030 via IPv4 [10:39:16] bast3002 <--> lvs3006 goes via the public vlan though [10:40:49] so lvs3006 is only in the private vlan [10:40:58] shouldn't it be trunked? [10:41:06] with the private as native? [10:41:11] let me check other sites [10:41:25] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3057.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... [10:41:51] yeah it's that [10:41:53] fixing [10:41:54] :_) [10:41:56] akosiaris: I don't have strong preferences hostname vs fqdn here, I think in the long term we should be consistent, so either convert them to hostname in the fact, or find a knob to change either the new or the old behaviour in junos. Just waiting for the junos to get upgraded/replaced seems a bit too far down the line [10:42:25] but I might miss some other corner cases in which having the fact and what's reported by lldp differ might bite us [10:44:27] vgutierrez: it works now :) [10:44:30] Oct 24 10:44:06 lvs3006 pybal[1818]: [dns_rec_53_udp] INFO: Server maerlant.wikimedia.org (enabled/partially up/pooled) is up [10:44:31] oh yes [10:44:34] XioNoX: thx :D [10:44:35] volans: I don't particularly care about one vs the other. It's the non-overridable change in behavior that's killing me. [10:44:40] XioNoX: nice! [10:44:48] 64 bytes from bast3002.wikimedia.org (91.198.174.113): icmp_seq=1 ttl=64 time=0.124 ms [10:44:49] yeah I know [10:44:50] lovely [10:45:42] we can ofc just change the LLDP fact just a bit to achieve compatibility, but from the looks of it this is going to break for SNMP as well ? [10:46:05] not sure, need to crossref versions [10:46:28] let me fix the iface name on dns as well [10:46:46] cause of course.. ENI is SO predictable that even bbl.ack can't get it right [10:47:05] and we know that bbl.ack is always right [10:49:24] yeah I'm worried too that a fix in the lldp facts would not be enough [10:51:07] I don't think we use the hostname var from snmp [10:51:17] akosiaris: ^ [10:52:26] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3060.esams.wmnet'] ` [10:52:47] uh? [10:52:53] XioNoX: not even in librenms [10:52:54] ? [10:53:03] who knows what librenms does on the other hand ...
[10:53:03] only display it [10:53:21] oh my all my internal alarms are going to go off on that one [10:53:22] 10:52:15 | cp3060.esams.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run [10:53:27] happy times [10:53:43] imagine a list of all hosts sorted by sysName and then 1 being different [10:53:48] better call volans [10:54:25] ema: weird, cp3055 was happy [10:54:38] cp3060 is a text one though [10:54:45] maybe something is missing there [10:54:58] ema: check the cumin log (path at the top of the output), go to the bottom [10:55:03] there is the outut of the puppet run [10:55:05] less -R [10:55:09] (old issue) [10:56:30] volans: we really really need to fix the fact that cumin does not log to irc the hostname of the host it's acting on [10:56:36] as it is now the !log is useless [10:56:45] (see #-operations right now) [10:56:45] you mean spicerack [10:56:55] yes I need to go back to my patch that was blocked in review [10:57:37] volans: due to PEBKAC I've lost the output of my reimage. Where do I find logs? :) [10:58:03] ah, under /var/log/wmf-auto-reimage/ [10:58:15] yep [10:58:31] with hostname and user [10:58:38] the _cumin one [10:59:09] ah yes [10:59:18] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, cron_splay(): this host not in set (file: /etc/puppet/modules/cacheproxy/manifests/cron_restart.pp, line: 16, column: 14) on node cp3060.esams.wmnet [10:59:45] do you use query_nodes there? [11:00:03] vgutierrez: that's why the cache_upload reimage worked, there's no need to restart shit in cron there [11:00:23] ouch! [11:00:32] ema: if you use query_nodes I've the fix [11:00:42] already used elseqhere [11:02:21] volans: nope, hiera [11:02:48] :/ all hardcoded [11:06:25] volans: not for long :) [11:06:51] vgutierrez: I'm gonna add cp3060 to cache::nodes, start the reimage and go for lunch [11:07:01] so if you move that to puppetdb query_nodes, don't forget to use the trick: [11:07:04] ema: ack [11:07:08] unique(concat(query_nodes('yourquery'), [$::fqdn])) [11:07:11] cp3057 is finishing and I'm going for 3059 as well [11:07:24] after that I think I'm leaving the rest for you [11:07:34] I've been working 9h non stop now :) [11:09:37] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3059.esams.wmnet'] ` The log can be found in `... [11:09:58] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo... [11:10:09] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3060.esams.wmnet'] ` [11:10:20] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo... 
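For anyone else digging out a lost reimage output, a small sketch of the log locations mentioned above (the example filename is the cp3060 one quoted a bit later in this log; the timestamp/user/pid parts differ per run).

    # On the cumin host: newest logs first; the *_cumin.out file carries the full
    # puppet run output at the bottom (it is colorized, hence less -R).
    ls -t /var/log/wmf-auto-reimage/ | head
    less -R /var/log/wmf-auto-reimage/201910241110_ema_214430_cp3060_esams_wmnet_cumin.out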
[11:11:20] volans: thanks! [11:11:21] vgutierrez: go! [11:11:26] yeah... [11:11:29] after cp3059 ;P [11:12:02] dns3002 seems happy, lvs3006 as well (bgp with both routers) [11:12:19] cp3055 is all green on icinga as well [11:12:38] I'll leave as soon as I reach the same state for 3057 and 3059 [11:14:26] imagine you guys saw this but incase not https://cpdos.org/ [11:14:41] yeah [11:14:50] well.. at least I've seen it [11:16:35] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3057.esams.wmnet'] ` and were **ALL** successful. [11:17:58] ema: cp3060 is screaming on the ipsec checks already :_) [11:35:54] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Vgutierrez) [11:36:12] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Vgutierrez) [11:36:29] what was the lvs<->bast issue? [11:36:38] (there should be routability anyways) [11:36:56] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Vgutierrez) [11:37:14] bblack: vlan configuration on the switch side [11:37:56] yeah I get the mimssing vlan config for lvs3006 port [11:38:18] oh now I really get it, ok. so lvs3006 had the vlan trunking configured on its end, so couldn't reach bast [11:38:31] I'm still catching up on all the puppet level fixes, etc [11:38:51] so I've fixed the dns3002 FQDN on the DHCP config, the iface name for lvs3006... [11:38:59] interface_tweaks... [11:39:03] the usual suspects :) [11:39:23] what was up with interface_tweaks? [11:39:33] oh the hieradata for lvs3006 interface name being wrong [11:39:35] it was missing [11:39:41] and our puppetization doesn't like that [11:39:53] ? [11:40:15] the hieradata was there, just wrong interface name. what was missing? [11:41:11] and yeah the nvme late_command fix, I guess you reinstalled any that missed it (or manually executed what it does, one of the two) [11:41:45] yeah, I reimaged cp3055 [11:41:49] bblack: https://gerrit.wikimedia.org/r/c/operations/puppet/+/545785/1/hieradata/hosts/lvs3006.yaml [11:42:04] interface_tweaks not interfaces.yaml [11:42:34] ok [11:42:48] right, I copied from ulsfo, which has it set at the per-dc level right now since they all match heh [11:43:39] oh, I've also added the storage config for the cp hosts [11:44:35] ah yeah, good catch! [11:45:14] so.. as soon as cp3059 is happy I'm out of here [11:45:32] ema was fighting a little bit with cp3060 and it's on his lunch break now [11:45:35] oh [11:45:37] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3059.esams.wmnet'] ` and were **ALL** successful. [11:45:43] you added the storage_parts I guess, but not the sizing? [11:46:04] hmmm [11:46:05] will fix that bit [11:46:58] oh.. I think I know what you mean.. nope, I didn't add it [11:47:52] yeah https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545820/ [11:48:23] I think it only actually affects the varnish-be storage params on the CLI executed by systemd, so restarting the varnish-be process after the change is pushed should fix it [11:48:44] cool.. 
that will be necessary only on cp3060 so far [11:48:53] BTW, papaul reported on -traffic a SSD issue on cp3056 [11:48:59] s/-traffic/-dcops/ [11:49:26] that sucks, because I doubt we can get a replacement in this week [11:49:35] but we can live with a dead node initially :) [11:50:06] one last thing: any confirm on whether bios setup was done at all and just the nic-disable missed, or if we really need to go do all the bios settings? [11:50:23] hmmm nope [11:50:29] ok I'll ask in a bit [11:50:31] I just reported what we did on -dcops [11:50:35] but nothing else [11:50:40] ok thanks! [11:51:06] so.. TL;DR: cp305[5,7,9] are happy, dns3002 and lvs3006 too [11:51:32] did arzhel already do the router sides for dns3002/lvs3006? [11:51:48] so for lvs3006 he did at least the bgp parts [11:51:48] lvs3006 yes [11:51:53] cause pybal is reporting as expected [11:51:57] ok awesome [11:54:21] yeah.. for some reason wmf-auto-reimage is failing to downtime the new hosts [11:55:51] hmmm [11:56:03] I assume none of these new cps are pooled yet [11:56:07] nope [11:56:23] adding BGP for dns2002 [11:56:26] er, 3002 [11:57:29] technically the disks are slightly different from eqiad, too (Dell stopped selling the previous PM1725a, now it's PM1725b), so I'm digging around on cp3055 right now to confirm the attributes of it, etc [11:57:42] ack [11:57:45] I'm off for dinner [11:57:52] ping me if you need something from my side :) [11:58:04] thakns for all the help! [11:58:07] np! [11:58:43] err I guess 3056, that one is text :) [11:58:51] I can check the missed downtimes if needed (back from lunch, didn't read full backlog yet) [11:58:55] oh its the bad one [12:00:36] none of the text nodes are avail yet except cp3060 which I think is still in reimage script, so will have to wait on the sizing confirm, etc [12:02:11] XioNoX: other random thing from last night, asw2-esams couldn't reach any NTP servers and had the wrong clock time [12:02:41] ah, thx for the head's up, not urgent adding it to my list [12:06:55] ema: you're on the cp3060 reimage script I think? it looks like it might be getting into a bad state, but maybe it will recover, who knows [12:06:58] spewing kernel:[ 1453.787072] NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! 
[charon:1226] [12:07:07] and loadavg is at 82 and climbing, during the puppet run [12:08:50] that's the puppet run after the reboot after the installation [12:09:12] see /var/log/wmf-auto-reimage/201910241110_ema_214430_cp3060_esams_wmnet_cumin.out on cumin1001 [12:09:19] sorry, the one without _cumin [12:09:23] 201910241110_ema_214430_cp3060_esams_wmnet.log [12:18:43] volans: random unimportant finding for later: [12:18:44] 2019-10-24 11:48:28 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 5.0 minutes [12:18:48] 2019-10-24 11:55:43 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 10.0 minutes [12:18:51] 2019-10-24 12:02:58 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 15.0 minutes [12:18:54] 2019-10-24 12:10:13 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 20.0 minutes [12:18:57] 2019-10-24 12:17:29 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 25.0 minutes [12:19:03] ^ "still waiting" claims 5 minute intervals in the message, but the stamps are ~ 7m15s apart [12:19:19] lol [12:20:58] technically correct [12:21:12] "after" 5 minutes :) [12:21:20] I'm wondering if the check is so slow that it adds time in this case [12:21:56] well kinda technically correct [12:22:39] the consistency of the 7m15s thing is odd. it's probably a 5 minute actual sleep, followed by a ~2m timeout checking things again, or something [12:22:47] I still can't ssh after 30s [12:22:50] kernel:[ 2459.419806] NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [charon:1226] [12:23:04] yeah probably it has hardware issues [12:23:13] good start! [12:23:15] but it's also possible it's just a temporary glitch, could try killing ipsec [12:23:19] sounds like it, might be a cpu not well connected [12:23:20] I've managed to ssh [12:23:24] (or re-rebooting) [12:23:36] only charon (ipsec) has been in those msgs, that I've seen [12:23:45] yeah charon is using 100% cpu [12:23:48] trying to stop it [12:24:07] the check itself takes 0.6s but cumin does it via ssh [12:24:13] that might explain the added time [12:26:27] no luck, rebooting [12:29:04] the host looks fine now, puppet is doing its puppeting [12:29:19] https://etherpad.wikimedia.org/p/esams-followup [12:29:33] ^ started that, from our perspective not being onsite, things to remember as we go [12:29:52] ack [12:31:45] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` and were **ALL** successful. [12:31:46] 3060 stabilized? [12:31:51] maybe just awful ipsec [12:32:01] yeah, it came up fine and puppet did its thing [12:32:20] ema: it's the first text node that's avail, I need to go stare at the nvme config on it to see if it needs tweaking [12:32:34] ok [12:32:38] are the new servers in service yet? [12:32:41] sorry, haven't been following that [12:32:53] maybe I should say "already" :) [12:33:42] mark: WIP :) [12:34:12] cool :) [12:34:50] we don't have many of them yet :) [12:35:07] (and 1/6 cp's may have initial hw issues so far) [12:36:01] bblack: an important thing to keep in mind is that due to recent changes to confctl/etcd, default weights do not exist anymore [12:36:12] so we manually have to set the weight for ats/varnish-be to 100 [12:36:22] wat? [12:36:39] weights really "don't exist"? [12:36:52] where do we manually set them?
[12:37:17] we need to do things like: [12:37:18] sudo -i confctl select name=cp4029.ulsfo.wmnet,service=ats-be set/weight=100 [12:37:44] oh weights still exist, they're just not provisioned initially? [12:37:48] correct [12:37:59] ok :) [12:38:05] still kinda crazy, but ok [12:38:05] "default weights" are gone [12:38:12] why? [12:38:17] not that this is the time I guess [12:38:28] or, rather the chance to set them to something different than 0 [12:38:45] I'm done with 3060, all looks good on the nvme stuff [12:39:05] (there's like 10MB more room, but that's too small to matter to our sizing param, and the nvme formatting worked fine too) [12:39:16] bblack: d9f83487c6 is the commit that introduced the change, I'm not 100% clear on the rationale [12:41:19] I don't think we've ever really used weights anyways between varnish-fe<->be? [12:42:30] we have sometimes during past transitions between weak nodes and much stronger nodes (e.g. the initial transition away from the old amssq caches in esams) [12:42:45] but I don't think we've ever used them usefully outside of a transition [12:43:21] weights also currently make critical differences at the IPVS layer for the front edge [12:43:55] (they map to weights in ipvs, and with our hash balancer it has a direct impact on hashing. e.g. if you change all the frontends in a dc from weight=1 to weight=2, it may rehash all clients) [12:44:14] (also, if the sum of all ipvs front weights in a cluster+dc exceed 255 it won't work right with ipvs hash, last I checked) [12:44:21] (that's why they're small values now) [12:45:35] oh very good point [12:45:46] so we need to manually set the -fe weights to 1 too :( [12:46:17] yeah [12:46:44] there's 2x fe services, the 'varnish-fe' and 'nginx' (which is sometimes ats-tls now, but keeps the name heh) [12:47:57] XioNoX: any general status update on e.g. if we're likely to get more hosts (racks 14 or 16) today? [12:48:14] or blocked on other dependencies? [12:50:48] _joe_, cdanis: any chance to re-introduce default weights for confd services? Setting them by hand isn't great [12:51:58] <_joe_> the default weight to zero makes sense in a lot of ways, the idea being a newly-created object is uninitialized [12:52:18] yeah but if we know all our cluster's real default for live nodes is 100 [12:52:24] <_joe_> so I am quite against reintroducing per-service defaults. They proved harmful in a lot of cases [12:52:31] it means every new cluster member, we have to explicitly set it to 100 and remember that it should be 100 by looking at the others [12:52:37] <_joe_> bblack: it's also that the node is pooled [12:53:07] what was harmful about a default value for a cluster? [12:53:37] <_joe_> that in some cases, people misused the default values to set nodes to pooled=true for example [12:53:52] <_joe_> or to weights that had little to do with what they actually used in production [12:53:59] doing this manually is a lot of additional toil, and error prone really [12:54:03] <_joe_> bblack: you are telling me you don't really use the weight [12:54:26] <_joe_> ema: so you automatically pool your nodes in some way? [12:54:45] no, I'm telling you we have default weights that work for us 99.999% of the time, but we do occasionaly like to have the capability to change them. [12:54:49] <_joe_> because they were depooled by default before [12:54:59] so we like having weights, and we like having them default to their standard values [12:55:00] <_joe_> ema: correct? 
[12:55:18] _joe_: not automatically, we run 'pool' on the host usually [12:55:22] <_joe_> ok [12:55:33] I can understand blocking out the capability to default pooled=true (we like default false and that's the only sane value really) [12:55:40] <_joe_> so you can add a script that instead of just pooling, also modifies the weight [12:55:47] but having the weights pre-set per cluster seems nice [12:55:59] <_joe_> it's one confctl command [12:56:04] _joe_: yes, so now we need to store the weights somewhere so the script knows what to set them to per cluster... [12:56:10] if only there were a place that already had those defaults [12:56:37] <_joe_> bblack: I am saying you can call [12:56:42] <_joe_> pool-set-weight 100 [12:56:51] <_joe_> for instance [12:57:48] anyways, it's not worth the drama. I think preventing anyone from setting a default of pooled=true is an improvement. I don't agree that removing the ability to set cluster default weight was a good idea. [12:57:51] <_joe_> but I sense you're really irritated and we're not really doing progress here. Default weights can be reintroduced for cache nodes in a variety of ways, none of which will require a ton of effort, either [12:57:52] we'll deal! [12:58:55] <_joe_> but I'd like it to be a special case [12:59:26] fwiw removing service objects / defaults was a good deal of simplification to the conftool code, is part of the reason why it's gone [12:59:45] but yeah I hear you it's annoying [12:59:55] <_joe_> the "services" tree was just static data that ended up in conftool for no good reason, more or less, because back then I hoped we could use those to dynamically configure pybal [13:00:03] it's also the only place we codified what our preferred defaults were [13:00:25] e.g. that the 'nginx' services defaulted to 1, not 100. the numbers aren't abstract in that case. [13:00:41] (they have to add up to less than 255 for the nginx service cluster or ipvs breaks) [13:02:19] (to be clear: for the hashing director, not the roundrobin you have elsewhere, don't be scared) [13:02:21] <_joe_> ok, the default to 0 means the node is uninitialized, that was the logic. But we can find an easy way to deal with it [13:02:30] <_joe_> bblack: sure I am aware :) [13:03:56] <_joe_> now, would having this in - say - puppet code and having a script called "initialize" or some other name to call instead of "pool"? [13:04:00] <_joe_> would that work? [13:04:24] at the risk of being very naive -- could this be solved with a few lines of shell script that just does confctl select service=varnish-fe,name=$FQDN set/weight=100; confctl select service=nginx,name=$FQDN set/weight=1 [13:05:13] yes, it can be solved a number of ways :) [13:05:25] but it's varnish-fe=1, nginx=1, varnish-be=100, ats-be=100 [13:05:35] nod [13:05:49] which we need to remember now and store somewhere else, whereas it was right there in the defaults hieradata before :P [13:06:31] <_joe_> cdanis: that's exactly what I was proposing heh [13:06:36] 10Traffic, 10MediaWiki-REST-API, 10Operations, 10Parsoid-PHP, and 3 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10mobrovac) [13:06:47] it's just a point of annoyance in the midst of an annoying span of days [13:06:52] yeah I hear that [13:07:14] <_joe_> to write a script that does that and reads the default values somewhere in hiera [13:08:24] did all of that data just go away with the change? [13:08:37] (the per-site/cluster stuff in conftool-data?) 
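A consolidated sketch of the "pool-set-weight"-style helper floated above, using the per-cluster defaults quoted in this conversation (varnish-fe=1, nginx=1, varnish-be=100, ats-be=100). This is not an existing script, just an illustration of the idea; pass only the services that actually exist for the host's cluster.

    #!/bin/bash
    # init-weights: set conftool weights for a new cache node to the cluster
    # defaults. Usage: init-weights <fqdn> <service>...
    #   e.g. init-weights cp3055.esams.wmnet varnish-fe ats-be nginx
    set -euo pipefail
    fqdn="$1"; shift
    declare -A weights=( [varnish-fe]=1 [nginx]=1 [varnish-be]=100 [ats-be]=100 )
    for svc in "$@"; do
        sudo -i confctl select "name=${fqdn},service=${svc}" set/weight="${weights[$svc]}"
    done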
[13:09:08] I'm just now reading through the new stuff for real [13:09:29] <_joe_> yes, it was meaningless for everyone else so we just removed it. I frankly had no idea it was valuable to someone [13:09:52] <_joe_> actually I had to repeatedly explain how it worked to people and almost everyone found it confusing [13:09:55] by looking at the current state in etcd I've noticed that we have two mw hosts with nginx pooled and weight 0 FTR, which should be fine as RR does not use them but is inconsistent with all others [13:10:08] mw2150 and mw2151 [13:10:13] so there's no service objects now, I see [13:10:26] yeah, the service objects were mostly living in etcd for no reason [13:10:43] and adding some special cases to the conftool code itself [13:10:57] but in hieradata I assumed (maybe I was always wrong) they *defined* the set of services and clusters and which DCs they existed in, etc, for e.g. error-checking the per-node entries? [13:11:07] now if I typo a per-service node name we get a brand-new service? [13:11:23] <_joe_> you mean in conftool-data? [13:11:27] yeah [13:11:58] esams: [13:11:58] cache_text: [13:11:58] cp3030.esams.wmnet: [varnish-fe, varnish-be, nginx] [13:12:00] <_joe_> yes, but you did before as well, but you would get a warning [13:12:18] ^ this seems to be the only layer of data that exists now, with the still duplicate entries for all the nodes with matching services, etc [13:13:43] (which is fine, they should be customizable for transitions anyways) [13:14:16] just the data model doesn't sit well with my brain, the idea that services and clusters aren't first-class notions, just arbitrary tags added to arbitrary groups of nodes [13:14:20] maybe it's just me though [13:17:35] the current state is more an artifact of conftool's implementation than anything else, but they weren't really first-class notions before -- it wasn't like the stuff in conftool-data had anything to do with the stuff in hieradata for LVS, for instance, except that the same name had to be used both places [13:17:46] I think in some ideal world that wouldn't be a separate definition [13:18:07] yeah [13:18:36] I guess the illusion was comforting before (even just within conftool-data/ , that there was a definition of service/cluster/dc -level stuff) [13:18:42] and the warnings I guess [13:19:04] the warnings are a bit of a loss, yes [13:24:45] ema: I fixed all the new-esams weights, didn't pool anything [13:25:41] I'll keep all this in mind, there are several things I'd like to change in a hypothetical conftool v2 [13:26:03] <_joe_> cdanis: removing conftool-data would be a start [13:27:22] bblack: ack [13:28:05] bblack: anything else you wanted to check on cp3060 before pooling it? [13:30:08] I don't think so [13:32:15] ema: I'm guessing the upload nodes, we can start pooling in too [13:32:38] but I'm not up to speed on everything with ats transition, etc [13:33:14] I'm gonna repool dns as well, so we have some traffic warming any new caches as they pool in [13:33:20] and we can start pulling out legacy caches as they warm up too [13:34:24] bblack: maybe let's first DNS repool esams and then pool the new caches one at a time to check if things are ok? 
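The per-host workflow being coordinated here then reduces to something like the following (a sketch; 'pool'/'depool' are the host-local helpers mentioned earlier in the discussion, and the confctl 'get' is just to double-check the manually-set weights before sending traffic).

    # New host, once it is in cache::nodes and the weights have been set:
    sudo -i confctl select 'name=cp3060.esams.wmnet' get   # verify weights, still depooled
    sudo pool       # run on the new host to pool it
    # Matching legacy host being replaced:
    sudo depool     # run on the old host before decom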
[13:34:44] yeah that [13:35:59] in upload the old caches outnumber the new, and in text the opposite, so it won't be perfect 1:1 swaps, but whatever [13:36:14] we'll approximately pool one depool one, etc and make up the diffs somewhere at the end [13:37:18] should look at netbox for info on which to depool/decom first too [13:37:29] (so that it's an easy physical set freed up first) [13:38:31] the way the old CP nodes are set up: cp304x are all in one rack, and cp303x are in another [13:39:42] so we should do all our decoms from one set first until it's empty [13:39:56] pp has no pref, so let's start with cp303x for decoms [13:44:30] so we still have quite a few reimages to go through [13:45:14] bblack: I'll proceed with cp3056 [13:47:09] bblack: also, should we maybe just add all new hosts to cache::nodes in one go and ACK the alerts on icinga? [13:47:46] the IPsec alerts, that is [13:50:16] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/lo... [13:54:06] btw.. let me know which hosts are ready to be installed by the end of the day and I'll get to it in the APAC morning :) [13:54:10] I'm about to go into a meeting, but please let me know if more downtime fail, in particular if you get any output that I might not have in the logs [13:55:57] ema: yes, can do that [13:56:06] right, text reimages fail if the host isn't in cache::nodes [13:56:21] yeah [13:58:40] alright, pooling cp3060 while 3056 reimages [13:59:09] 3056 SSD is healthy then? [14:00:10] I think pp said the nvme might be bad [14:00:19] so it may reimagine and come up otherwise, just without be cache storage [14:00:40] *reimage [14:00:52] cp3060 serving prod traffic [14:01:00] \o/ [14:01:34] FYI I'm asking about whether we should just stay depooled (if it speeds up onsite work), in which case the repool plan may change [14:01:52] well I guess it won't change much in practice here for what we're doing right now, but still [14:02:15] (we'd still swap nodes, just without traffic flowing, and have to deal with a cold repool of the site later) [14:02:34] I'm gonna go poke at the other non-cp nodes in a bit [14:02:49] (get lvs3006 into primary service, spin up the ganeti node, etc) [14:09:07] bblack: sanity check please? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545857 [14:11:41] XioNoX: knams issue known/expected? [14:12:31] ema: +1... I think it will also affect e.g. cache director lists, but thats ok too as long as the others stay depooled/inactive in confctl [14:13:38] XioNoX: yeah the knams thing - we did repool earlierm and we do have isolated user complaints of blips of reachability to esams, I assume it's the knams links causing that... [14:16:09] mmh cp3056 isn't rebooting into the installer. Trying a power-cycle [14:16:37] ema: 3056 is the one pp said had an orange light on the panel and reporting some kind of SSD (or nvme?) failure [14:16:44] we may have to just skip it for now [14:18:32] ack, trying 3058 then [14:19:11] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/lo... 
[14:19:46] sweet
[14:19:54] cp3056 is stuck at "Booting from Hard drive C:"
[14:20:08] sounds like no PXE
[14:24:11] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 16: cp3061: xe-6/0/15, cp3062: xe-6/0/16, cp3063: xe-6/0/17, cp3064: xe-6/0/18, cp3065: xe-6/0/19
[14:30:08] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3007 switch information: xe-6/0/12
[14:31:32] 3058 is installing, happy times
[14:32:32] on the ticket updates pp is making above, I still need to do the switch side for those too
[14:32:52] and basic dns, and macaddrs
[14:33:28] will finish LVS juggling first though :)
[14:39:16] mmh it looks like we're missing 3062 and 3064's mgmt IPs
[14:39:34] are any of the new set mgmt in dns yet?
[14:40:26] yeah, all the ones we've reimaged so far (wmf-auto-reimage does not work otherwise)
[14:40:35] well yeah
[14:40:57] there were only 9 hosts installed yesterday (cp3055-3060, dns3002, ganeti3002, lvs3006)
[14:41:05] pp is just now finishing up the next batch physically
[14:41:20] probably hasn't committed e.g. mgmt dns yet
[14:42:13] ah
[14:42:21] the next batch will be cp3061-5, lvs3007, bast3004, and ganeti3003
[14:42:41] then the final batch to come will be cp3050-54, lvs3005, dns3001, ganeti3001
[14:54:53] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3058.esams.wmnet'] ` and were **ALL** successful.
[14:58:24] bblack: looking for knams
[15:00:21] bblack: 3058 is up and running, all hosts in the cp3055-3060 range are reimaged
[15:00:29] except for 3056 (SSD issues)
[15:02:48] ema: can you start pooling them (well, the 5 we have) into clusters and pooling out legacy ones as you go? we'll probably have to be a little more aggressive than ideal, so long as nothing falls apart from miss-load
[15:05:52] lvs3006 is now primary for esams upload (etc), 3002/4 are both non-primaries for it. will wait to decom any of these until we get at least 3007 online.
[15:07:59] bblack: sure
[15:11:20] XioNoX: dns3002 - seems to be configured on the router side and seeing some routes to it in the list for 10.3.0.1, but on dns3002 itself I'm not seeing any queries arrive for 10.3.0.1 at all
[15:11:25] so something's not quite right, maybe
[15:12:06] XioNoX: nevermind, cancel that comment, eventually some traffic showed up
[15:12:11] edges don't spam recdns like the core sites do :)
[15:22:17] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH) p:05Triage→03Normal
[15:22:33] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH)
[16:24:24] text nodes pooled: cp3058 and cp3060, depooled cp3030 and cp3032. Hitrate just fine
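For the dns3002 check above (no queries seen for 10.3.0.1 at first), a quick way to watch for recursive-DNS traffic landing on the anycast address; the interface name is an assumption based on the NIC naming mentioned later in the log:

    # On dns3002: watch for client queries to the recdns anycast address:
    sudo tcpdump -ni enp175s0f0 -c 20 'dst host 10.3.0.1 and udp port 53'
    # And confirm the resolver actually answers on that address locally:
    dig +short @10.3.0.1 en.wikipedia.org A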
[16:24:57] upload is affected by (de)pools more significantly, I'm going slower there
[16:25:18] here's the effect of upload (de)pools on swift: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=10&fullscreen&orgId=1&from=1571928114204&to=1571934127114
[16:29:18] upload nodes pooled: cp3055 and cp3057, depooled cp3034 and cp3035
[16:30:23] I'm waiting for the upload hitrate to recover a bit and then I'll pool cp3059, depool cp3035 and call it a day
[16:31:06] err, s/depool cp3035/depool cp3036/ ^
[16:55:04] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3007.esams.wmnet'] ` The log can be found in `/var/log/wmf-au...
[17:27:45] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] ` and were **ALL** successful.
[17:35:06] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) merging in duplicate ticket T236409 where i started OS install
[17:35:32] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[17:36:38] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[17:37:25] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) confirmed mgmt and production DNS exists, mgmt password is set, IPMI over LAN working.. started OS install
[17:39:57] ema: while you're still here, what's the current-best magic for managing the reimage/post-reimage stuff on cps?
[17:40:10] (I imagine you have some commandline that disables service X and runs puppet 3 times or whatever)
[17:40:47] bblack: it should just be a matter of running puppet a couple of times now!
[17:41:07] no further magic involved
[17:41:26] alright, I've pooled 3055, 57, 58, 59 and 60
[17:41:35] depooled 3030, 32, 34, 35, and 36
[17:42:26] I've gotta go afk, cya!
[17:43:05] ok cool
[17:58:12] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ` [bast3004:~] $ gen_fingerprints +---------+---------+-----------------------------------------------------+ | Cipher | Algo | Fingerprint |...
[18:00:16] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[18:23:02] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[18:23:15] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3005.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191024182...
[18:24:20] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns3001.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reimage/2...
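The "just run puppet a couple of times" step above, sketched as it might be driven from a cumin host; running it remotely via cumin rather than logging into the node is an assumption here:

    # Two post-reimage puppet passes on a freshly installed node:
    sudo cumin 'cp3058.esams.wmnet' 'puppet agent -t'
    sudo cumin 'cp3058.esams.wmnet' 'puppet agent -t'
    # Note: 'puppet agent -t' exits 2 when it applied changes, so cumin may
    # flag the run as failed even though it succeeded.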
[18:36:00] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3061.esams.wmnet', 'cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3064.e...
[18:36:02] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3053.e...
[18:52:45] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] ` and were **ALL** successful.
[18:55:56] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[19:30:06] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3001.wikimedia.org'] ` and were **ALL** successful.
[19:36:16] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3061.esams.wmnet', 'cp3064.esams.wmnet', 'cp3065.esams.wmnet'] ` and were **A...
[19:50:59] bblack: I checked asw2-esams and couldn't see any issue with the clock or NTP servers. Could you double check when you have some time?
[19:51:07] maybe we're not looking at the same things
[19:51:28] or it took the box a long time to get a proper "fix"
[19:53:39] yeah will look again
[19:53:50] cp3054 is having the same charon CPU stuck problem we saw on another host earlier
[19:53:56] but it seems to be persistent in this case!
[19:59:20] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[19:59:25] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Dzahn)
[20:04:05] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3053.esams.wmnet', 'cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3054.esams.wmnet'] ` and were **A...
[20:07:51] XioNoX: I assume the dead BFD between cr2-esams and cr2-knams is a known issue too
[20:08:32] bblack: yeah, ospf is up, still trying to figure out what's going on with bfd
[20:08:35] ok
[20:09:01] and yeah NTP on asw2-esams looks fine now
[20:11:59] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ^ made active bastion host with the global firewall change above created wikitech pages https://wikitech.wikimedia.org/wiki/Bast3004 https://wikitech.wikimedia.org/w...
[20:12:27] cool!
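For the stuck-charon symptom on cp3054 noted above, a reasonable first look might be something like this (charon is strongSwan's IKE daemon; the systemd unit name is an assumption based on stock Debian packaging):

    # Is charon actually burning CPU?
    top -b -n 1 | grep -i charon
    # strongSwan's own view of the tunnels / security associations:
    sudo ipsec statusall | head -n 40
    # Unit state; restarting is a last resort since it drops the IPsec SAs:
    sudo systemctl status strongswan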
[20:12:49] next up (after a short break) I'm gonna finish re-arranging all the esams LVS so that 3001-4 can be decommed
[20:13:03] I've already checked the router side for the new ones, and will clean up the old there as I go
[20:13:13] 3006 is already live as a primary for high-traffic2
[20:13:41] and ditto for cleaning up and decomming the old recdns boxes, etc
[20:13:58] although may have to edit ntp settings on network gear in esams first
[20:14:24] and then there's all the cp3 depool/repool stuff to go
[20:15:30] you can now edit your ssh config and replace bast3002 with bast3004. next will be to make it the install_server in DHCP and test if installing from it works
[20:17:20] also: changing smokeping target (bast3002->bast3004)
[20:17:53] oh yeah
[20:18:01] I guess I can re-reimage one of the ganetis to test it
[20:19:29] we have to rsync all the data first
[20:19:39] will upload a change for that
[20:20:21] well, "all the data": tftpboot = 1.8G, prometheus = 36G
[20:20:44] prometheus data will have to go to a new VM
[20:21:23] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul)
[20:22:52] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Papaul)
[20:24:10] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul)
[20:24:45] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) a:05Papaul→03Dzahn
[20:25:04] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) service "bastion host" is ready but service "tftp" still needs to be migrated. taking it.
[20:25:59] mutante: why a new VM?
[20:26:22] bblack: apparently prometheus will move away from bastion to a VM
[20:26:54] well, eventually
[20:26:55] https://phabricator.wikimedia.org/T236329#5601691
[20:27:01] we don't have VMs in esams yet though
[20:27:10] i got that from this.. was wondering if i need to copy data from bast to bast
[20:27:20] yes I think bast to bast for now
[20:27:24] so the decom bast3002 task needs to wait
[20:27:36] or we need to copy it to bast3004 as a temp location
[20:27:44] ^ that
[20:27:54] the timeline for having ganeti ready in esams is indefinite at this point
[20:28:22] (and we're not using ganeti at ulsfo or eqsin either, I believe we copied data bast->bast when ulsfo bast was replaced)
[20:30:46] bblack: quick question, are all the server-side ports named the same? enp175s0f0? (plus the sub-interface for the LVS?)
[20:31:44] I'm not sure yet for every machine, but can check
[20:34:08] yea, i copied bast->bast in the past afair
[20:34:10] XioNoX: the new cp machines have enp59s0f0, and all the other new machines have enp175s0f0
[20:34:18] ok, thx!
[20:34:36] it's a fancy world here in 2019. machines with 175 PCIe busses
[20:35:11] well I guess p is port not bus, but whatever
[20:35:18] this machine doesn't have 175 anythings :P
[20:37:10] packet_write_wait: Connection to 208.80.153.54 port 22: Broken pipe
[20:37:27] been getting an abnormal amount of those today while working on esams, but donno if it's esams or my home stuff :P
[20:38:22] keep a mtr running in the background maybe?
[20:38:40] hmm.. few people besides you use the codfw bastion i think
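Since a background mtr was just suggested for the broken-pipe issue, a sketch of a report-mode run against the IP from the error message (standard mtr options; whether the loss is at home or near esams is exactly what this would help show):

    # 100-cycle report toward the host the ssh sessions keep dropping to:
    mtr -rwbzc 100 208.80.153.54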
[20:39:38] yeah could be that too
[20:39:43] maybe worth trying to use bast3004 for esams now
[20:41:10] restoring abandoned bastionhost::migration role from the depths of Gerrit
[21:46:43] 10Traffic, 10Operations, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green), 10Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (10WDoranWMF) @BBlack @ema would you have any time to review Petr's patch above?
[21:51:16] cache clusters status:
[21:51:37] upload: still needs pooling: 61, 63, 65; still needs depooling: 45, 46, 47, 49
[21:53:12] text: still needs pooling: 52, 62, 64; still needs depooling: 41, 42, 43 (and early hw fail on 56, explicitly depooled and not installed yet)
[21:53:44] but hitrate is being slow to recover from the recent ones, will circle back to more movement here in some hours from now (esams will be lower-load then too)
[21:54:28] lvs status:
[21:54:42] lvs300[567] are the new lvs cluster doing all the things, lvs3001-4 are being reimaged to spare now
[21:55:07] dns: dns300[12] are active and participating, but so are the old hosts too, need to clean those up later this evening (incl on the router/switch side)
[21:55:33] running out to deal with life, be back in a while
[21:58:31] wouldn't it be quicker to just decom the old ones instead of reimaging to spare?
[21:58:52] I'm assuming they will be decomm'ed, but I might be wrong :)
[21:59:36] they will be tomorrow, but I don't even have a decom ticket for them and there's a process to it. the reimage to spare is just to make sure they can't run pybal and interfere with routing accidentally or whatever.
[22:00:38] yeah the process has been changed and simplified a lot recently
[22:00:48] the decom cookbook takes care of making them unbootable
[22:00:56] and powering them down
[22:01:42] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission
[22:01:50] and https://wikitech.wikimedia.org/wiki/Decom_script
[22:04:13] you can go to https://phabricator.wikimedia.org/project/profile/3364/ and click "File Decommision Request" on the left. just fill in the FQDN
[22:04:24] creates a ticket from a template
[22:05:37] but also existing decom tickets linked to https://phabricator.wikimedia.org/T235805
[22:06:35] like https://phabricator.wikimedia.org/T87790 et al
[22:07:36] yeah we have them for the truly-old hosts, just haven't made them for the recently-in-use ones
[22:07:43] anyways, I'll look at decom-like things later :)
[22:07:45] if they are in the decom column on https://phabricator.wikimedia.org/project/board/951/ they should be seen by papaul in the morning
[22:08:02] he said tomorrow is decom day
[22:08:25] I'm obviously way behind on recent process changes (and netbox too), but my focus has been on getting the functional things done before we're out of time in ams
[22:09:07] yea, makes sense! and it was all very fast
[22:10:10] so you say lvs3001-3004 are going to spare.. i can make that ticket
[22:10:53] I was saying this because running the decom cookbooks takes just a minute compared to a reimage
[22:14:21] (and takes care of netbox too :-P )
[22:15:54] anyway, time to bed for me. ttyl
[22:38:28] lots of netbox errors, im cleaning them up
[22:38:38] report errors due to bad state for where systems are, etc...
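The decom cookbook mentioned above is run from a cumin host; the cookbook name and flags below are assumptions and not confirmed by this log, so listing the available cookbooks first is the safer move:

    # Find the real decommission cookbook name before trusting this sketch:
    sudo cookbook -l | grep -i decom
    # Hypothetical invocation (TXXXXXX stands in for the per-host decom task):
    sudo cookbook sre.hosts.decommission lvs3001.esams.wmnet -t TXXXXXX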
[23:35:24] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle)