[04:02:17] bblack: I can assist installing cp3x hosts if you need it [04:03:34] yeah I was gonna bug you and/or ema about it [04:03:54] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545691/ has the data defs [04:04:14] only cp3055-60 (6 hosts, 3 from each of text and upload) exist physically yet. in theory those 6 are installable now. [04:04:47] there's the whole usual thing to manage about icinga alerts and ipsec timing and enabling the cache::nodes entry at the right time (commented out in the patch), etc... to get them basically installed up into a depooled state [04:05:20] I think we're trying to reach that state today if we can, so we can try to pool them in and depool some of the old ones and decom them (and then hopefully get another batch or two installed soon and repeat) [04:05:53] I've puppeted the installer hosts, in theory they have all the dns/dhcp data ready for it [04:06:33] ook [04:07:31] lvs situation is similar. Only lvs3006 (the new upload lvs) is physically installed yet. Will need arzhel support to set up the router side and bring it in as another secondary first to validate it before switching it into the primary role, etc. [04:08:17] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545696/ was my draft attempt to get those set up in puppet (as non-primaries initially) [04:09:30] I guess the stuff about ipsec BS only applies to text now, not upload, so only half of them :) [04:10:01] yeah, upload is ipsec free \o/ [04:10:18] anyways, if you have it in you to attack some of this, go for it. or bug ema when he's on to work on it. [04:10:36] we need to start pooling stuff in as it's available as best we can, so they can make progress on depool/decom of legacy hosts too [04:12:23] it will necessarily be a bit more aggressive than we usually go on the replacement process of swapping in new nodes and pulling out old, in terms of chash'd cache contents getting effectively-invalidated by the churn [04:12:38] as long as we don't saturate transport refilling the cache, should be ok :) [04:14:09] I can wait on ema to pool/depool stuff, but I'll try to get them up & running (on a depooled state) ASAP [04:14:18] thanks! [04:14:22] np [04:23:18] so cp3055 won't boot via PXE [04:23:32] PXE-E61: Media test failure, check cable [04:23:32] PXE-M0F: Exiting Broadcom PXE ROM. [04:23:32] Booting from Hard drive C: [04:24:15] hmm let's check the NIC BIOS.... [04:32:26] awesome [04:32:40] it's also possible I need to enable the interfaces on the switch or something dumb like that, looking [04:32:48] (did set the vlan stuff and descriptions) [04:34:12] so according to librenms it went down ~10 minutes ago: https://librenms.wikimedia.org/device/device=178/tab=port/port=19252/view=events/ [04:34:59] on the switch side, of the 9 hosts that are set up, only cp3055 has linkdown [04:35:03] I think that the Boot protocol set to None instead of PXE is the culprit [04:35:04] so the rest may work [04:35:09] ok [04:35:17] yeah... but I think that's because I got cp3055 on the broadcom BIOS right now [04:35:18] :) [04:36:16] Last flapped : 2019-10-24 12:45:14 UTC (00:10:46 ago) [04:36:40] I'm guessing the timezone is wrong on the switch heh [04:36:55] uh... 
kinda a:) [04:36:58] well the time in general [04:41:43] yeah none of the NTP associations are working from the switch either [04:41:50] minor details, X can sort it out later :) [04:42:08] I see linkup for cp3055 on the switch side now [04:42:41] no luck though [04:42:53] the NIC trying to boot is "Booting from BRCM MBA Slot 0400 v20.14.0" [04:43:03] same error as before: "PXE-E61: Media test failure, check cable" [04:43:10] hmmm [04:43:42] try the next host and see if it's going to be systemic or just that one? [04:43:53] yep [04:44:00] I'll hit cp3056 [04:44:14] maybe something simple like onboard eth vs cards and which one it's trying to use, etc [04:44:16] but 3055 is the only one marked on the phab task as ready for us to install [04:44:27] yeah... it could be as simple as that [04:44:57] I think all 6 are in the same state, pp probably just didn't want to check a billion checkboxes [04:45:48] I think usually in these boxes, some bios setting disables the onboard ethernet so that the add-in 10G card can be the "primary" [04:45:55] maybe not done on one or all of them, yet [04:45:59] BTW, librenms still shows xe-5/0/15 (cp3055) as down [04:46:17] last event: 2019-10-24 04:21:36 xe-5/0/15 ifOperStatus: up -> down [04:46:58] I got the feeling that 3056 won't boot either... [04:47:08] 2019-10-24 04:46:36 xe-5/0/16 ifOperStatus: up -> down --> cause of this [04:47:36] heh [04:47:46] so the link only goes offline when you try to PXE? :) [04:47:50] apparently [04:48:18] yeah, same issue on 3056 [04:48:24] exact same message [04:48:25] Booting from BRCM MBA Slot 0400 v20.14.0 [04:48:27] yeah 3056 port says: Last flapped : 2019-10-24 13:07:36 UTC (00:00:42 ago) [04:48:31] PXE-E61: Media test failure, check cable [04:48:31] PXE-M0F: Exiting Broadcom PXE ROM. [04:48:36] but it's back up now [04:49:04] weird [04:49:08] let me go poke around in the bios console stuff, maybe something will ring a bell [04:49:11] I'll hit 3057 [04:49:39] sure [04:53:58] bblack: do we have somewhere the MAC of the main NIC for the cp305x boxes? [04:54:25] you mean the onboard 1G we don't use? [04:54:25] cause I'm seeing 4 Broadcom ports on the BIOS... only one has the PXE boot enabled [04:54:50] I think we did dual-port 10G cards [04:54:53] nope.. the one that's actually configured and linked [04:55:04] so probably it's onboard 2x1G and card 2x10G, and the 2x1G should be disabled but aren't [04:55:07] to check that's the one with PXE enabled [04:55:24] the install_server stuff has the 10G macs, supposedly [04:55:29] oh right [04:55:31] the DHCP [04:55:31] I got them from broadcom ctrl+S on the consoles though [04:55:31] sorry [04:55:41] which one? [04:55:49] cause we got two Ctrl+S BIOS in these boxes [04:55:56] I took the first one, that's usually the first port [04:56:01] right [04:56:06] oh wait [04:56:08] the first one first port is the one with PXE enabled [04:56:16] I meant the first of the two macaddrs shown in ctrl+S [04:56:26] I also took the first ctrl+S prompt, but you're saying there's two? [04:56:31] yes [04:56:43] sec I have 3057 console going now, will figure it out [04:56:49] let me try to get a capture for you [04:57:44] ok yeah [04:58:04] so the very first Ctrl+S prompt.... 
that's a dual 1G onboard, I can tell because it identifies as BCM5720 [04:58:11] (which is a dual 1G chip) [04:58:14] oh [04:58:20] then PXE is wrongly configured at least on cp3055 [04:58:34] on all of them, because I put in the DHCP install_server data from those 1G screens :) [04:58:41] it's trying to boot from the first 1G box [04:58:44] s/box/port [04:58:49] but first, let me figure out the whole "disable the onboard" mess [04:58:55] then we can get the right macaddrs after that [04:59:32] is the mac address and setting MBA Configuration --> Boot Protocol to PXE on the right port [05:00:50] yeah F2 Bios -> Integrated Devices -> Onboard NIC1/NIC2 was enabled, set now to disabled on cp3057, let's see what that changes here [05:02:18] I'm hoping it removes the extra ctrl+S firmware thing entirely [05:02:20] so that won't boot cause the other broadcom NIC has PXE disabled [05:02:26] yeah but we can fix that [05:02:28] yup [05:02:38] I'm just trying to find the right set of steps to repro on all of them [05:03:35] there we go [05:03:50] so from the present state of all of these hosts, it's: [05:04:06] F2 Bios -> Integrated Devices -> Onboard NIC1/NIC2 -> Set to "Disable (OS)" [05:04:09] save + reboot [05:04:27] now take the very first Ctrl+S prompt, and it now shows the dual 10G card instead, with new macaddrs, where we need to set up PXE... [05:04:35] hmmm right [05:04:44] on that Ctrl+S prompt what is reporting Link Status? [05:04:49] cause on cp3055 is reporting disconnected [05:04:55] and I'm still getting the same PXE error [05:05:04] donno yet [05:05:10] even when now it's reporting to try to boot from the 10G NIC: "Booting from BRCM MBA Slot 3B00 v214.0.218.0" [05:05:26] note the 3B00 VS the 0400 I reported before [05:07:08] it's also possible pp plugged the onboards into the switch rather than the 10Gs [05:07:12] still digging [05:07:31] hmm the switch reports the mac address on the other side of the port? [05:07:37] not really [05:07:45] unless you're sending traffic, then you can kinda see [05:08:08] but we can check link speed :) [05:08:21] so the 10G BIOS reports the link status [05:08:29] yeah I see that [05:08:35] the switch says says 10G speed though [05:10:09] anyways, disabling the onboard 1G in bios is certainly *a* step that needs taking on all of these [05:10:11] right... [05:10:15] makes me wonder if any of the bios settings were done yet [05:10:26] at least on cp3055 the BIOS reports link on the second port of the 10G NIC [05:12:10] we need them two switch the cable physically I'm afraid [05:12:27] oh it does? [05:12:36] on mine I didn't see link on either, from the ctrl+s info [05:12:52] F2 --> Device Settings [05:12:56] I could see it there [05:12:57] but it makes sense with how the switch looks [05:13:11] I've disabled the Embedded NIC on cp3055 [05:13:24] going for the same on cp3056 and check the linked port there as well [05:13:51] if only someone had invented a way for the physical ports and all logical names in bios and linux to be aligned so that these mistakes never happen. [05:14:00] enpsf03isa0i3maoz0z0 [05:14:02] ahahahah [05:14:10] <3 gotta love those predictable names [05:14:19] we have the same issues every fricking time [05:14:21] *sigh* [05:15:08] in addition to cp3055-60, there's also lvs3006, ganeti3002, and dns3002 [05:15:53] (total 9 machines that are powered up and in the same rack together. 
they're probably all in the same approximate state, and therefore all have the embedded NICs turned on which need disabling, and I recorded the wrong (1G) macaddr for them all in install_server dhcp settings. [05:16:04] and then probably they all need a cable move too once EU gets back onsite [05:16:10] ack [05:16:28] I'm gonna go get some sleep :) [05:16:43] I disabled onboard on cp3057, and turned on PXE on the first 10G port [05:16:49] cool [05:16:53] I'll add new boxes on the phab task [05:17:04] and tick them as I go from server to server [05:17:25] there's rack/setup/install tasks for each of the node types [05:17:50] all can be found under the meta-task https://phabricator.wikimedia.org/T235805 [05:20:38] same thing in cp3056... link on the second port of the 10G NIC :) [06:12:13] all (available) servers done, MACs replaced on puppet and waiting for dcops to switch the ethernet calbes [06:12:15] *cables [07:35:06] ema: I guess that ats-backend needs some tuning for the new cp hosts on esams [07:35:41] right now it's complaining about sda3 [07:35:47] Oct 24 07:35:00 cp3055 traffic_manager[33827]: [Oct 24 07:35:00.688] {0x2b4de5175180} WARNING: unable to open '/dev/sda3': No such file or directory [07:35:47] Oct 24 07:35:00 cp3055 traffic_manager[33827]: [Oct 24 07:35:00.688] {0x2b4de5175180} WARNING: could not initialize storage "/dev/sda3" [file not found] [07:38:23] ema: https://gerrit.wikimedia.org/r/#/c/545706/ something like this? [07:50:30] now trafficserver is happier on cp3055 [07:59:29] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Gilles) It is indeed unusual for this to apply to specific pages of a small PDF, even moreso fo... [08:00:53] ema: could you check https://gerrit.wikimedia.org/r/c/operations/puppet/+/545711/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/545712/ ? thanks! [08:05:36] vgutierrez: yeah, storage config on cp3055 looks good [08:05:59] yey, I've replicated that for the upload hosts [08:06:06] and added the varnish storage parts for the text ones [08:06:36] also please take a look to https://gerrit.wikimedia.org/r/c/operations/puppet/+/545691 [08:06:45] that's from bbl.ack [08:07:12] I think it's sane, but you're more familiar with that [08:09:53] we could deploy the new text hosts as text_ats actually [08:10:26] why bother installing varnish on them just to reimage in a few days? [08:11:29] hmmm [08:11:32] up to you [08:11:47] but those should get prod traffic today [08:12:14] from what Brandon said before [08:13:21] right, let's not rush things then. +1 [08:13:27] merging [08:15:51] we got an error during the debian installation on cp3055, I'm imaging dns3002 to see if it's related to the cp3055 nvme driver or hw related somehow [08:16:23] oh, but 3055 seems to be alive and kicking? [08:18:05] yes [08:18:11] I acked the error [08:18:16] and it continued the installation [08:18:22] oh I see [08:18:30] what was it? [08:19:17] it crashed on the late_command.sh execution [08:19:58] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['dns3002.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reima... [08:20:44] let 's see if dns3002 now boots via PXE after fixing the FQDN... 
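A quick sanity check for the ats-be storage fix discussed above, once the patch is merged and puppet has run. This is only a sketch: the raw device path (/dev/nvme0n1) and the systemd unit name (trafficserver) are assumptions and may differ on the actual hosts; the warning string is the one quoted at 07:35.

    # Confirm ats-be now points at the raw NVMe device instead of the stale
    # /dev/sda3 entry carried over from the older hosts, and that the startup
    # warning is gone (unit name is an assumption).
    cat /etc/trafficserver/storage.config            # expect something like: /dev/nvme0n1 volume=1
    sudo journalctl -u trafficserver --since '15 min ago' \
        | grep -i 'could not initialize storage' || echo 'storage initialized cleanly'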
[08:21:09] XioNoX: so.. the lvs boxes.. the IPs are already on the DNS [08:21:25] 3006 is 10.20.0.16 && 2620:0:862:102:10:20:0:16 [08:21:45] .15 and .17 for 3005 and 3007 respectively [08:21:57] vgutierrez: cool, added 3006 [08:22:00] adding the other two [08:22:09] to the router side of bgp [08:22:17] so when you setup pybal it should come up [08:22:22] awesome [08:22:51] yey... dns3002 is booting now [08:22:52] cool [08:23:06] let me trigger lvs3006 as well [08:23:36] hmmm [08:25:13] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Gilles) It seems like the ghostscript command used by Thumbor outputs some errors to stdout tha... [08:25:43] ema: https://gerrit.wikimedia.org/r/c/operations/puppet/+/545696 looks good? [08:25:53] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) [08:26:38] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) a:03Gilles [08:27:08] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) I will try looking at this in my spare time, but can't promise a... [08:28:37] vgutierrez: it does [08:28:42] cool [08:28:46] we need to fix the icinga config error [08:28:52] cause icinga is not adding new hosts [08:29:07] i.e cp3055 is not showing there [08:30:40] xionox already merged the change.. [08:30:41] XioNoX pushed a fix [08:30:44] yup [08:30:51] should recover with the next puppet run on icinga1001 [08:31:14] I'm triggering one right now [08:31:25] ack [08:31:29] I manually run it after merging too [08:31:54] but it's something more complex I think, like puppet needs to run on the host first [08:31:56] then on icinga [08:32:01] or something like that [08:32:58] hmmm the puppet run on icinga was almost a NOOP right now [08:33:03] nothing related to icinga itself [08:33:04] can I help? tl;dr of th ebacklog? [08:33:26] icinga config is broken apparently [08:33:47] Error: Could not find any hostgroup matching 'asw2-esams.mgmt.esams.wmnet' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 7534) [08:35:18] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Elitre) In the meantime, you have all my appreciation. 
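The hostgroup error above can be reproduced by hand on the Icinga host while waiting for the next puppet run; a minimal sketch, assuming the stock Icinga 1.x layout quoted in the error message.

    # Run the same config verification the daemon does on reload, then look at
    # the offending hostgroup reference in the puppet-generated objects file.
    sudo icinga -v /etc/icinga/icinga.cfg
    grep -n 'asw2-esams' /etc/icinga/objects/puppet_hosts.cfg | head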
[08:36:36] vgutierrez, XioNoX: the hostgroup is asw2-esams, not FQDN [08:37:00] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3002.wikimedia.org'] ` Of which those **FAILED**: ` ['dns3002.wikimedia.org'] ` [08:37:56] lovely :) [08:38:23] the puppet error on dns3002 is the typical Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Package[ntp] is already declared at (file: /etc/puppet/modules/standard/manifests/ntp/timesyncd.pp, line: 7); cannot redeclare (file: [08:38:23] /etc/puppet/modules/ntp/manifests/daemon.pp, line: 3) (file: /etc/puppet/modules/ntp/manifests/daemon.pp, line: 3, column: 5) (file: /etc/puppet/modules/profile/manifests/ntp.pp, line: 81) on node dns3002.wikimedia.org [08:38:46] * vgutierrez trying to remember how to fix that... [08:39:11] use require_package ? [08:39:34] :) [08:40:28] I'd say dns3002 is missing from ntp_peers hiera structure [08:40:29] who's fixing icinga? [08:41:25] vgutierrez: I've a fix in mind for that, let me bring you an example [08:41:34] uh? [08:41:41] it's a "config" missing issue [08:41:45] not a puppet code issue itself [08:42:14] isn't that gathered dynamically via puppetdb? [08:42:26] with query_nodes() [08:43:19] ah, no, it's harcoded :( [08:43:25] it should be dymanic IMHO :D [08:44:12] I'd say the culprit is https://gerrit.wikimedia.org/r/#/c/545744/ [08:45:33] yeah sure [08:46:07] btw if you tail the cumin logs during the reimage (path at the top of the output) you can see the puppet run and fix things before the timeout triggers [08:46:38] a race against the machine... [08:46:40] ;P [08:48:37] lol [08:49:37] ema: https://gerrit.wikimedia.org/r/#/c/545752/ still applies to ATS? /cc moritzm [08:49:58] asking cause ATS use the raw device instead of a filesystem [08:51:04] if not, we need to alternative change the partman recipe for the new esams caches (as they'll fail with the current late-command handling for cp hosts) [08:51:26] well... for text still applies [08:51:33] at least for a few days/weeks [08:52:33] vgutierrez: we don't need to partition the disk, no [08:53:08] but maybe it's worth it to revisit it after we migrate everything to ats-be [08:56:46] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['lvs3006.esams.wmnet'] ` The log can be found in `/var/log/wm... [08:57:52] volans: so.. what needs to be fixed on icinga regarding asw2-esams? [08:57:56] XioNoX, vgutierrez, moritzm: I've an errand to run, icinga config is till broken, that means no new hosts added, and related downtime, etc... 
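For the ntp duplicate-declaration error above, the working theory is that dns3002 is simply missing from the hard-coded ntp_peers hiera structure; a hedged way to check, assuming a local checkout of operations/puppet (the hieradata path is an assumption, ntp_peers is the key name mentioned above).

    # From the puppet repo: is the new host referenced in the ntp-related hieradata?
    git grep -n 'dns3002' -- hieradata/ | grep -i ntp
    git grep -n 'ntp_peers' -- hieradata/ | head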
[08:58:02] was just writing :) [08:58:22] cause I do need icinga [08:59:11] it's set as parents [08:59:16] not sure where in the code though [08:59:27] for cp3055 [08:59:32] hostgroups cache_upload_esams,asw2-esams.mgmt.esams.wmnet [08:59:35] parents asw2-esams.mgmt.esams.wmnet [09:00:37] this is the generated config [09:01:26] modules/monitoring/manifests/host.pp: $real_parents = $facts['lldp_parent'] [09:02:51] vgutierrez: [09:02:51] cp3055 0 ~$ sudo facter -p lldp_parent [09:02:51] asw2-esams.mgmt.esams.wmnet [09:03:06] ack [09:03:14] cp2024 0 ~$ sudo facter -p lldp_parent [09:03:14] asw-d-codfw [09:03:18] why I don't know :) [09:03:20] but that's the culprit [09:03:31] uh [09:04:14] sorry, gotta go afk for an errand for a bit [09:04:26] compare it also with https://puppetboard.wikimedia.org/fact/lldp_parent [09:05:03] cp3055 and dns3002 came up with the FQDN [09:05:36] XioNoX: that could be related to the LLDP config on asw2? [09:05:44] if needed we can create also teh hostgroup with the FQDN in hieradata/common/monitoring.yaml if we're migrating to it, but would be better to understand why it's different [09:05:54] XioNoX: iding itself as the FQDN instead of the base hostname? [09:05:59] possible [09:06:19] vgutierrez: can you run a lldpctl from an host on the old and new switch stack? [09:06:58] old one: SysName: asw-esams [09:07:17] new one: SysName: asw2-esams.mgmt.esams.wmnet [09:07:30] interesting [09:07:41] * volans errand, bbiab [09:08:18] so the LLDP config is the same on both sides [09:08:28] different junos version? [09:09:14] 14.x VS 18.x [09:09:18] (lldp told me) [09:09:33] it's possible yeah [09:10:08] let me know if there is yet anothe knob to tweak [09:10:14] er, let me check* :) [09:10:24] ack :) [09:10:28] vgutierrez: I've merged the late-command patch and ran puppet on install*, cp3055 should probably be reimaged so that it matches the 3036 and later? [09:10:46] 3056? right <3 [09:11:15] vgutierrez: fyi I think akosiaris implemented that LLDP to icinga parent feature [09:11:39] in case nothing can be done on the switch side and it needs to be fixed on the puppet side [09:12:07] so as volans mentioned we could add the FQDN as the group name on icinga [09:12:28] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3055.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... [09:12:29] moritzm: I'm triggering a reimage now [09:12:59] hmmm maybe I was too fast [09:13:10] moritzm: I need to trigger a puppet run on install1002, right? [09:13:24] https://lists.gt.net/nsp/juniper/66466 [09:14:00] I ran puppet on install* via Cumin already [09:14:11] so if you don't see anything changed in puppet output, that's fine :-) [09:14:54] oh ok [09:14:54] :) [09:15:02] vgutierrez: so from that thread, it's now the new behavior for junos to do it that way [09:15:02] thx [09:15:08] ack [09:15:18] so let's change the group name to the FQDN then? [09:15:30] wfm [09:15:46] I can also remove the domain name on the switch side [09:16:18] but it would be not standard on our side (vs. 
all other devices) [09:17:15] dunno what are the implications of the change in your side TBH [09:18:49] none as far as I know other than config differences from our standards [09:19:45] if the implcation on the puppet side are more than a variable change let's do it on the switch side, otherwise on the switch [09:20:14] bah I can't even type normal sentences :) [09:20:22] but you understand me [09:22:58] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3006.esams.wmnet'] ` and were **ALL** successful. [09:25:18] XioNoX: https://gerrit.wikimedia.org/r/#/c/545782/ [09:25:27] moritzm: are you ok with the approach? :) [09:27:22] akosiaris: ^^ [09:29:29] it seems fine to me, slowly our network gear will converge to 18.x and we need to start at some point [09:29:46] btw, cp3055 reimaged succesfully this time, thx <3 [09:30:34] vgutierrez: hm I'm also wondering if https://github.com/wikimedia/puppet/blob/847cc7412b66886b4992b3cadd2db30d9c95afd0/modules/netops/manifests/monitoring.pp#L116 should change too [09:30:47] cool :-) [09:30:57] otherwise how does it "know" that this is the parent host [09:38:59] hmm ack [09:39:02] I'll fix that [09:43:37] XioNoX: change updated [09:46:41] XioNoX: Oct 24 09:45:52 lvs3006 pybal[11502]: [bgp.BGPFactory@0x7f66d7bde488] INFO: BGP session established for ASN 64600 peer 91.198.174.244 [09:46:48] XioNoX: Oct 24 09:45:52 lvs3006 pybal[11502]: [bgp.BGPFactory@0x7f66d7c275f0] INFO: BGP session established for ASN 64600 peer 91.198.174.245 [09:46:48] nice! [09:46:53] yup [09:47:09] dns_rec monitored services are struggling though [09:58:50] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3055.esams.wmnet'] ` and were **ALL** successful. [10:02:21] Oct 24 09:59:51 lvs3006 pybal[11502]: [dns_rec_53 IdleConnection] WARN: maerlant.wikimedia.org (enabled/partially up/pooled): Connection to 91.198.174.122:53 failed. [10:02:38] ^^ XioNoX, that's ferm related or router related? [10:03:49] iptables seems fine [10:03:56] 42273 2771K ACCEPT udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:53 [10:04:51] vgutierrez: what's the issue? [10:05:06] lvs3006 to 91.198.174.122 ? [10:05:09] yep [10:05:27] and lvs3006 to 91.198.174.106 [10:05:30] (nescio) [10:05:35] should I be able to ssh to lvs3006? [10:05:39] yes [10:05:43] I'm in via ssh right now [10:05:45] I can't ssh [10:05:48] uh? [10:05:53] me neither [10:06:25] ping -4 does not work [10:06:28] IPv6 does [10:06:37] bast3002:~$ nc -zv lvs3006.esams.wmnet 22 stalls [10:06:41] I've just opened another session to lvs3006.esams.wmnet [10:06:44] via bast5001 [10:07:01] vgutierrez: IPv4 is broken (10.20.0.16) [10:07:54] hmmm [10:08:43] * volans back, sorry took longer than expected [10:08:59] so, the other option is to do it in facter [10:09:06] and remove the FQDN part [10:09:19] 64 bytes from lvs3006.esams.wmnet (10.20.0.16): icmp_seq=1 ttl=59 time=318 ms [10:09:28] so... it's broken from bast3002 [10:09:45] it's already a custom fact [10:09:56] vgutierrez: IPv4 is broken from maerlant as well [10:10:03] err [10:10:08] it's a issue only within esams [10:10:28] wtf? 
:) [10:11:04] bast1002 can ping lvs3006 on ipv4 and ipv6 as well [10:11:10] but bast3002 can't [10:12:28] lvs3006 can ping cp3030 just fine [10:12:51] I can send a patch for the lldp stuff [10:13:17] volans: there is a patch already with the opposite approach [10:13:34] changing the hostgroup in hiera? [10:13:37] yes [10:13:56] volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/545782 [10:14:04] ok, that works too, *but* note that all lldp facters use that value [10:14:25] lldp, lldp_neighbors, lldp_parent [10:14:36] so depending where they are used in puppet code [10:14:45] we might have other weird behaviours and breakage [10:15:33] from a quick git grep it seems used only for icinga [10:15:38] but wanted to mention it [10:16:04] hmmm lvs3006 it's on a weird state [10:16:07] let me reboot it [10:16:32] cause our lvs boxes have netfilter banned [10:16:34] right? [10:16:58] correct [10:17:06] spare --> lvs transition [10:17:55] 10Traffic, 10Operations, 10observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10fgiunchedi) I've updated the frontend-traffic dashboard to include global availability correctly, and got rid of the summed value [10:18:49] meh, we really need a role(in_setup_no_ferm) or so [10:20:42] yeah [10:20:42] so.. it isn't iptables related [10:22:28] icinga config back OK [10:23:29] akosiaris: that's the best doc so far - https://lists.gt.net/nsp/juniper/66466 [10:24:58] akosiaris: and that's the PR (problem report) behind auth https://usercontent.irccloud-cdn.com/file/JgVsIblv/Screenshot_2019-10-24%20PR1383295%20-%20Juniper%20Networks%20PR%20Search.png [10:25:22] I replied in the CR :) [10:28:12] me too! [10:29:15] XioNoX: re lvs3006 [10:29:18] vgutierrez@lvs3006:~$ ping -4 bast3002.wikimedia.org [10:29:18] PING bast3002.wikimedia.org (91.198.174.113) 56(84) bytes of data. [10:29:18] From vl100-enp5s0f0.lvs3006.esams.wmnet (91.198.174.16) icmp_seq=1 Destination Host Unreachable [10:30:21] routing issue on lvs3006 ? [10:30:30] XioNoX: thanks! [10:30:55] vgutierrez: seems like the packets don't know how to leave the machine [10:31:04] so someone actually wanted that [10:31:06] vgutierrez: check the routes and if they are associated to the proper interfaces? [10:32:05] cause XioNoX so... 91.198.174.16 it's on enp175s0f0.100 at lvs3006 [10:32:59] they wanted to add the FQDN instead of the hostname. And ok, juniper decided to implement, fine. But nope [10:33:00] While backward compatibility is a priority, these were PRs and [10:33:00] intentional fixes to move toward standard behavior, and I'm told [10:33:00] they were in the release notes as customer visible changes. [10:33:05] ffs [10:33:30] but I cannot see how's that any different from any other lvs box [10:33:37] my my. Anyway, I guess not high priority [10:34:28] XioNoX: the route table is exactly the same as the one in lvs3002 [10:34:29] sigh [10:34:33] XioNoX: I like the useless table in "Resolved in" in the PR you posted. [10:34:45] the second column named "junos" and always having an x [10:36:14] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201... 
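A compact reproduction of the mismatch being debated above, comparing the custom fact with what each switch stack advertises (the commands and values below are the ones already quoted in this log).

    # On a host behind an old stack vs. one behind asw2-esams:
    sudo facter -p lldp_parent     # cp2024 -> asw-d-codfw, cp3055 -> asw2-esams.mgmt.esams.wmnet
    sudo lldpctl | grep SysName    # old esams stack: asw-esams; newer JunOS advertises the FQDN,
                                   # hence the icinga hostgroup/parent mismatch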
[10:36:14] hmmm [10:36:24] 10:35:40.181713 ARP, Request who-has 91.198.174.113 tell 91.198.174.16, length 28 [10:37:17] akosiaris: yeah not much we can do anyway, saying "it's in the release notes" is just BS, they have thousands of pages of release notes [10:37:38] vgutierrez: only v4 doesn't work? [10:37:40] XioNoX: could be a vlan issue on the switch side? [10:37:46] yep [10:37:47] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [10:38:48] vlan issue should be all or nothing [10:38:51] but yeah looking [10:38:52] what doesn't work is IPv4 on the public vlan [10:39:01] cause I can ping cp3030 via IPv4 [10:39:16] bast3002 <--> lvs3006 goes via the public vlan though [10:40:49] so lvs3006 is only in the private vlan [10:40:58] shouldn't it be trunked? [10:41:06] with the private as native? [10:41:11] let me check other sites [10:41:25] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3057.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... [10:41:51] yeah it's that [10:41:53] fixing [10:41:54] :_) [10:41:56] akosiaris: I don't have strong preferences hostname vs fqdn here, I think in the long term we should be consistent, so either convert them to hostname in the fact, or find a knob to change either the new or the old behaviour in junos. Just waiting for the junos to get upgraded/replaced seems a bit too far down the line [10:42:25] but I might miss some other corner cases in which having the fact and what's reported by lldp differ might bite us [10:44:27] vgutierrez: it works now :) [10:44:30] Oct 24 10:44:06 lvs3006 pybal[1818]: [dns_rec_53_udp] INFO: Server maerlant.wikimedia.org (enabled/partially up/pooled) is up [10:44:31] oh yes [10:44:34] XioNoX: thx :D [10:44:35] volans: I don't particularly care about one vs the other. It's the non-overridable change in behavior that's killing me. [10:44:40] XioNoX: nice! [10:44:48] 64 bytes from bast3002.wikimedia.org (91.198.174.113): icmp_seq=1 ttl=64 time=0.124 ms [10:44:49] yeah I know [10:44:50] lovely [10:45:42] we can ofc just change the LLDP fact just a bit to achieve compatibility, but from the looks of it this is going to break for SNMP as well ? [10:46:05] not sure, need to crossref versions [10:46:28] let me fix the iface name on dns as well [10:46:46] cause of course.. ENI is SO predictable that even bbl.ack can't get it right [10:47:05] and we know that bbl.ack is always right [10:49:24] yeah I'm worried too that a fix in the lldp facts would not be enough [10:51:07] I don't think we use the hostname var from snmp [10:51:17] akosiaris: ^ [10:52:26] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3060.esams.wmnet'] ` [10:52:47] uh? [10:52:53] XioNoX: not even in librenms [10:52:54] ? [10:53:03] who knows what librenms does on the other hand ...
[10:53:03] only display it [10:53:21] oh my all my internal alarms are going to go off on that one [10:53:22] 10:52:15 | cp3060.esams.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run [10:53:27] happy times [10:53:43] imagine a list of all hosts sorted by sysName and then 1 being different [10:53:48] better call volans [10:54:25] ema: weird, cp3055 was happy [10:54:38] cp3060 is a text one though [10:54:45] maybe something is missing there [10:54:58] ema: check the cumin log (path at the top of the output), go to the bottom [10:55:03] there is the outut of the puppet run [10:55:05] less -R [10:55:09] (old issue) [10:56:30] volans: we really really need to fix the fact that cumin does not log to irc the hostname of the host it's acting on [10:56:36] as it is now the !log is useless [10:56:45] (see #-operations right now) [10:56:45] you mean spicerack [10:56:55] yes I need to go back to my patch that was blocked in review [10:57:37] volans: due to PEBKAC I've lost the output of my reimage. Where do I find logs? :) [10:58:03] ah, under /var/log/wmf-auto-reimage/ [10:58:15] yep [10:58:31] with hostname and user [10:58:38] the _cumin one [10:59:09] ah yes [10:59:18] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, cron_splay(): this host not in set (file: /etc/puppet/modules/cacheproxy/manifests/cron_restart.pp, line: 16, column: 14) on node cp3060.esams.wmnet [10:59:45] do you use query_nodes there? [11:00:03] vgutierrez: that's why the cache_upload reimage worked, there's no need to restart shit in cron there [11:00:23] ouch! [11:00:32] ema: if you use query_nodes I've the fix [11:00:42] already used elseqhere [11:02:21] volans: nope, hiera [11:02:48] :/ all hardcoded [11:06:25] volans: not for long :) [11:06:51] vgutierrez: I'm gonna add cp3060 to cache::nodes, start the reimage and go for lunch [11:07:01] so if you move that to puppetdb query_nodes, don't forget to use the trick: [11:07:04] ema: ack [11:07:08] unique(concat(query_nodes('yourquery'), [$::fqdn])) [11:07:11] cp3057 is finishing and I'm going for 3059 as well [11:07:24] after that I think I'm leaving the rest for you [11:07:34] I've been working 9h non stop now :) [11:09:37] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3059.esams.wmnet'] ` The log can be found in `... [11:09:58] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo... [11:10:09] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3060.esams.wmnet'] ` [11:10:20] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo... 
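For anyone else digging out a lost reimage output, a small sketch of the log locations mentioned above (the example filename is the cp3060 one quoted a bit later in this log; the timestamp/user/pid parts differ per run).

    # On the cumin host: newest logs first; the *_cumin.out file carries the full
    # puppet run output at the bottom (it is colorized, hence less -R).
    ls -t /var/log/wmf-auto-reimage/ | head
    less -R /var/log/wmf-auto-reimage/201910241110_ema_214430_cp3060_esams_wmnet_cumin.out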
[11:11:20] volans: thanks! [11:11:21] vgutierrez: go! [11:11:26] yeah... [11:11:29] after cp3059 ;P [11:12:02] dns3002 seems happy, lvs3006 as well (bgp with both routers) [11:12:19] cp3055 is all green on icinga as well [11:12:38] I'll leave as soon as I reach the same state for 3057 and 3059 [11:14:26] imagine you guys saw this but incase not https://cpdos.org/ [11:14:41] yeah [11:14:50] well.. at least I've seen it [11:16:35] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3057.esams.wmnet'] ` and were **ALL** successful. [11:17:58] ema: cp3060 is screaming on the ipsec checks already :_) [11:35:54] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Vgutierrez) [11:36:12] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Vgutierrez) [11:36:29] what was the lvs<->bast issue? [11:36:38] (there should be routability anyways) [11:36:56] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Vgutierrez) [11:37:14] bblack: vlan configuration on the switch side [11:37:56] yeah I get the mimssing vlan config for lvs3006 port [11:38:18] oh now I really get it, ok. so lvs3006 had the vlan trunking configured on its end, so couldn't reach bast [11:38:31] I'm still catching up on all the puppet level fixes, etc [11:38:51] so I've fixed the dns3002 FQDN on the DHCP config, the iface name for lvs3006... [11:38:59] interface_tweaks... [11:39:03] the usual suspects :) [11:39:23] what was up with interface_tweaks? [11:39:33] oh the hieradata for lvs3006 interface name being wrong [11:39:35] it was missing [11:39:41] and our puppetization doesn't like that [11:39:53] ? [11:40:15] the hieradata was there, just wrong interface name. what was missing? [11:41:11] and yeah the nvme late_command fix, I guess you reinstalled any that missed it (or manually executed what it does, one of the two) [11:41:45] yeah, I reimaged cp3055 [11:41:49] bblack: https://gerrit.wikimedia.org/r/c/operations/puppet/+/545785/1/hieradata/hosts/lvs3006.yaml [11:42:04] interface_tweaks not interfaces.yaml [11:42:34] ok [11:42:48] right, I copied from ulsfo, which has it set at the per-dc level right now since they all match heh [11:43:39] oh, I've also added the storage config for the cp hosts [11:44:35] ah yeah, good catch! [11:45:14] so.. as soon as cp3059 is happy I'm out of here [11:45:32] ema was fighting a little bit with cp3060 and it's on his lunch break now [11:45:35] oh [11:45:37] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3059.esams.wmnet'] ` and were **ALL** successful. [11:45:43] you added the storage_parts I guess, but not the sizing? [11:46:04] hmmm [11:46:05] will fix that bit [11:46:58] oh.. I think I know what you mean.. nope, I didn't add it [11:47:52] yeah https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545820/ [11:48:23] I think it only actually affects the varnish-be storage params on the CLI executed by systemd, so restarting the varnish-be process after the change is pushed should fix it [11:48:44] cool.. 
that will be necessary only on cp3060 so far [11:48:53] BTW, papaul reported on -traffic a SSD issue on cp3056 [11:48:59] s/-traffic/-dcops/ [11:49:26] that sucks, because I doubt we can get a replacement in this week [11:49:35] but we can live with a dead node initially :) [11:50:06] one last thing: any confirm on whether bios setup was done at all and just the nic-disable missed, or if we really need to go do all the bios settings? [11:50:23] hmmm nope [11:50:29] ok I'll ask in a bit [11:50:31] I just reported what we did on -dcops [11:50:35] but nothing else [11:50:40] ok thanks! [11:51:06] so.. TL;DR: cp305[5,7,9] are happy, dns3002 and lvs3006 too [11:51:32] did arzhel already do the router sides for dns3002/lvs3006? [11:51:48] so for lvs3006 he did at least the bgp parts [11:51:48] lvs3006 yes [11:51:53] cause pybal is reporting as expected [11:51:57] ok awesome [11:54:21] yeah.. for some reason wmf-auto-reimage is failing to downtime the new hosts [11:55:51] hmmm [11:56:03] I assume none of these new cps are pooled yet [11:56:07] nope [11:56:23] adding BGP for dns2002 [11:56:26] er, 3002 [11:57:29] technically the disks are slightly different from eqiad, too (Dell stopped selling the previous PM1725a, now it's PM1725b), so I'm digging around on cp3055 right now to confirm the attributes of it, etc [11:57:42] ack [11:57:45] I'm off for dinner [11:57:52] ping me if you need something from my side :) [11:58:04] thakns for all the help! [11:58:07] np! [11:58:43] err I guess 3056, that one is text :) [11:58:51] I can check the missed downtimes if needed (back from lunch, didn't read full backlog yet) [11:58:55] oh its the bad one [12:00:36] none of the text nodes are avail yet except cp3060 which I think is still in reimage script, so will have to wait on the sizing confirm, etc [12:02:11] XioNoX: other random thing from last night, asw2-esams couldn't reach any NTP servers and had the wrong clock time [12:02:41] ah, thx for the head's up, not urgent adding it to my list [12:06:55] ema: you're on the cp3060 reimage script I think? it looks like it might be getting into a bad state, but maybe it will recover, who knows [12:06:58] spewing kernel:[ 1453.787072] NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! 
[charon:1226] [12:07:07] and loadavg is at 82 and climbing, during the puppet run [12:08:50] that's the puppet run after the reboot after the installation [12:09:12] see /var/log/wmf-auto-reimage/201910241110_ema_214430_cp3060_esams_wmnet_cumin.out on cumin1001 [12:09:19] sorry, the one without _cumin [12:09:23] 201910241110_ema_214430_cp3060_esams_wmnet.log [12:18:43] volans: random unimportant finding for later: [12:18:44] 2019-10-24 11:48:28 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 5.0 minutes [12:18:48] 2019-10-24 11:55:43 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 10.0 minutes [12:18:51] 2019-10-24 12:02:58 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 15.0 minutes [12:18:54] 2019-10-24 12:10:13 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 20.0 minutes [12:18:57] 2019-10-24 12:17:29 [INFO] (ema) wmf-auto-reimage::print_line: Still waiting for Puppet after 25.0 minutes [12:19:03] ^ "still waiting" claims 5 minute intervals in the message, but the stamps are ~ 7m15s apart [12:19:19] lol [12:20:58] technically correct [12:21:12] "after" 5 minutes :) [12:21:20] I'm wondering if the check is so slow that it adds time in this case [12:21:56] well kinda technically correct [12:22:39] the consistency of the 7m15s thing is odd. it's probably a 5 minute actual sleep, followed by a ~2m timeout checking things again, or something [12:22:47] I still can't ssh after 30s [12:22:50] kernel:[ 2459.419806] NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [charon:1226] [12:23:04] yeah probably it has hardware issues [12:23:13] good start! [12:23:15] but it's also possible it's just a temporary glitch, could try killing ipsec [12:23:19] sounds like it, might be a cpu not well connected [12:23:20] I've managed to ssh [12:23:24] (or re-rebooting) [12:23:36] only charon (ipsec) has been in those msgs, that I've seen [12:23:45] yeah charon is using 100% cpu [12:23:48] trying to stop it [12:24:07] the check itself takes 0.6s but cumin does it via ssh [12:24:13] that might explain the added time [12:26:27] no luck, rebooting [12:29:04] the host looks fine now, puppet is doing its puppeting [12:29:19] https://etherpad.wikimedia.org/p/esams-followup [12:29:33] ^ started that, from our perspective not being onsite, things to remember as we go [12:29:52] ack [12:31:45] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` and were **ALL** successful. [12:31:46] 3060 stabilized? [12:31:51] maybe just awful ipsec [12:32:01] yeah, it came up fine and puppet did its thing [12:32:20] ema: it's the first text node that's avail, I need to go stare at the nvme config on it to see if it needs tweaking [12:32:34] ok [12:32:38] are the new servers in service yet? [12:32:41] sorry, haven't been following that [12:32:53] maybe I should say "already" :) [12:33:42] mark: WIP :) [12:34:12] cool :) [12:34:50] we don't have many of them yet :) [12:35:07] (and 1/6 cp's may have initial hw issues so far) [12:36:01] bblack: an important thing to keep in mind is that due to recent changes to confctl/etcd, default weights do not exist anymore [12:36:12] so we manually have to set the weight for ats/varnish-be to 100 [12:36:22] wat? [12:36:39] weights really "don't exist"? [12:36:52] where do we manually set them?
[12:37:17] we need to do things like: [12:37:18] sudo -i confctl select name=cp4029.ulsfo.wmnet,service=ats-be set/weight=100 [12:37:44] oh weights still exist, they're just not provisioned initially? [12:37:48] correct [12:37:59] ok :) [12:38:05] still kinda crazy, but ok [12:38:05] "default weights" are gone [12:38:12] why? [12:38:17] not that this is the time I guess [12:38:28] or, rather the chance to set them to something different than 0 [12:38:45] I'm done with 3060, all looks good on the nvme stuff [12:39:05] (there's like 10MB more room, but that's too small to matter to our sizing param, and the nvme formatting worked fine too) [12:39:16] bblack: d9f83487c6 is the commit that introduced the change, I'm not 100% clear on the rationale [12:41:19] I don't think we've ever really used weights anyways between varnish-fe<->be? [12:42:30] we have sometimes during past transitions between weak nodes and much stronger nodes (e.g. the initial transition away from the old amssq caches in esams) [12:42:45] but I don't think we've ever used them usefully outside of a transition [12:43:21] weights also currently make critical differences at the IPVS layer for the front edge [12:43:55] (they map to weights in ipvs, and with our hash balancer it has a direct impact on hashing. e.g. if you change all the frontends in a dc from weight=1 to weight=2, it may rehash all clients) [12:44:14] (also, if the sum of all ipvs front weights in a cluster+dc exceed 255 it won't work right with ipvs hash, last I checked) [12:44:21] (that's why they're small values now) [12:45:35] oh very good point [12:45:46] so we need to manually set the -fe weights to 1 too :( [12:46:17] yeah [12:46:44] there's 2x fe services, the 'varnish-fe' and 'nginx' (which is sometimes ats-tls now, but keeps the name heh) [12:47:57] XioNoX: any general status update on e.g. if we're likely to get more hosts (racks 14 or 16) today? [12:48:14] or blocked on other dependencies? [12:50:48] _joe_, cdanis: any chance to re-introduce default weights for confd services? Setting them by hand isn't great [12:51:58] <_joe_> the default weight to zero makes sense in a lot of ways, the idea being a newly-created object is uninitialized [12:52:18] yeah but if we know all our cluster's real default for live nodes is 100 [12:52:24] <_joe_> so I am quite against reintroducing per-service defaults. They proved harmful in a lot of cases [12:52:31] it means every new cluster member, we have to explicitly set it to 100 and remember that it should be 100 by looking at the others [12:52:37] <_joe_> bblack: it's also that the node is pooled [12:53:07] what was harmful about a default value for a cluster? [12:53:37] <_joe_> that in some cases, people misused the default values to set nodes to pooled=true for example [12:53:52] <_joe_> or to weights that had little to do with what they actually used in production [12:53:59] doing this manually is a lot of additional toil, and error prone really [12:54:03] <_joe_> bblack: you are telling me you don't really use the weight [12:54:26] <_joe_> ema: so you automatically pool your nodes in some way? [12:54:45] no, I'm telling you we have default weights that work for us 99.999% of the time, but we do occasionaly like to have the capability to change them. [12:54:49] <_joe_> because they were depooled by default before [12:54:59] so we like having weights, and we like having them default to their standard values [12:55:00] <_joe_> ema: correct? 
[12:55:18] _joe_: not automatically, we run 'pool' on the host usually [12:55:22] <_joe_> ok [12:55:33] I can understand blocking out the capability to default pooled=true (we like default false and that's the only sane value really) [12:55:40] <_joe_> so you can add a script that instead of just pooling, also modifies the weight [12:55:47] but having the weights pre-set per cluster seems nice [12:55:59] <_joe_> it's one confctl command [12:56:04] _joe_: yes, so now we need to store the weights somewhere so the script knows what to set them to per cluster... [12:56:10] if only there were a place that already had those defaults [12:56:37] <_joe_> bblack: I am saying you can call [12:56:42] <_joe_> pool-set-weight 100 [12:56:51] <_joe_> for instance [12:57:48] anyways, it's not worth the drama. I think preventing anyone from setting a default of pooled=true is an improvement. I don't agree that removing the ability to set cluster default weight was a good idea. [12:57:51] <_joe_> but I sense you're really irritated and we're not really doing progress here. Default weights can be reintroduced for cache nodes in a variety of ways, none of which will require a ton of effort, either [12:57:52] we'll deal! [12:58:55] <_joe_> but I'd like it to be a special case [12:59:26] fwiw removing service objects / defaults was a good deal of simplification to the conftool code, is part of the reason why it's gone [12:59:45] but yeah I hear you it's annoying [12:59:55] <_joe_> the "services" tree was just static data that ended up in conftool for no good reason, more or less, because back then I hoped we could use those to dynamically configure pybal [13:00:03] it's also the only place we codified what our preferred defaults were [13:00:25] e.g. that the 'nginx' services defaulted to 1, not 100. the numbers aren't abstract in that case. [13:00:41] (they have to add up to less than 255 for the nginx service cluster or ipvs breaks) [13:02:19] (to be clear: for the hashing director, not the roundrobin you have elsewhere, don't be scared) [13:02:21] <_joe_> ok, the default to 0 means the node is uninitialized, that was the logic. But we can find an easy way to deal with it [13:02:30] <_joe_> bblack: sure I am aware :) [13:03:56] <_joe_> now, would having this in - say - puppet code and having a script called "initialize" or some other name to call instead of "pool"? [13:04:00] <_joe_> would that work? [13:04:24] at the risk of being very naive -- could this be solved with a few lines of shell script that just does confctl select service=varnish-fe,name=$FQDN set/weight=100; confctl select service=nginx,name=$FQDN set/weight=1 [13:05:13] yes, it can be solved a number of ways :) [13:05:25] but it's varnish-fe=1, nginx=1, varnish-be=100, ats-be=100 [13:05:35] nod [13:05:49] which we need to remember now and store somewhere else, whereas it was right there in the defaults hieradata before :P [13:06:31] <_joe_> cdanis: that's exactly what I was proposing heh [13:06:36] 10Traffic, 10MediaWiki-REST-API, 10Operations, 10Parsoid-PHP, and 3 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10mobrovac) [13:06:47] it's just a point of annoyance in the midst of an annoying span of days [13:06:52] yeah I hear that [13:07:14] <_joe_> to write a script that does that and reads the default values somewhere in hiera [13:08:24] did all of that data just go away with the change? [13:08:37] (the per-site/cluster stuff in conftool-data?) 
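A consolidated sketch of the "pool-set-weight"-style helper floated above, using the per-cluster defaults quoted in this conversation (varnish-fe=1, nginx=1, varnish-be=100, ats-be=100). This is not an existing script, just an illustration of the idea; pass only the services that actually exist for the host's cluster.

    #!/bin/bash
    # init-weights: set conftool weights for a new cache node to the cluster
    # defaults. Usage: init-weights <fqdn> <service>...
    #   e.g. init-weights cp3055.esams.wmnet varnish-fe ats-be nginx
    set -euo pipefail
    fqdn="$1"; shift
    declare -A weights=( [varnish-fe]=1 [nginx]=1 [varnish-be]=100 [ats-be]=100 )
    for svc in "$@"; do
        sudo -i confctl select "name=${fqdn},service=${svc}" set/weight="${weights[$svc]}"
    done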
[13:09:08] I'm just now reading through the new stuff for real [13:09:29] <_joe_> yes, it was meaningless for everyone else so we just removed it. I frankly had no idea it was valuable to someone [13:09:52] <_joe_> actually I had to repeatedly explain how it worked to people and almost everyone found it confusing [13:09:55] by looking at the current state in etcd I've noticed that we have two mw hosts with nginx pooled and weight 0 FTR, which should be fine as RR does not use them but is inconsistent with all others [13:10:08] mw2150 and mw2151 [13:10:13] so there's no service objects now, I see [13:10:26] yeah, the service objects were mostly living in etcd for no reason [13:10:43] and adding some special cases to the conftool code itself [13:10:57] but in hieradata I assumed (maybe I was always wrong) they *defined* the set of services and clusters and which DCs they existed in, etc, for e.g. error-checking the per-node entries? [13:11:07] now if I typo a per-service node name we get a brand-new service? [13:11:23] <_joe_> you mean in conftool-data? [13:11:27] yeah [13:11:58] esams: [13:11:58] cache_text: [13:11:58] cp3030.esams.wmnet: [varnish-fe, varnish-be, nginx] [13:12:00] <_joe_> yes, but you did before as well, but you would get a warning [13:12:18] ^ this seems to be the only layer of data that exists now, with the still duplicate entries for all the nodes with matching services, etc [13:13:43] (which is fine, they should be customizable for transitions anyways) [13:14:16] just the data model doesn't sit well with my brain, the idea that services and clusters aren't first-class notions, just arbitrary tags added to arbitrary groups of nodes [13:14:20] maybe it's just me though [13:17:35] the current state is more an artifact of conftool's implementation than anything else, but they weren't really first-class notions before -- it wasn't like the stuff in conftool-data had anything to do with the stuff in hieradata for LVS, for instance, except that the same name had to be used both places [13:17:46] I think in some ideal world that wouldn't be a separate definition [13:18:07] yeah [13:18:36] I guess the illusion was comforting before (even just within conftool-data/ , that there was a definition of service/cluster/dc -level stuff) [13:18:42] and the warnings I guess [13:19:04] the warnings are a bit of a loss, yes [13:24:45] ema: I fixed all the new-esams weights, didn't pool anything [13:25:41] I'll keep all this in mind, there are several things I'd like to change in a hypothetical conftool v2 [13:26:03] <_joe_> cdanis: removing conftool-data would be a start [13:27:22] bblack: ack [13:28:05] bblack: anything else you wanted to check on cp3060 before pooling it? [13:30:08] I don't think so [13:32:15] ema: I'm guessing the upload nodes, we can start pooling in too [13:32:38] but I'm not up to speed on everything with ats transition, etc [13:33:14] I'm gonna repool dns as well, so we have some traffic warming any new caches as they pool in [13:33:20] and we can start pulling out legacy caches as they warm up too [13:34:24] bblack: maybe let's first DNS repool esams and then pool the new caches one at a time to check if things are ok? 
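The per-host workflow being coordinated here then reduces to something like the following (a sketch; 'pool'/'depool' are the host-local helpers mentioned earlier in the discussion, and the confctl 'get' is just to double-check the manually-set weights before sending traffic).

    # New host, once it is in cache::nodes and the weights have been set:
    sudo -i confctl select 'name=cp3060.esams.wmnet' get   # verify weights, still depooled
    sudo pool       # run on the new host to pool it
    # Matching legacy host being replaced:
    sudo depool     # run on the old host before decom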
[13:34:44] yeah that [13:35:59] in upload the old caches outnumber the new, and in text the opposite, so it won't be perfect 1:1 swaps, but whatever [13:36:14] we'll approximately pool one depool one, etc and make up the diffs somewhere at the end [13:37:18] should look at netbox for info on which to depool/decom first too [13:37:29] (so that it's an easy physical set freed up first) [13:38:31] the way the old CP nodes are set up: cp304x are all in one rack, and cp303x are in another [13:39:42] so we should do all our decoms from one set first until it's empty [13:39:56] pp has no pref, so let's start with cp303x for decoms [13:44:30] so we still have quite a few reimages to go through [13:45:14] bblack: I'll proceed with cp3056 [13:47:09] bblack: also, should we maybe just add all new hosts to cache::nodes in one go and ACK the alerts on icinga? [13:47:46] the IPsec alerts, that is [13:50:16] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/lo... [13:54:06] btw.. let me know which hosts are ready to be installed by the end of the day and I'll get to it in the APAC morning :) [13:54:10] I'm about to go into a meeting, but please let me know if more downtime fail, in particular if you get any output that I might not have in the logs [13:55:57] ema: yes, can do that [13:56:06] right, text reimages fail if the host isn't in cache::nodes [13:56:21] yeah [13:58:40] alright, pooling cp3060 while 3056 reimages [13:59:09] 3056 SSD is healthy then? [14:00:10] I think pp said the nvme might be bad [14:00:19] so it may reimagine and come up otherwise, just without be cache storage [14:00:40] *reimage [14:00:52] cp3060 serving prod traffic [14:01:00] \o/ [14:01:34] FYI I'm asking about whether we should just stay depooled (if it speeds up onsite work), in which case the repool plan may change [14:01:52] well I guess it won't change much in practice here for what we're doing right now, but still [14:02:15] (we'd still swap nodes, just without traffic flowing, and have to deal with a cold repool of the site later) [14:02:34] I'm gonna go poke at the other non-cp nodes in a bit [14:02:49] (get lvs3006 into primary service, spin up the ganeti node, etc) [14:09:07] bblack: sanity check please? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545857 [14:11:41] XioNoX: knams issue known/expected? [14:12:31] ema: +1... I think it will also affect e.g. cache director lists, but thats ok too as long as the others stay depooled/inactive in confctl [14:13:38] XioNoX: yeah the knams thing - we did repool earlierm and we do have isolated user complaints of blips of reachability to esams, I assume it's the knams links causing that... [14:16:09] mmh cp3056 isn't rebooting into the installer. Trying a power-cycle [14:16:37] ema: 3056 is the one pp said had an orange light on the panel and reporting some kind of SSD (or nvme?) failure [14:16:44] we may have to just skip it for now [14:18:32] ack, trying 3058 then [14:19:11] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/lo... 
[14:19:46] sweet
[14:19:54] cp3056 is stuck at "Booting from Hard drive C:"
[14:20:08] sounds like no PXE
[14:24:11] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 16: cp3061: xe-6/0/15, cp3062: xe-6/0/16, cp3063: xe-6/0/17, cp3064: xe-6/0/18, cp3065: xe-6/0/19
[14:30:08] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3007 switch information: xe-6/0/12
[14:31:32] 3058 is installing, happy times
[14:32:32] on the ticket updates pp is making above, I still need to do the switch side for those too
[14:32:52] and basic dns, and macaddrs
[14:33:28] will finish LVS juggling first though :)
[14:39:16] mmh it looks like we're missing 3062 and 3064's mgmt IPs
[14:39:34] are any of the new set mgmt in dns yet?
[14:40:26] yeah, all the ones we've reimaged so far (wmf-auto-reimage does not work otherwise)
[14:40:35] well yeah
[14:40:57] there were only 9 hosts installed yesterday (cp3055-3060, dns3002, ganeti3002, lvs3006)
[14:41:05] pp is just now finishing up the next batch physically
[14:41:20] probably hasn't committed e.g. mgmt dns yet
[14:42:13] ah
[14:42:21] the next batch will be cp3061-5, lvs3007, bast3004, and ganeti3003
[14:42:41] then the final batch to come will be cp3050-54, lvs3005, dns3001, ganeti3001
[14:54:53] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3058.esams.wmnet'] ` and were **ALL** successful.
[14:58:24] bblack: looking for knams
[15:00:21] bblack: 3058 is up and running, all hosts in the cp3055-3060 range are reimaged
[15:00:29] except for 3056 (SSD issues)
[15:02:48] ema: can you start pooling them (well, the 5 we have) into clusters and pooling out legacy ones as you go? we'll probably have to be a little more aggressive than ideal, so long as nothing falls apart from miss-load
[15:05:52] lvs3006 is now primary for esams upload (etc), 3002/4 are both non-primaries for it. will wait to decom any of these until we get at least 3007 online.
[15:07:59] bblack: sure
[15:11:20] XioNoX: dns3002 - seems to be configured on the router side and seeing some routes to it in the list for 10.3.0.1, but on dns3002 itself I'm not seeing any queries arrive for 10.3.0.1 at all
[15:11:25] so something's not quite right, maybe
[15:12:06] XioNoX: nevermind, cancel that comment, eventually some traffic showed up
[15:12:11] edges don't spam recdns like the core sites do :)
[15:22:17] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH) p:05Triage→03Normal
[15:22:33] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH)
[16:24:24] text nodes pooled: cp3058 and cp3060, depooled cp3030 and cp3032. Hitrate just fine
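For the dns3002 check above (no queries seen for 10.3.0.1 at first), a quick way to watch for recursive-DNS traffic landing on the anycast address; the interface name is an assumption based on the NIC naming mentioned later in the log:

    # On dns3002: watch for client queries to the recdns anycast address:
    sudo tcpdump -ni enp175s0f0 -c 20 'dst host 10.3.0.1 and udp port 53'
    # And confirm the resolver actually answers on that address locally:
    dig +short @10.3.0.1 en.wikipedia.org A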
[16:24:57] upload is affected by (de)pools more significantly, I'm going slower there
[16:25:18] here's the effect of upload (de)pools on swift: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=10&fullscreen&orgId=1&from=1571928114204&to=1571934127114
[16:29:18] upload nodes pooled: cp3055 and cp3057, depooled cp3034 and cp3035
[16:30:23] I'm waiting for the upload hitrate to recover a bit and then I'll pool cp3059, depool cp3035 and call it a day
[16:31:06] err, s/depool cp3035/depool cp3036/ ^
[16:55:04] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3007.esams.wmnet'] ` The log can be found in `/var/log/wmf-au...
[17:27:45] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] ` and were **ALL** successful.
[17:35:06] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) merging in duplicate ticket T236409 where i started OS install
[17:35:32] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[17:36:38] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[17:37:25] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) confirmed mgmt and production DNS exists, mgmt password is set, IPMI over LAN working.. started OS install
[17:39:57] ema: while you're still here, what's the current-best magic for managing the reimage/post-reimage stuff on cps?
[17:40:10] (I imagine you have some commandline that disables service X and runs puppet 3 times or whatever)
[17:40:47] bblack: it should just be a matter of running puppet a couple of times now!
[17:41:07] no further magic involved
[17:41:26] alright, I've pooled 3055, 57, 58, 59 and 60
[17:41:35] depooled 3030, 32, 34, 35, and 36
[17:42:26] I've gotta go afk, cya!
[17:43:05] ok cool
[17:58:12] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ` [bast3004:~] $ gen_fingerprints +---------+---------+-----------------------------------------------------+ | Cipher | Algo | Fingerprint |...
[18:00:16] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[18:23:02] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[18:23:15] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3005.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191024182...
[18:24:20] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns3001.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reimage/2...
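The "just run puppet a couple of times" step above, sketched as it might be driven from a cumin host; running it remotely via cumin rather than logging into the node is an assumption here:

    # Two post-reimage puppet passes on a freshly installed node:
    sudo cumin 'cp3058.esams.wmnet' 'puppet agent -t'
    sudo cumin 'cp3058.esams.wmnet' 'puppet agent -t'
    # Note: 'puppet agent -t' exits 2 when it applied changes, so cumin may
    # flag the run as failed even though it succeeded.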
[18:36:00] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3061.esams.wmnet', 'cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3064.e...
[18:36:02] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3053.e...
[18:52:45] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] ` and were **ALL** successful.
[18:55:56] 10Traffic, 10Operations, 10ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[19:30:06] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3001.wikimedia.org'] ` and were **ALL** successful.
[19:36:16] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3061.esams.wmnet', 'cp3064.esams.wmnet', 'cp3065.esams.wmnet'] ` and were **A...
[19:50:59] bblack: I checked asw2-esams and couldn't see any issue with the clock or NTP servers. Could you double check when you have some time?
[19:51:07] maybe we're not looking at the same things
[19:51:28] or it took the box a long time to get a proper "fix"
[19:53:39] yeah will look again
[19:53:50] cp3054 is having the same charon CPU stuck problem we saw on another host earlier
[19:53:56] but it seems to be persistent in this case!
[19:59:20] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[19:59:25] 10Traffic, 10DNS, 10Operations, 10ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Dzahn)
[20:04:05] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3053.esams.wmnet', 'cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3054.esams.wmnet'] ` and were **A...
[20:07:51] XioNoX: I assume the dead BFD between cr2-esams and cr2-knams is a known issue too
[20:08:32] bblack: yeah, ospf is up, still trying to figure out what's going on with bfd
[20:08:35] ok
[20:09:01] and yeah NTP on asw2-esams looks fine now
[20:11:59] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ^ made active bastion host with the global firewall change above created wikitech pages https://wikitech.wikimedia.org/wiki/Bast3004 https://wikitech.wikimedia.org/w...
[20:12:27] cool!
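For the stuck-charon symptom on cp3054 noted above, a reasonable first look might be something like this (charon is strongSwan's IKE daemon; the systemd unit name is an assumption based on stock Debian packaging):

    # Is charon actually burning CPU?
    top -b -n 1 | grep -i charon
    # strongSwan's own view of the tunnels / security associations:
    sudo ipsec statusall | head -n 40
    # Unit state; restarting is a last resort since it drops the IPsec SAs:
    sudo systemctl status strongswan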
[20:12:49] next up (after a short break) I'm gonna finish re-arranging all the esams LVS so that 3001-4 can be decommed
[20:13:03] I've already checked the router side for the new ones, and will clean up the old there as I go
[20:13:13] 3006 is already live as a primary for high-traffic2
[20:13:41] and ditto for cleaning up and decomming the old recdns boxes, etc
[20:13:58] although may have to edit ntp settings on network gear in esams first
[20:14:24] and then there's all the cp3 depool/repool stuff to go
[20:15:30] you can now edit your ssh config and replace bast3002 with bast3004. next will be to make it the install_server in DHCP and test if installing from it works
[20:17:20] also: changing smokeping target (bast3002->bast3004)
[20:17:53] oh yeah
[20:18:01] I guess I can re-reimage one of the ganetis to test it
[20:19:29] we have to rsync all the data first
[20:19:39] will upload a change for that
[20:20:21] well, "all the data": tftpboot = 1.8G, prometheus = 36G
[20:20:44] prometheus data will have to go to a new VM
[20:21:23] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul)
[20:22:52] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Papaul)
[20:24:10] 10Traffic, 10Operations, 10ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul)
[20:24:45] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) a:05Papaul→03Dzahn
[20:25:04] 10Traffic, 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) service "bastion host" is ready but service "tftp" still needs to be migrated. taking it.
[20:25:59] mutante: why a new VM?
[20:26:22] bblack: apparently prometheus will move away from bastion to a VM
[20:26:54] well, eventually
[20:26:55] https://phabricator.wikimedia.org/T236329#5601691
[20:27:01] we don't have VMs in esams yet though
[20:27:10] i got that from this.. was wondering if i need to copy data from bast to bast
[20:27:20] yes I think bast to bast for now
[20:27:24] so the decom bast3002 task needs to wait
[20:27:36] or we need to copy it to bast3004 as a temp location
[20:27:44] ^ that
[20:27:54] the timeline for having ganeti ready in esams is indefinite at this point
[20:28:22] (and we're not using ganeti at ulsfo or eqsin either, I believe we copied data bast->bast when ulsfo bast was replaced)
[20:30:46] bblack: quick question, are all the server-side ports named the same? enp175s0f0? (plus the sub-interface for the LVS?)
[20:31:44] I'm not sure yet for every machine, but can check
[20:34:08] yea, i copied bast->bast in the past afair
[20:34:10] XioNoX: the new cp machines have enp59s0f0, and all the other new machines have enp175s0f0
[20:34:18] ok, thx!
[20:34:36] it's a fancy world here in 2019. machines with 175 PCIe busses
[20:35:11] well I guess p is port not bus, but whatever
[20:35:18] this machine doesn't have 175 anythings :P
[20:37:10] packet_write_wait: Connection to 208.80.153.54 port 22: Broken pipe
[20:37:27] been getting an abnormal amount of those today while working on esams, but donno if it's esams or my home stuff :P
[20:38:22] keep a mtr running in the background maybe?
[20:38:40] hmm.. few people besides you use the codfw bastion i think
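Since a background mtr was just suggested for the broken-pipe issue, a sketch of a report-mode run against the IP from the error message (standard mtr options; whether the loss is at home or near esams is exactly what this would help show):

    # 100-cycle report toward the host the ssh sessions keep dropping to:
    mtr -rwbzc 100 208.80.153.54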
[20:39:38] yeah could be that too
[20:39:43] maybe worth trying to use bast3004 for esams now
[20:41:10] restoring abandoned bastionhost::migration role from the depths of Gerrit
[21:46:43] 10Traffic, 10Operations, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green), 10Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (10WDoranWMF) @BBlack @ema would you have any time to review Petr's patch above?
[21:51:16] cache clusters status:
[21:51:37] upload: still needs pooling: 61, 63, 65; still needs depooling: 45, 46, 47, 49
[21:53:12] text: still needs pooling: 52, 62, 64; still needs depooling: 41, 42, 43 (and early hw fail on 56, explicitly depooled and not installed yet)
[21:53:44] but hitrate is being slow to recover from the recent ones, will circle back to more movement here in some hours from now (esams will be lower-load then too)
[21:54:28] lvs status:
[21:54:42] lvs300[567] are the new lvs cluster doing all the things, lvs3001-4 are being reimaged to spare now
[21:55:07] dns: dns300[12] are active and participating, but so are the old hosts too, need to clean those up later this evening (incl on the router/switch side)
[21:55:33] running out to deal with life, be back in a while
[21:58:31] wouldn't it be quicker to just decom the old ones instead of reimaging to spare?
[21:58:52] I'm assuming they will be decomm'ed, but I might be wrong :)
[21:59:36] they will be tomorrow, but I don't even have a decom ticket for them and there's a process to it. the reimage to spare is just to make sure they can't run pybal and interfere with routing accidentally or whatever.
[22:00:38] yeah the process has been changed and simplified a lot recently
[22:00:48] the decom cookbook takes care of making them unbootable
[22:00:56] and powering them down
[22:01:42] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission
[22:01:50] and https://wikitech.wikimedia.org/wiki/Decom_script
[22:04:13] you can go to https://phabricator.wikimedia.org/project/profile/3364/ and click "File Decommision Request" on the left. just fill in the FQDN
[22:04:24] creates a ticket from a template
[22:05:37] but also existing decom tickets linked to https://phabricator.wikimedia.org/T235805
[22:06:35] like https://phabricator.wikimedia.org/T87790 et al
[22:07:36] yeah we have them for the truly-old hosts, just haven't made them for the recently-in-use ones
[22:07:43] anyways, I'll look at decom-like things later :)
[22:07:45] if they are in the decom column on https://phabricator.wikimedia.org/project/board/951/ they should be seen by papaul in the morning
[22:08:02] he said tomorrow is decom day
[22:08:25] I'm obviously way behind on recent process changes (and netbox too), but my focus has been on getting the functional things done before we're out of time in ams
[22:09:07] yea, makes sense! and it was all very fast
[22:10:10] so you say lvs3001-3004 are going to spare.. i can make that ticket
[22:10:53] I was saying this because running the decom cookbooks takes just a minute compared to a reimage
[22:14:21] (and takes care of netbox too :-P )
[22:15:54] anyway, time to bed for me. ttyl
[22:38:28] lots of netbox errors, im cleaning them up
[22:38:38] report errors due to bad state for where systems are, etc...
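The decom cookbook mentioned above is run from a cumin host; the cookbook name and flags below are assumptions and not confirmed by this log, so listing the available cookbooks first is the safer move:

    # Find the real decommission cookbook name before trusting this sketch:
    sudo cookbook -l | grep -i decom
    # Hypothetical invocation (TXXXXXX stands in for the per-host decom task):
    sudo cookbook sre.hosts.decommission lvs3001.esams.wmnet -t TXXXXXX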
[23:35:24] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle)