[01:43:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:40] moritzm: not sure if you've seen the discussion above?
[09:31:05] The TL;DR is we want to do a port-mirror on the switch in codfw rack B7 to aid with the super-micro boot testing
[09:31:30] we need to connect that mirror port to a host to do a tcpdump
[09:32:11] we can try to get another sretest* host in place for that, but there was also a suggestion to maybe use an existing host, cabling its second interface temporarily to do the capture. There are a few ganeti hosts in the rack I thought perhaps we could use, not sure if that is realistic?
[09:35:41] ah, I missed that. let me read up now
[09:40:09] I'll drain ganeti1043
[09:40:46] is that a test which will take a few days? then I'd also temporarily remove 1043 from the cluster to prevent someone from accidentally creating a new VM there
[10:04:14] moritzm: sry got dragged away for a few mins. The host we're talking about is sretest2001, so it's in codfw rack B7.
[10:04:37] Ganeti hosts in the same rack are ganeti2032, ganeti2033 and ganeti2049
[10:05:04] ah, wrong B7 then :-)
[10:05:04] we can probably do all the tests we need in an hour or two, I think we should probably be ok just draining the one
[10:05:24] and tbh just being a bit cautious to drain, likely it would have zero effect anyway
[10:05:43] but yep if there is one that'd be great
[10:05:48] ok, let me find which host in B7 would work best
[10:05:53] thanks <3
[10:17:12] ganeti2033 is drained, it's a node in the initial routed ganeti cluster, so no one will accidentally create a VM there, as such we can simply keep it part of the cluster for the experiments
[10:21:29] ok great thanks for that... I'll talk to pa paul later about getting the second port connected and run through the tests with jessie
[10:25:12] cool, no rush on ganeti2033 given it's just the routed ganeti test cluster, when yoi
[10:25:28] you're done just ping me and I'll rebalance it
[10:26:45] cool thanks
[12:36:35] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11197244 (cmooney)
[13:21:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[13:46:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[14:06:04] topranks: thanks for working on the port mirroring, I saw the message in dcops, next week is fine
[14:28:10] jhathaway: cool
[14:28:58] moritzm: is that ok to leave that host drained until next week? dc-ops can’t assist till then
[14:30:02] of course, no problem at all
[14:30:27] this is only in the routed ganeti cluster in codfw, which we mostly used to ramp up the routed Ganeti stack
[14:30:55] aside from test VMs it's just netflow/codfw and idp-test running there
[14:31:11] and those can easily run on the other VM for now
[14:37:02] ok cool
[14:55:34] FIRING: DiskSpace: Disk space serpens:9100:/ 6.457% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:26:56] jhathaway: I'm just about to kick off a test reimage for sretest2009, and I notice it's a supermicro
[15:27:07] is there anything special I need to do or flag to set when running the cookbook?
[15:28:02] topranks: no, it should just work, fingers crossed
[15:28:07] let me know if it doesn't
[15:28:20] ok cool will do
[15:28:35] it's on one of the Nokias so it probably won't work, but that likely won't be because of anything on the host
[15:28:38] thanks
[15:30:34] FIRING: DiskSpace: Disk space serpens:9100:/ 3.618% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:40:38] topranks: for the Nokias you want to pass `--no82`
[15:40:56] yep that's what I'm gonna try alright
[15:40:59] thanks
[15:41:01] cool
[15:42:06] jhathaway: you might actually be able to point me in the right direction on something
[15:43:06] the vlan in this new rack is not configured on the install server in codfw. I think because either the vlan or subnet is not synced to hiera from netbox (or perhaps I need to manually add them)
[15:43:10] I'm looking at /modules/install_server/manifests/dhcp_server.pp
[15:43:36] trying to work out where the data for $datacenters_dhcp_config comes from
[15:43:42] hmm, let me look
[15:44:04] modules/install_server/types/datacenter_dhcp/config.pp
[15:44:08] modules/install_server/types/audience_dhcp/config.pp
[15:44:25] ^^ are also part of it but that's as far as I got so far
[15:45:50] topranks: I think in modules/profile/manifests/installserver/dhcp.pp
[15:46:23] I suspect it's building it from modules/network/data/data.yaml actually
[15:46:30] rather than synced data from netbox
[15:46:39] which is fine I can add them in there
[15:46:41] yep
[15:46:49] the profile dhcp.pp references network::constants
[15:46:54] class network::constants {
[15:46:56] $module_path = get_module_path($module_name)
[15:46:58] $network_data = loadyaml("${module_path}/data/data.yaml")
[15:47:16] yeah, that looks right
[15:47:21] cool thanks guys
[15:47:35] I knew we improved this; it used to have to be added separately for the install server
[15:47:49] thought it was direct from netbox. no worries I'll add the new networks in there now
[15:47:50] cheers
[15:49:56] I'm not sure if there's a good reason for why it isn't direct from netbox
[16:02:31] some of the data in that file I think is not 1:1 syncable from netbox. But these ones ought to be, I guess we just need to add it to the hiera export
[16:02:49] for now I just kicked the can down the road
[16:02:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1189898
[16:06:52] thanks for the +1 jesse
[16:19:19] I don't effing believe that... it's PXEbooting on first try with the nokia + supermicro combo. Definitely did not expect that to go so well.
[16:19:25] these supermicros are great :D
[16:19:27] ahahaha
[16:24:34] :D
[16:28:43] I take it all back... the pxeboot/installer failed (didn't catch why; turned back and it had booted to the old OS), and now on a second go it's not tried to DHCP
[16:28:47] instant karma
[16:40:35] SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804#11197934 (Tacsipacsi) Invalid→Declined Actually, it was a valid request at the time, it just became unnecessary since then.
[16:42:09] jhathaway: sorry to bug you, but you might know
[16:42:25] this host gets an IP from DHCP fine, then I see this screen
[16:42:27] https://usercontent.irccloud-cdn.com/file/0IbYXt1h/image.png
[16:43:07] that is the MAC of the NIC port the switch is connected to.... however no packets hit apt1001 from the IP of the host
[16:43:17] DHCP is returning a URL with the IP so DNS shouldn't be a factor
[16:43:35] after some time like that it boots from the hard disk
[16:46:20] hmm not sure, I wonder if it is provisioned correctly, namely that UEFI http is enabled, I can check, or re-run the provision script
[16:48:26] yeah sure we can re-try the provision script
[16:48:49] if you have time. tbh the fact the DHCP works is the main bit I wanted to test
[16:48:58] *mission accomplished*
[16:49:23] I'll re-run the provision script and see if it shows anything...
[16:49:29] cool, thanks
[16:49:50] if not I can log on as root from the console and change /etc/network/interfaces file to get it online and do my other tests
[16:49:56] it has a working OS install it seems
[16:53:36] jhathaway: my reimage is still running, I guess I should abort that for now?
[16:53:57] oops, yeah
[17:00:38] you go through those BIOS options with what I assume must be hard-won familiarity jesse :D
[17:01:14] :)
[17:01:34] it looks correct now, give it another try topranks
[17:02:26] ok let's see how it goes
[17:04:36] epic fail
[17:05:05] for some reason didn't try network boot that time, went to hd right away
[17:05:14] oh wait
[17:05:22] yeah, looks like it only tried one nic
[17:05:36] it's rebooting again now
[17:05:54] I rebooted it
[17:07:17] 75:CE is the mac of the NIC that's connected to switch
[17:08:17] ok
[17:10:44] I think we should see four lan interfaces in the PCI bios menu, but we only see two
[17:10:56] ok
[17:11:42] look we can just leave it, unless you want to get to the bottom of it?
[17:12:02] I can change the IP manually and get it online for my other tests
[17:13:19] I'm happy to look later, if you want to finish your work topranks
[17:14:05] give me a few mins I'll change the IP on it so the current install is on the network with right details
[17:14:20] I'll probably leave it for the day after that, and pick up again on Monday
[17:14:40] I don't think the Nokia switch is causing any problem, DHCP relay is working. It won't do option 82 as you know but that part is not causing an issue.
[17:14:51] I've verified it can talk to the apt server etc. also
[17:17:02] sounds good
[17:20:16] ok that's done
[17:20:31] one thing I did realise is I didn't have the IPv6 router-advertisements set up properly on the Nokia switch
[17:20:48] I've added that config manually now. I don't expect that would interfere with any of the boot stuff but maybe?
[17:20:59] I wouldn't think so either
[17:21:09] we'd be into d-i before it should matter yeah
[17:21:25] anyway I can ssh to the existing OS now so I'll be able to do some tests
[17:21:32] great
[17:21:54] if you relish the thought of more supermicro debugging feel free to try a reimage any time :)
[17:22:26] :) will do
[17:23:42] actually, going back to the tab where I had the apt1001 tcpdump running, that host did connect after it DHCP'd
[17:23:44] https://phabricator.wikimedia.org/P83444
[17:24:23] looks like two GETs. No HEAD.
[17:25:00] actually never mind that.... I think that's now that I manually set the IP
[17:25:16] the existing OS is probably running apt-update or something
[17:25:21] cos it's happening right now
[17:25:35] icinga config is complaining about a nonexistent "lsw1-e2-codfw" host group for sretest2009, which seems potentially related to whatever you are doing?
[17:27:19] taavi: yeah thanks.... the switch isn't added in Icinga, but it must be trying to make it the parent for the host..
[17:28:24] I guess I need to run the decom cookbook for it or something
[18:04:15] jhathaway: this one is truly the gift that keeps giving
[18:04:40] the fact sretest2009 is in puppetdb is causing a problem, as Icinga can't find the parent switch
[18:05:01] I can't add the parent switch in monitoring as we don't have all the checks for Nokia added there, if I do it'll run the Juniper checks and fail
[18:05:24] so I ran the decommission cookbook, which has powered the host down, deleted the IPs in netbox, set state to decom there
[18:05:36] but somehow it's still in puppetdb? or at least I can see it in puppetboard
[18:19:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[18:25:48] ok running "sudo puppet node deactivate sretest2009.codfw.wmnet" on the puppetmaster seemed to do the trick
[18:36:52] sorry topranks, dipped out for lunch, glad you figured it out
[19:30:49] FIRING: DiskSpace: Disk space serpens:9100:/ 2.489% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:49:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[20:02:55] FIRING: MaxConntrack: Max conntrack at 81.24% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:07:55] RESOLVED: MaxConntrack: Max conntrack at 81.24% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:15:34] FIRING: [2x] DiskSpace: Disk space serpens:9100:/ 2.443% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:40:34] FIRING: DiskSpace: Disk space serpens:9100:/ 3.646% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:25:16] ^ I truncated some of the logs on serpens to prevent it from running out of disk, will have a closer look on Monday
[21:25:34] RESOLVED: [2x] DiskSpace: Disk space serpens:9100:/ 3.641% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
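
Note on the serpens DiskSpace alerts above: a minimal sketch of how a "percent free" figure like the ones in those alerts can be computed, and how one might look for the largest files before truncating anything. This is not the actual alert rule or what was run on serpens. The helper names `percent_free` and `largest_files`, the 5% threshold, and the `/var/log` path are all assumptions made for illustration.

```python
import heapq
import os
import shutil


def percent_free(path: str) -> float:
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.free / usage.total


def largest_files(root: str, count: int = 10) -> list[tuple[int, str]]:
    """Return up to `count` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are measured themselves, not followed
                sizes.append((os.lstat(full).st_size, full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return heapq.nlargest(count, sizes)


if __name__ == "__main__":
    THRESHOLD = 5.0  # hypothetical threshold, not the real alert value
    free = percent_free("/")
    print(f"/ has {free:.3f}% free")
    if free < THRESHOLD:
        # hypothetical log location; adjust to wherever the logs actually live
        for size, path in largest_files("/var/log"):
            print(f"{size:>12}  {path}")
```

Listing candidates before truncating keeps the destructive step manual, which fits the "will have a closer look on Monday" approach in the log.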