[06:49:58] gilles: https://gerrit.wikimedia.org/r/#/c/429843/ can we move this forward now that the python3-logstash issue has been solved?
[06:55:57] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4177572 (Vgutierrez)
[07:17:14] vgutierrez: sure, I'll move those changes to the prod patch today
[07:19:16] awesome
[07:19:41] then I can adopt your changes in my varnishtlsinspector :)
[08:04:56] nice
[08:05:37] vgutierrez: you seem to have some long running screen sessions on cp[34]030
[08:05:49] not sure if known or leftovers :)
[08:07:55] fixed, thx
[08:11:20] I highly suspect that there is something missing on lvs1016 setup regarding DHCP/PXE boot
[08:12:29] maybe something related to DHCP relays on the network devices
[08:15:40] vgutierrez: what error did you get?
[08:15:52] no error
[08:16:11] and no DHCP traffic (udp port 67) on install[1|2]002
[08:16:42] let me check a couple of things
[08:17:07] I just manually set boot @ pxe and power cycled the server
[08:19:32] and it started in PXE mode and timed out/failed?
[08:20:33] it doesn't fail explicitly
[08:21:13] but does it try PXE? or just boot normally?
[08:21:36] doesn't boot at all, it's a brand new server
[08:22:32] ok, given the custom network config for lvses, could the issue be on that side?
[08:22:57] netboot and dhcp config on puppet looks ok to me
[08:25:50] BIOS states that it is attempting a PXE boot
[08:30:42] <_joe_> did someone set up the switch ports for that machine?
[08:30:51] <_joe_> that's usually the reason why dhcp doesn't work
[08:32:59] looks like network to me too
[08:40:12] apparently the port is up: https://librenms.wikimedia.org/device/device=149/tab=port/port=15354/view=graphs/
[08:40:56] according to https://phabricator.wikimedia.org/T184293#4137521 lvs1016 "eth0" should be connected to asw2-d:xe-7/0/15
[09:44:04] and capturing traffic on asw2-d I cannot see the DHCP requests.. so I raised the white flag and pinged XioNoX
[09:44:30] lvs1016: 1 - vgutierrez: 0 :(
[13:24:31] gilles: I've just pushed a version of varnishtlsinspector based on your refactor: https://gerrit.wikimedia.org/r/#/c/430593/
[13:25:44] I added a new method to BaseVarnishLogConsumer, handle_tag(tag, value) to be able to handle custom tags (VCL_Log in my case)
[13:26:32] any feedback will be appreciated and welcomed :)
[14:52:50] gilles: refactor merged, it works! :)
[15:18:21] volans: I'm getting this traceback on some cp nodes, https://phabricator.wikimedia.org/P7076
[15:18:36] ie cp3030 it's crashing but in cp3007 it's working
[15:19:11] jessie vs stretch?
[15:19:20] or same python version?
[15:20:08] (26) cp[3004-3008,3010,3030-3049].esams.wmnet
[15:20:08] ----- OUTPUT of 'python3 --version' -----
[15:20:09] Python 3.4.2
[15:20:41] jessie on all cp nodes AFAIK
[15:20:45] looks like readline() returns bytes so either use b'\n' or convert to utf-8 I guess
[15:20:57] checking
[15:21:28] hmm why aren't varnishospital or varnishslowlog affected?
[15:21:35] 'A bytes sequence, or a string if run() was called with an encoding or errors.'
[15:21:47] did you test them?
[15:24:14] still they're alive on cp3030
[15:24:23] s/still/at least/g
[15:25:23] * volans wondering if it works if there are only ascii chars involved
[15:25:48] no way..
[15:26:01] (╯°□°)╯︵ ┻━┻
[15:26:47] the other option is to pass universal_newlines=True
[15:27:04] that returns a string (and should be compatible back to 3.4)
[15:29:59] vgutierrez: any difference in LANG, LC_ALL, PYTHONIOENCODING ?
[15:30:10] * volans throwing random ideas
[15:32:49] well.. with my ssh session, varnishslowlog doesn't crash and varnishtlsinspector crashes
[15:33:02] so LANG, LC_ALL are the same...
[15:36:15] oh, and "dpkg -l | grep python3" output is the same in cp3007 and cp3030
[15:37:23] yeah, passing universal_newlines=True to Popen does fix the problem (tested on cp3030:~ema)
[15:38:10] funny because I did run varnishslowlog by hand after merging the refactor and it did produce useful output w/o crashing
[15:38:22] *sigh*
[15:38:24] :)
[15:38:50] let's patch it then
[15:38:54] +1
[15:41:31] +1 sorry for not having asked about encoding, I thought it was tested with py3 too
[15:43:59] https://gerrit.wikimedia.org/r/#/c/430610/
[15:44:02] this should do the trick right?
[15:45:38] yep it should
[15:45:50] yeah but what's the diff on the machines? just whether the headers coming into them commonly have non-ascii chars?
[15:45:57] (based on cluster?)
[15:46:26] right
[15:46:34] diff between 3007 and 3030 is that 3007 is misc and has no matching varnishncsa output
[15:46:54] vgutierrez: could it be that in your working case there was just one line in output?
[15:47:20] diff between wherever I've successfully tested slowlog and where it fails must be encoding?
[15:47:53] volans: I exclude that, I've seen multiple lines of output
[15:48:25] hmmm of course changing the library doesn't trigger the service refresh
[15:48:41] you need to add that in puppet ;)
[15:48:51] it should, yeah
[15:49:25] you subscribed only to the script, not the library :D
[15:49:57] also we don't have icinga checks that scream if the daemons crash
[15:50:40] yup.. same for varnishslowlog and varnishospital
[15:51:15] ema: it should, the one for generic unit failure
[15:51:37] of course systemd is pretty stubborn and it was trying to activate the unit again and again...
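Editor's note: a minimal sketch of the bytes-versus-str behaviour discussed above, not the actual varnishtlsinspector code. The traceback in P7076 is not reproduced in this log, so the failure shown is only an assumed illustration of the general problem. The helper `read_lines` and the child command are hypothetical stand-ins for a varnishlog-style subprocess. It shows that `Popen.stdout.readline()` yields bytes by default and str once `universal_newlines=True` is passed, which is the fix tested on cp3030.

```python
import subprocess
import sys

# Hypothetical stand-in for a varnishlog-style child process: prints one line.
CHILD = [sys.executable, "-c", "print('VCL_Log tls:AES128-SHA')"]


def read_lines(cmd, text_mode):
    """Collect stdout lines from cmd via Popen.stdout.readline()."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        universal_newlines=text_mode,  # False: bytes lines, True: str lines
    )
    sentinel = "" if text_mode else b""  # readline() returns this at EOF
    lines = list(iter(proc.stdout.readline, sentinel))
    proc.stdout.close()
    proc.wait()
    return lines


raw_lines = read_lines(CHILD, text_mode=False)
text_lines = read_lines(CHILD, text_mode=True)

print(type(raw_lines[0]).__name__)   # type of a line without text mode
print(type(text_lines[0]).__name__)  # type of a line with text mode

# Mixing the two types is what fails: bytes methods reject str arguments.
try:
    raw_lines[0].rstrip("\n")
except TypeError as exc:
    print("TypeError:", exc)
```

`universal_newlines=True` is available on Python 3.4, which matches the version reported on the cp nodes; the `encoding` and `errors` arguments quoted from the documentation above were only added to Popen in later Python 3 releases.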
[15:53:19] ok: re travel, destination is prague apparently
[15:54:08] and we're getting our approvals in by tomorrow
[15:54:10] well then I might consider coming by train :)
[15:54:33] XioNoX: can work locally with faidon in sfo on timing out their trips (and with travel on the france timing complications, etc)
[15:55:43] but in general, we'll all need to get travel plans done + approved by tomorrow sometime
[15:56:15] I'm probably the complicating factor, my flight is harder to find and will take longer and block out more time from e.g. ema
[15:56:17] noted!
[15:56:34] karen is the person handling our trip, btw
[15:56:38] XioNoX: ^
[15:56:52] 4 hours train travel time vs airport bullshit (also I can be online on the train thanks to the wonders of LTE)
[15:57:18] oh right, good thinking. does that seem pretty definite? because then I don't even have to worry about me+you overlap in the air on my booking
[15:58:08] I don't know for sure if our travel dept is equipped to do the train booking for you, it may have to be expensed. karen would know for sure, can ask in travel req.
[15:58:10] https://www.google.com/flights/#flt=IAH.PRG.2018-06-16*PRG.IAH.2018-06-23;c:EUR;e:1;sd:1;t:f
[15:58:17] poor bblack
[15:58:51] eh those aren't that bad. just don't put me on the one with a layover in CDG :P
[15:59:00] hahahha
[15:59:28] wonderful world.. BCN-->PRG the expensive one is from Ryanair
[15:59:45] hey, Paris -> Prague by train is only 13h!
[16:00:18] lovely
[16:00:27] bblack: yes if taking the train is fine for travel I'd definitely prefer that to flying
[16:00:52] ema: I'm sure it's fine in general, I just don't know if, unlike flying, they might not book+pay it for you, and you'd have to expense it instead.
[16:01:02] bblack: so I don't care about dates, feel free to adjust them to suit the availability requirements
[16:01:11] that's fine! 20 bucks
[16:01:14] nice!
[16:02:36] vgutierrez: is now a good time to look at lvs1016?
[16:02:42] XioNoX: sure
[16:02:51] I can learn a few things :D
[16:03:08] my debugging so far has been futile
[16:03:24] so please... enlighten me
[16:03:45] vgutierrez: can you turn the host on if it's not already? (aka actively sending dhcp requests)
[16:04:19] hmmm
[16:04:28] I can force a power cycle with PXE boot
[16:04:35] so we should have some DHCP requests
[16:04:44] cool
[16:05:17] done
[16:05:28] and do you have the mac address of the port that should send those requests?
[16:05:34] errr
[16:05:44] yes and no :)
[16:06:10] F4:E9:D4:DB:25:40
[16:06:10] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[16:06:14] it should be that one
[16:06:22] but I couldn't confirm it
[16:06:34] 1st useful command: `show interfaces xe-7/0/15` check if the link is up, last flapped, and traffic
[16:06:44] link it's up
[16:06:48] only outgoing traffic
[16:06:57] (I've checked that this morning)
[16:07:05] also monitor interface xe-7/0/15 is useful
[16:07:27] at least the link flaps when I hit the power cycle on lvs1016
[16:07:33] 2nd useful command `show ethernet-switching table interface xe-7/0/15` it should display the host's mac and its vlan as soon as the switch sees any kind of traffic
[16:07:48] XioNoX: but it doesn't show anything :(
[16:07:57] I checked that already :_(
[16:08:25] I also checked the mac-learning log
[16:08:34] but it came empty as well
[16:08:44] * vgutierrez tried before asking
[16:09:04] so 1/ we need to make sure dhcp queries are coming out of the good interface, as I think there are 4 ports?
[16:09:19] right, 4 interfaces
[16:09:25] maybe we have our ideas about the 4x ports wrong? the primary interface the host is using for initial boot/install is different than what chris thought was "eth0"?
[16:09:58] XioNoX: can we sniff the 4 ports? O:)
[16:10:05] also, sometimes PXE isn't enabled correctly for the interface in the card setup. I think it's Ctrl+S for broadcom setup on the card during boot to go into that firmware settings menu?
[16:10:38] (and then pick the first one on the list)
[16:10:41] hmmm
[16:10:58] vgutierrez: yeah, we can
[16:11:21] so.. we could check and sniff the 4 interfaces or check the BIOS like bblack is suggesting
[16:11:26] I mean, look if there are any outbound packets on any of the 4 interfaces
[16:12:03] right, if it's just confusion on which of the 4x ports, the traffic is showing up on another switch somewhere (where there's no default vlan to send it to)
[16:12:17] and we can fix that by having chris swap the cables around on the host side.
[16:12:48] BTW
[16:12:49] https://librenms.wikimedia.org/device/device=149/tab=port/port=15354/
[16:15:26] vgutierrez: alright, I'm looking at the 4 interfaces, please reboot when you can
[16:16:09] XioNoX: done
[16:18:04] I'm seeing in traffic on the link to asw2-b-eqiad
[16:18:28] that's eth2 IIRC
[16:18:47] lvs1016 eth2/ens1f0 #3931
[16:19:12] does that mean eth2 is eth0 ? :)
[16:19:29] or that PXE is misconfigured?
[16:19:30] so that means the cards are "backwards" probably
[16:19:38] move the eth2/3 cables to eth0/1 and vice-versa
[16:20:09] you can double-check that PXE is set to the primary interface, but I think it's hard to accidentally get that wrong
[16:20:19] (the first one in the list by BIOS/firmware lists)
[16:20:21] I'll tell Chris that he should check the cables then
[16:20:37] thx XioNoX :)
[16:20:42] probably chris assumed a certain order based on slot# or whatever, and the bios/OS sees it the other way around.
[16:21:00] XioNoX: BTW, which mac address are you seeing for that port?
[16:21:11] no ports are onboard in this case, it's 2x cards, so they're all ensXfY
[16:21:20] (IIRC)
[16:21:35] (the real onboard port is supposedly bios/firmware-disabled)
[16:22:09] bblack: at least ipmitool doesn't report any NIC on the delloem extensions
[16:22:19] so it must be disabled
[16:22:22] right
[16:22:28] to avoid things getting truly-confusing :)
[16:22:36] I checked that as well this EU morning
[16:22:48] vgutierrez: that switch port is configured for trunk only (no native) so it's ignoring the untagged frames
[16:23:07] right that makes sense
[16:23:14] XioNoX: hmmm and monitor traffic inteface --layer2-headers doesn't help, right?
[16:23:29] *interface
[16:23:38] but we do need the primary port to be same-row for various sanity/failover concerns, so we do need to fix it at the "swap the cables on the back of the host" level, rather than hacking around the current cabling in config.
[16:24:21] vgutierrez: nah, monitor traffic interface only shows traffic for the switch's control plane, not transiting
[16:24:42] XioNoX: hmmm it should set the interface in promiscuous mode
[16:24:49] but you're the expert :)
[16:27:43] still only related to the routing engine (control plane)
[16:29:15] oh I get it now
[16:29:54] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4178915 (Vgutierrez) @Cmjohnson, I've been trying to boot lvs1016 with PXE with no luck, after some debugging with @ayounsi we've seen traffic incoming traffic on eth2 (asw...
[16:33:07] bblack: https://logstash.wikimedia.org/app/kibana#/discover/958769b0-4eef-11e8-8e04-89a38b6a810e?_g=()
[16:33:23] bblack: tls data for AES128-SHA users is already showing up :D
[16:40:53] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4178965 (Cmjohnson) @Vgutierrez I flipped the cables. I did put the cables into what is on the card labeled port 1 and port 2 but I think the card is inserted upside down o...
[16:44:10] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4178982 (BBlack) I don't think it was a flip of the two ports on the same card that was needed, but instead switching all the cables between the two cards (order of cards,...
[17:01:28] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4179058 (Vgutierrez) Ok... this is the current picture from what I see: eth0 is still connected to asw2-b:xe-4/0/34 instead of asw2-d:xe-7/0/15 asw2-c:xe-4/0/5 is showing n...
[17:08:55] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4179083 (BBlack) [I still bet if you undo the already-done cable swap, and then switch the two cards' cables (leaving port1/2 ordering the same), this will all magically co...
[17:10:06] /o\
[17:14:19] bblack: summit/offsite/Prague travel dates sorted out?
[17:14:50] vgutierrez: yes on the spreadsheet. I'm still not clear on whether we should file Individual travel reqs or if a group thing is coming
[17:15:59] hopefully mark will tell us how we should proceed O:)
[17:16:10] (but should get some clarity on that sometime during the US day today I think)
[17:16:16] lovely
[17:49:35] Traffic, Operations, Patch-For-Review: Gather 24h data cluster wide of AES128-SHA usage - https://phabricator.wikimedia.org/T193376#4179327 (Vgutierrez) Data is currently being gathered, it can be seen here: https://logstash.wikimedia.org/app/kibana#/discover/958769b0-4eef-11e8-8e04-89a38b6a810e?_g=()
[18:36:13] Traffic, Operations, Performance-Team, Patch-For-Review: Refactor varnishospital and varnishslowlog - https://phabricator.wikimedia.org/T193489#4179457 (Gilles) Open>Resolved
[19:13:22] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4179677 (jeblad)
[23:00:55] netops, Operations, decommission, ops-eqiad: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390#4180565 (RobH)
[23:02:04] Traffic, Operations, decommission, ops-ulsfo, Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#4180571 (RobH)
[23:44:30] Traffic, Operations, hardware-requests, ops-esams: Procure and install LVS and miscellaneous servers - https://phabricator.wikimedia.org/T184068#4180674 (RobH) Open>Resolved This is now being tracked via the procurement task, T183413.
[23:59:22] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#4180716 (RobH) Open>Resolved racktables now has this as https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=3546