[12:52:43] Traffic, netops, Operations, Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3314674 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can...
[12:53:24] so, where I left off friday was that lvs100[789] wouldn't come up in wmf-auto-reimage
[12:53:29] trying again with just lvs1007 now
[12:55:25] that all 3 acted the same means it's something systemic, probably about ethernet port setup somewhere
[12:55:39] they made it through the PXE/dhcp and into the installer, but the installer was failing at ethernet autoconfig...
[12:56:02] bblack: so nothing specific to the reimage script? let me know if you want me to have a look
[12:56:05] I'm starting to think maybe the OS is getting "which port is eth0" wrong
[12:56:33] (which is odd, because they were working fine at runtime with correct port mapping, before reimaging)
[12:56:54] was udev remapped manually maybe?
[12:57:16] I don't think so, but it was like a year ago when they were last reinstalled
[12:57:21] ok
[12:58:31] plus the way the switches are configured, all the non-eth0 ports don't have a native vlan in the switch config (the switch config requires them to speak vlan trunking)
[12:58:43] so I think the dhcp that's working for initial PXE must be working on eth0 too
[12:58:51] mostly unrelated, icinga is complaining about puppet.service failed on lvs1010... a reset-failed should fix it but I'm wondering why puppet ran as a daemon at some point
[12:59:07] lvs1010 was reinstalled friday and left mostly untouched I think
[12:59:15] probably just needs some followup cleanup, I'll take a peek
[13:00:05] ok, thanks
[13:00:07] it was a multi-host reimage with lvs1007-10, but only 1010 made it through the installer. I probably canceled out waiting on the other 3 before the script could finish up nits for lvs1010 post-install
[13:01:52] ok, in case you want me to have a look at the reimage logs just let me know ;)
[13:02:24] ok thanks
[13:07:36] same with anything network related :)
[13:16:20] ok so now I'm starting to understand why none of the simple things I looked at made a difference
[13:16:35] apparently the bnx2x driver/firmware is crashing and stuff during the installer bootup :(
[13:17:30] it could be some kind of scenario like: under runtime kernel 4.9, we loaded some very new matching firmware onto these cards, and now with whatever different kernel the installer is using, the new firmware + old driver is a broken combo
[13:18:52] even after it initially fails autoconfig, if I go put the right ip details in as static config, this is what happens in the installer syslog:
[13:19:06] Jun 5 13:06:56 debconf: --> GET netcfg/confirm_static
[13:19:06] Jun 5 13:06:56 debconf: <-- 0 true
[13:19:06] Jun 5 13:06:56 netcfg[1896]: INFO: Taking down interface lo
[13:19:06] Jun 5 13:06:56 netcfg[1896]: INFO: Taking down interface lo
[13:19:06] Jun 5 13:06:56 netcfg[1896]: INFO: Activating interface eth0
[13:19:09] Jun 5 13:06:56 kernel: [ 680.856107] bnx2x 0000:03:00.0 eth0: using MSI-X IRQs: sp 90 fp[0] 92 ... fp[7] 99
[13:19:12] Jun 5 13:06:56 kernel: [ 680.965332] bnx2x: [bnx2x_nic_load:2716(eth0)]Function start failed!
[13:19:15] Jun 5 13:06:56 netcfg[1896]: INFO: executing: ip addr add 10.64.1.7/22 broadcast 10.64.3.255 dev eth0
[13:19:18] Jun 5 13:06:56 netcfg[1896]: ip: RTNETLINK answers: Network is unreachable
[13:19:34] much earlier there was tons of spam about driver errors when it tried autoconfig setup
[13:19:41] has lvs1010 a different network card?
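Editor's aside on the `ip addr add 10.64.1.7/22 broadcast 10.64.3.255` line in the installer syslog above: the broadcast address can be checked against the prefix with Python's standard `ipaddress` module. This is only a sanity-check sketch (the variable names are illustrative, not from the log); it suggests the static addressing itself was consistent, which fits the later reading that the NIC driver, not the IP config, was the problem.

```python
import ipaddress

# Address and prefix length exactly as passed to `ip addr add` in the syslog.
iface = ipaddress.ip_interface("10.64.1.7/22")

network = iface.network                # the enclosing /22
broadcast = network.broadcast_address  # highest address in that /22

print(network, broadcast)
```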
[13:20:10] no they're all the same hardware
[13:20:26] so how did it make it through :)
[13:20:33] but they might've been in different running-kernel/firmware states, possibly
[13:20:56] some relevant snips from the initial kernel messages for bnx2x on lvs1007:
[13:20:59] Jun 5 12:55:42 kernel: [ 2.655587] bnx2x: Broadcom NetXtreme II 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.78.19-0 (2014/02/10)
[13:21:38] [bunch of MSI-X IRQ info for 4x interfaces, then this for each of eth0-3:]
[13:21:41] Jun 5 12:55:45 kernel: [ 8.877545] bnx2x 0000:03:00.1: firmware: failed to load bnx2x/bnx2x-e2-7.8.19.0.fw (-2)
[13:21:44] Jun 5 12:55:45 kernel: [ 8.877551] bnx2x 0000:03:00.1: Direct firmware load failed with error -2
[13:21:47] Jun 5 12:55:45 kernel: [ 8.877553] bnx2x 0000:03:00.1: Falling back to user helper
[13:21:50] Jun 5 12:55:45 kernel: [ 8.878087] bnx2x: [bnx2x_func_hw_init:5506(eth1)]Error loading firmware
[13:21:53] Jun 5 12:55:45 kernel: [ 8.878098] bnx2x: [bnx2x_nic_load:2685(eth1)]HW init failed, aborting
[13:22:02] Jun 5 12:55:46 check-missing-firmware: missing firmware files (bnx2x/bnx2x-e2-7.8.19.0.fw bnx2x/bnx2x-e2-7.8.19.0.fw bnx2x/bnx2x-e2-7.8.19.0.fw bnx2x/bnx2x-e2-7.8.19.0.fw) for bnx2x bnx2x bnx2x bnx2x
[13:22:06] Jun 5 12:55:47 check-missing-firmware: installing firmware package /firmware/firmware-bnx2x_0.43_all.deb
[13:22:09] Jun 5 12:56:00 check-missing-firmware: removing and loading kernel module bnx2x
[13:22:14] Jun 5 12:56:00 kernel: [ 24.144539] bnx2x: Broadcom NetXtreme II 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.78.19-0 (2014/02/10)
[13:22:43] [same MSI-X IRQ spam]
[13:22:45] Jun 5 12:56:00 kernel: [ 24.497838] bnx2x 0000:04:00.1: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.8.19.0.fw
[13:23:03] Jun 5 12:56:07 kernel: [ 30.805869] bnx2x 0000:03:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[13:23:20] Jun 5 12:56:17 kernel: [ 41.271175] bnx2x: [bnx2x_state_wait:308(eth0)]timeout waiting for state 9
[13:23:23] Jun 5 12:56:18 kernel: [ 42.093982] bnx2x 0000:03:00.0 eth0: using MSI-X IRQs: sp 90 fp[0] 92 ... fp[7] 99
[13:23:26] Jun 5 12:56:18 kernel: [ 42.218861] bnx2x: [bnx2x_attn_int_deasserted2:4131(eth0)]FATAL HW block attention set2 0x20
[13:23:29] Jun 5 12:56:18 kernel: [ 42.218867] bnx2x: [bnx2x_attn_int_deasserted2:4132(eth0)]driver assert
[13:23:32] Jun 5 12:56:18 kernel: [ 42.218870] bnx2x: [bnx2x_panic_dump:929(eth0)]begin crash dump -----------------
[13:23:35] Jun 5 12:56:18 kernel: [ 42.218874] bnx2x: [bnx2x_panic_dump:939(eth0)]def_idx(0x1) def_att_idx(0x2) attn_state(0x1) spq_prod_idx(0x2) next_stats_cnt(0x0)
[13:23:38] Jun 5 12:56:18 kernel: [ 42.218877] bnx2x: [bnx2x_panic_dump:944(eth0)]DSB: attn bits(0x0) ack(0x1) id(0x0) idx(0x2)
[13:23:41] Jun 5 12:56:18 kernel: [ 42.218879] bnx2x: [bnx2x_panic_dump:945(eth0)] def (0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0) igu_sb_id(0x0) igu_seg_id(0x1) pf_id(0x0) vnic_id(0x0) vf_id(0xff) vf_valid (0x0) state(0x1)
[13:23:45] ....
[13:24:17] I guess I'll do some googling, but given the situation that lvs1007-9 all failed and lvs1010-12 all succeeded, and they are supposedly 6x of the same hardware...
[13:25:05] yeah it's tricky to logically figure out what's gone wrong on just the first half here, except maybe that we might've upgraded/changed firmware on one half and not the other and caused some compat issue with the installer kernel or something dumb like that
[13:26:18] on lvs1010 (fully installed) the driver is: [ 4.835904] bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.712.30-0 (2014/02/10)
[13:26:34] and:
[13:26:34] [ 18.066418] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
[13:27:33] 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30)
[13:27:37] ^ installer kernel heh
[13:28:07] still, it claims it succeeded in loading up the old firmware, you'd think at worst it would be an extra reboot cycle to get over this
[13:28:55] ehehe
[13:34:23] so for that exact driver assert issue, there's a google result from supermicro saying it required a bios update to fix
[13:34:30] maybe they have different HP bios revs...
[13:36:52] there's also a related kernel driver bugfix commit, but it first appeared in upstream vanilla kernel release 3.8, so you'd think we already have it
[13:37:00] ( b343d0025b08a1ef543e3cabf8b753d84b938d48 )
[13:38:02] and should be the same between all those lvses
[13:41:03] yeah, but the BIOS updates might not be
[13:41:56] the HPs are so much slower with vsp than dell console :P
[13:42:13] lol
[13:43:33] bblack: is codfw acting as a caching pop when it's not the master, or not serving traffic at all?
[13:44:35] it's a cache, yes
[13:44:46] ok, thx
[13:44:55] the traffic stuff is active/active/active/active regardless of the applayer stuff, basically
[13:45:09] noted!
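Editor's aside on the driver/firmware comparison above (installer on lvs1007 vs. the installed system on lvs1010): the version pairs can be pulled out of the quoted kernel messages mechanically. The sketch below is a hypothetical helper, not a tool used in the channel; the regexes are assumptions fitted to the log lines quoted in this conversation and may not match other bnx2x message formats.

```python
import re

# Patterns fitted to the kernel lines quoted in this log (an assumption,
# not an official bnx2x message format).
DRIVER_RE = re.compile(r"Ethernet Driver bnx2x (\S+)")
FIRMWARE_RE = re.compile(r"direct-loading firmware bnx2x/bnx2x-\w+-([\d.]+)\.fw")

def bnx2x_versions(lines):
    """Return (driver_version, firmware_version) last seen in the given lines."""
    driver = firmware = None
    for line in lines:
        m = DRIVER_RE.search(line)
        if m:
            driver = m.group(1)
        m = FIRMWARE_RE.search(line)
        if m:
            firmware = m.group(1)
    return driver, firmware

# Lines copied from the lvs1007 installer output (3.16 installer kernel).
installer_lines = [
    "bnx2x: Broadcom NetXtreme II 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.78.19-0 (2014/02/10)",
    "bnx2x 0000:04:00.1: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.8.19.0.fw",
]
# Lines copied from lvs1010 after a full install.
installed_lines = [
    "bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.712.30-0 (2014/02/10)",
    "bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw",
]

print("installer:", bnx2x_versions(installer_lines))
print("installed:", bnx2x_versions(installed_lines))
```

Run over the quoted lines, this shows the installer pairing driver 1.78.19-0 with firmware 7.8.19.0, while the installed host pairs 1.712.30-0 with 7.13.1.0.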
[13:50:08] hmm lvs1007+lvs1010 seem to have identical bios revs in rbsu:
[13:50:09] Product ID: 655651-B21
[13:50:09] HP BIOS P71 11/01/2014
[13:50:09] Backup Version 11/01/2014
[13:50:09] Bootblock 03/05/2013
[13:56:02] bblack: FYI https://gerrit.wikimedia.org/r/#/c/357207/1/hieradata/role/common/swift/proxy.yaml in case you run into similar issues with upload's tlsproxy
[13:57:07] uh
[13:57:10] the other option IMO would be to have /var/lib/nginx on disk as opposed to tmpfs, but since nginx is going to just proxy it on localhost it seems moot
[13:57:39] I remember there was a patch for this to tlsproxy recently, did it change the terminators' config?
[13:57:49] no, the default stayed the same
[13:58:22] ok yeah I see it now
[13:58:55] I think the reason it's always been small (100m) on caches is that large uploads are supposed to use chunked encoding (to which that limit doesn't apply)
[13:59:14] (iirc)
[13:59:39] ah I see, yeah that'd make sense, ditto for mw to be using chunked uploads for server side uploads
[14:00:50] I think I'll disable the spooling for now on swift and then see if mw could switch to use chunks instead across the board
[14:23:48] Traffic, netops, Operations, Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3314892 (ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can...
[14:45:46] bblack: turns out I was missing another nginx parameter to turn off buffering, https://gerrit.wikimedia.org/r/#/c/357218 I'm not sure if localssl vs nginx.conf is better but since other buffering parameters are in localssl I went with that
[14:47:59] godog, note also from the nginx wiki about that setting:
[14:48:01] When HTTP/1.1 chunked transfer encoding is used to send the original request body, the request body will be buffered regardless of the directive value unless HTTP/1.1 is enabled for proxying.
[14:48:21] but otherwise I guess I don't see a big problem with the setting
[14:49:31] bblack: *nod* yeah when/if mw uses chunked transfers for uploads internally then we'd be ok and can possibly even revert
[14:50:14] well, I think the directive still applies then, and you might still want it off
[14:51:08] I think their point is that if the client-facing and server-facing protocols are either both 1.0 or both 1.1 or client-facing:1.0+server-facing:1.1, the directive controls buffering.
[14:51:40] but in the client-facing:1.1+server-facing:1.0 scenario, buffering is universally turned on (because it has to buffer to translate chunked to http/1.0 non-chunked)
[14:52:05] so it might be a good idea to ensure the server-facing side of the nginx proxy is speaking 1.1 explicitly to swift.
[14:52:37] oh, and now that I look, I think that's unconditional in localssl
[14:52:54] (it used to be part of the many strange optional things for tlsproxy that are config/hieradata-driven)
[14:52:57] proxy_http_version 1.1;
[14:53:13] so nevermind all that, it's already forced as 1.1 on the server-facing side
[14:54:46] ack, thanks for the explanation though!
[15:03:34] re lvs1007, if I let it reboot back to its old install, it still works fine
[15:04:12] so there's some bug with the kernel+driver of the installer and these machines. it's just... very odd that it only impacts lvs1007-9 and not 10-12
[15:04:29] that seems to imply we made some change at some level that affects tripping this bug, only to those 3
[15:04:41] but I can't for the life of me find a diff yet
[15:04:52] they're running the same bios, the nic ctrl+s settings are the same, etc
[15:07:33] I guess I'll set up the tunnel proxy to mgmt network and try going through all the bios settings via browser (ewww)
[15:15:33] Traffic, Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965#3315104 (Volans)
[15:15:46] bblack: FYI ^^^
[15:15:53] <_joe_> bblack: ewww :P
[15:16:01] <_joe_> that sounds like a pain indeed
[15:16:43] Traffic, Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3315106 (Volans)
[15:17:14] is the md RAID done with the same physical disks of the megaraid?
[15:23:06] Traffic, Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166964#3313134 (Volans) Relating it to T166965
[15:25:35] and of course, the latest firefox for linux has disabled npapi since ~March
[15:25:52] which breaks the ability to use a java plugin, and HP's web-based remote console options are only java or .net :P
[15:32:40] why can't they just have a decent serial console like the dells that lets you go into bios settings :P
[15:39:11] ah? I thought all bios settings were available on console on HPs too
[15:39:38] well they are technically, but only through a very clunky CLI settings thing
[15:40:07] SHOW PCI DEVICE ENABLE/DISABLE
[15:40:12] SET PCI DEVICE ENABLE/DISABLE 0
[15:40:14] etc...
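Editor's aside on the request-buffering exchange at 14:48-14:53: the rule bblack paraphrases can be written as a small truth function. This is a sketch of the documented behaviour only, not nginx code. The quoted sentence matches the nginx documentation for `proxy_request_buffering`; that the Gerrit change touches exactly that directive is an assumption, and the function and parameter names are illustrative.

```python
def request_body_buffered(directive_on: bool,
                          chunked_request: bool,
                          upstream_http_version: str) -> bool:
    """Model of when the proxy buffers the request body.

    directive_on:          value of the buffering directive (assumed to be
                           proxy_request_buffering).
    chunked_request:       client sent the body with HTTP/1.1 chunked encoding.
    upstream_http_version: protocol spoken to the backend, "1.0" or "1.1"
                           (what proxy_http_version selects).
    """
    # A chunked body cannot be relayed as-is over HTTP/1.0, so it is
    # buffered regardless of the directive value.
    if chunked_request and upstream_http_version != "1.1":
        return True
    # In every other combination the directive decides.
    return directive_on

# With `proxy_http_version 1.1;` forced, turning the directive off is honoured
# even for chunked uploads.
print(request_body_buffered(False, True, "1.1"))
# Against an HTTP/1.0 upstream, chunked uploads are buffered anyway.
print(request_body_buffered(False, True, "1.0"))
```

This is why the `proxy_http_version 1.1;` line in localssl settles the question: with it in place, the directive alone controls buffering.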
[15:40:29] but there's no apparent way to just view all of the normal bios config and compare it between two servers that I've seen yet, via the CLI
[15:41:03] SET CONFIG INTEL(R) HYPERTHREADING OPTIONS 1
[15:41:07] crazy crap
[15:41:21] ah yeah via the cli I don't know
[15:45:22] the https interface has a bunch of info and diagnostics and whatnot, but no actual menu of bios settings
[15:45:36] for that, you use the web menu option to launch a remote console (java or .net)
[15:46:13] so now I'm like 34 levels into this rabbithole and trying to figure out how to get FF 52 ESR (32-bit) installed and java working on it :P
[15:52:31] next thing you know and debugging local machine bios settings via jtag is involved
[16:11:41] bblack: btw if https://gerrit.wikimedia.org/r/#/c/357218 looks good to you I'll merge
[16:13:19] yeah lgtm
[16:13:24] I got logged out of gerrit again :P
[16:13:57] ok there we go
[16:14:35] yeah I was logged out this morning too, not quite sure what causes it heh
[16:14:38] thanks tho
[16:46:56] Traffic, Operations, Interactive-Sprint, Maps (Kartographer), Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3315566 (MaxSem)
[19:41:44] zayo circuit down between ulsfo and codfw, no impact (traffic routed around), zayo is working on it
[19:46:35] Traffic, Operations, Interactive-Sprint, Maps (Kartographer), Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3315566 (BBlack) Are you comparing cache hits to cache misses? From where? What was the timing like before?
[19:53:17] Traffic, DNS: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (Bawolff)
[19:57:52] Traffic, Operations, Interactive-Sprint, Maps (Kartographer), Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3316274 (BBlack) Another thought - could we be maxing out parallel connections to the kartotherian machines? We've always had a `max_con...