[01:36:25] Traffic, Operations, Performance-Team, Reading-Infrastructure-Team, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3317112 (tstarling)
[11:48:13] Traffic, Operations, ops-codfw: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3318249 (Volans)
[11:48:29] sorry, wrong tag, ignore it
[12:39:14] Traffic, Operations, Interactive-Sprint, Maps (Kartographer), Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3318372 (Gehel) I can't seem to reproduce the problem from my browser. Looking at the [[ https://grafana.wikimedia.org/dashboard/db/maps-...
[13:46:02] Traffic, DBA, Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3318555 (jcrespo) Open>stalled
[14:52:55] lvs1007 is not setup in salt (salt-key not accepted on neodymium), ok to fix or was that intentional? (noticed it since lvs1007 missed the perl security update)
[14:55:34] moritzm: it was re-imaged like last friday, and had issues too, so probably a leftover of the problematic reimage
[14:56:25] ah, ok. adding the salt key, then
[15:49:40] bblack: I remember you mentioning some eth firmware issue the other day? I'm having some issues with bnx2 on stretch d-i on ms-be2017 I was wondering if it might be related? symptom being dhcp not working but I can see the card being unwell
[15:49:45] Jun 6 15:46:45 kernel: [ 75.390105] bnx2x: [bnx2x_nic_load:2758(eth0)]Function start failed!
[15:49:47] etc
[16:07:11] godog: exact same problem
[16:07:41] ah, any solution so far ?
[16:07:42] godog: I've got 6x hosts that are identical hardware, with identical bios revs and nic mgmt software revs, bought at the same time. 3 of them fail like that, and 3 don't
[16:08:20] godog: and the failing ones are fine at runtime with jessie + 4.9 kernel (they still have their old installs, since reinstall never got anywhere). they also pxe/dhcp into the installer fine.
[16:08:37] godog: but they fail when they get to the installer's dhcp autoconfig step, because of that bnx2x driver crash
[16:08:50] ugh :(
[16:09:12] I noticed on the _first_ d-i run the card was actually fine and it got an ip, but then it was detected as eth1 not eth0 and d-i failed anyway
[16:09:16] they also installed jessie just fine ~1yr ago
[16:09:35] tried to disable the onboard 1G, next d-i run I got that firmware crash
[16:10:10] my first thought was that my 3-fail + 3-success machines might differ in BIOS settings that help trigger the crash (e.g. SRIOV or CPU virtualization support or something PCI-related)
[16:10:30] but they're HPs and so yeah I haven't been able to get access to the BIOS settings because HP sucks (Java console crap)
[16:11:09] or even ioatdma support? who knows, some kind of BIOS diff might account for tripping the bug
[16:11:44] but since we know they work ok at runtime with jessie+4.9, I was thinking one avenue to work around this might be to update our installer to use 4.9 initially
[16:11:55] but I assume the stretch d-i you're trying already does that?
[16:11:57] indeed, I've come across http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04565693&sp4ts.oid=7271259 too but the "disable vt-d from the bios" doesn't really work
[16:12:04] yeah 4.9
[16:12:23] I'm trying the second option of disable "hp shared memory features" with my tinfoil hat on
[16:12:35] rabdomant stick is ready too
[16:12:59] dowsing stick that is
[16:13:11] godog: if you have backscroll on this channel through yesterday, you can see my ramblings on some related things
[16:13:34] does the d-i image we use include the non-free stuff (firmwares and such?)
[16:13:55] I assume so, so it did load a firmware
[16:14:23] 13:22 < bblack> Jun 5 12:55:47 check-missing-firmware: installing firmware package /firmware/firmware-bnx2x_0.43_all.deb
[16:14:26] 13:22 < bblack> Jun 5 12:56:00 check-missing-firmware: removing and loading kernel module bnx2x
[16:14:29] 13:22 < bblack> Jun 5 12:56:00 kernel: [ 24.144539] bnx2x: Broadcom NetXtreme II 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.78.19-0 (2014/02/10)
[16:14:32] 13:22 < bblack> [same MSI-X IRQ spam]
[16:14:35] 13:22 < bblack> Jun 5 12:56:00 kernel: [ 24.497838] bnx2x 0000:04:00.1: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.8.19.0.fw
[16:15:00] bblack: ah yeah I see it now -- thanks
[16:15:30] waiting for another reboot with "hp shared memory features" turned off heh
[16:17:06] that actually seemed to have worked
[16:17:28] but I don't trust it at this point -- will try another reboot
[16:21:48] QED
[16:21:49] PXE-E63: Error while initializing the NIC
[16:21:52] PXE-M0F: Exiting QLogic PXE ROM.
[16:21:56] on pxe-boot this
[16:29:04] bblack: so the kernel log you posted mentions bnx2x: Broadcom NetXtreme II 5771x/578xx, while lvs1010's dmesg says bnx2x: QLogic 5771x/578xx
[16:29:20] that seems interesting :)
[16:29:45] the pastes are from a much older kernel
[16:29:54] lvs1010 has 4.9, that may be the diff?
[16:30:11] qlogic vs broadcom is mostly branding from acquisition, it's the same chips and cards either way
[16:30:58] yeah it's just that
[16:31:09] lvs1007 now (back on old install with 4.9) also says: [ 4.303576] bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.712.30-0 (2014/02/10)
[16:31:12] lvs1007 also says QLogic (4.4 kernel)
[16:31:16] yeah exactly
[16:31:35] but it's running 4.4, not 4.9
[16:31:37] they probably updated the brand name in some pci id list between the old and new kernels
[16:31:40] oh?
[16:31:45] I guess we hadn't upgraded these
[16:31:46] Linux lvs1007 4.4.0-3-amd64 #1 SMP Debian 4.4.2-3+wmf8 (2016-12-22) x86_64 GNU/Linux
[16:32:10] I'll upgrade it and see what happens, maybe it will fail and that will tell us something
[16:32:38] ok
[16:33:23] we also still have to upgrade the LVSs to jessie 8.8 T164703
[16:33:23] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703
[16:33:44] I'll do that tomorrow if you agree
[16:33:55] CC: moritzm
[16:34:35] oh, 4.9 still needs manual install, not in normal upgrades?
[16:35:31] after dist-upgrade I had to do: apt-get install linux-image-4.9.0-0.bpo.3-amd6
[16:35:39] apt-get install -y linux-meta-4.9
[16:35:42] I think?
[16:35:54] we'll see I guess
[16:38:30] fyi I was able to complete a stretch reinstall by disabling "hp shared memory features" from the nic configuration page in the bios
[16:39:07] yeah, update to 4.9 needs linux-meta-4.9
[16:39:43] ok
[16:42:26] godog: stretch?
[16:42:44] godog: using java bios, or?
[16:43:10] I've tried like 3 different jvm/plugin setups with an older firefox esr and never gotten the thing to launch yet :P
[16:43:19] bblack: no from the console with 'vsp'
[16:43:22] ema: yeah
[16:43:41] godog: are you getting into an actual bios menu from vsp somehow? or just using rbsu cli commands?
[16:44:29] bblack: yeah from vsp, esc-9 when it is initializing
[16:44:40] godog, bblack: are all the affected machines HP? there's a number of HP machines running trusty, which failed to reimage to trusty until Papaul upgraded the firmware:
[16:44:42] https://phabricator.wikimedia.org/T167125
[16:44:49] https://phabricator.wikimedia.org/T166683
[16:44:55] https://phabricator.wikimedia.org/T165739
[16:45:04] might be a similar case, only that we
[16:45:18] might be a similar case, only that we're now trying to install an even more recent kernel
[16:45:29] right
[16:45:34] interesting moritzm, yeah all HP I think so far
[16:45:44] afaic my working+failing cases have the same HP bios/firmware/whatever revs, that I can see
[16:45:59] godog: when I did esc-9 it just put me in an rbsu prompt, I think?
[16:46:23] lvs1007 rebooted fine with 4.9, dmesg says QLogic FWIW
[16:47:36] I'll try again
[16:47:39] in a little while
[16:48:17] regardless, it might be worth having papaul do update magic on lvs1007+ just in case, but maybe a few more cycles of testing on my end first
[16:50:07] bblack: yeah the one with "system configuration" and so on, under system configuration it shows on ms-be2016 also items for the 10gbit cards
[16:50:14] Embedded FlexibleLOM 1 Port 1 : HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter - CNA
[16:50:46] sometimes the up/down keys don't work, resetting the local terminal and trying again makes it work (!)
[19:42:49] Traffic, Operations, Interactive-Sprint, Maps (Kartographer), Regression: Map tiles load way slower than before - https://phabricator.wikimedia.org/T167046#3320381 (debt) p:Triage>Normal
[20:37:06] Traffic, MediaWiki-General-or-Unknown, Operations, Security-Team, and 2 others: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3320680 (Jdforrester-WMF) Mass-moving all items tagged for MediaWiki 1.30.0-wmf.3, as that was never released; ins...
[23:35:24] Traffic, DNS, Operations: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#3321697 (Ladsgroup)