[05:56:44] Traffic, Operations: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758#3306726 (elukey)
[05:58:19] Traffic, Operations: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758#3306738 (elukey) Host depooled manually, tried to run: ``` root@cp3032:/home/elukey# ifconfig eth0 down [3097418.717749] bnx2x: [bnx2x_del_all_macs:8501(eth0)]Failed to delete MACs: -5 [3...
[06:30:09] Traffic, DBA, Operations: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (Marostegui) That is very strange, the last accesses by the top IP we saw yesterday end up at 19:45 for enwiki. The last access that IP has was at "201...
[08:28:21] https://www.imperialviolet.org/2017/05/31/skipsha3.html
[11:16:23] ema, gehel says you changed something in maps caching recently?
[11:17:24] <_joe_> MaxSem: what's the problem?
[11:19:26] some maps services are down with errors that look as if URLs get mangled. curling kartotherian directly works
[11:19:55] see SAL and T164608
[11:19:56] T164608: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608
[11:20:28] MaxSem: but I'm not fully aware of the current config, I know the work started yesterday
[11:20:45] see backlog of this channel (or public logs)
[11:23:01] so: URLs like https://maps.wikimedia.org/geoline?getgeojson=1&ids=Q649 are broken
[11:23:24] as if the ids parameter is not getting through
[11:28:19] eh https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/upload-frontend.inc.vcl.erb#L29
[11:28:55] looks like it needs an if (req.http.host == "<%= @vcl_config.fetch('upload_domain') %>") there too
[11:28:56] that's a no-go for maps
[11:31:57] MaxSem: I can start sending a patch with that based on common sense, but my knowledge of our VCL is very low ;)
[11:32:10] bblack: you already around by any chance? ^^^
[11:35:05] https://gerrit.wikimedia.org/r/#/c/356570/
[11:36:05] I'll call ema
[11:36:10] but at this point I'm wondering if there could be other issues
[11:36:22] like // Look for a "download" request parameter
[11:36:54] he didn't pick up
[11:37:23] paravoid: hey
[11:37:37] ema: issue with maps, see https://gerrit.wikimedia.org/r/#/c/356570/
[11:37:42] hey :)
[11:38:02] but maybe there could be other issues, like the download part or in upload_common_recv?
[11:38:14] download part being: if (req.url ~ "(?i)(\?|&)download(=|&|$)") {
[11:38:53] ema: TL;DR maps is broken, probably because of the maps/upload merge, probably because there's VCL that strips query params
[11:39:21] yeah the patch seems fine, let me doublecheck a sec
[11:39:33] <_joe_> volans: yeah I'd move the conditional around the other if too
[11:39:59] almost all of cluster_fe_recv seems to be for upload only
[11:40:10] MaxSem: while ema is working on that, can I ask you to work on a more functional check for maps?
[11:40:12] but I didn't want to touch more things for an emergency patch ;)
[11:40:16] the bug's https://phabricator.wikimedia.org/T166735 btw
[11:40:35] this is really the kind of thing we should get an alert for
[11:41:06] so let's add one :)
[11:44:15] eh, all the possible functionality is in spec.yaml - except for the stuff that got broken :O
[11:45:27] volans: thanks, the patch looks good. Let's merge it
[11:45:48] ema: thanks, go ahead
[11:45:59] I've amended it with joe's suggestion, which I fully agree with
[11:46:22] yup. Merged
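The merged change (356570, plus joe's suggestion) amounts to wrapping the upload-only URL rewriting in a Host check. A minimal sketch of that shape, written as plain VCL rather than the real ERB template; the literal hostname and the X-Content-Disposition header name are placeholders, not the production values:

```
sub cluster_fe_recv {
    // Sketch only: gate upload-specific URL normalization on the Host header,
    // so maps requests like /geoline?getgeojson=1&ids=Q649 keep their query
    // parameters. "upload.wikimedia.org" stands in for the templated
    // <%= @vcl_config.fetch('upload_domain') %> value.
    if (req.http.Host == "upload.wikimedia.org") {
        // Look for a "download" request parameter before the query string is
        // stripped, and remember it for the response side (placeholder header).
        if (req.url ~ "(?i)(\?|&)download(=|&|$)") {
            set req.http.X-Content-Disposition = "attachment";
        }
        // Upload URLs carry no meaningful query parameters, so drop them to
        // avoid needless cache variants.
        set req.url = regsub(req.url, "\?.*$", "");
    }
}
```

Without the outer Host check, the query-string strip also ran for maps.wikimedia.org, which is why the ids parameter never reached kartotherian.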
[11:46:57] filed as https://phabricator.wikimedia.org/T166776
[11:48:36] confirmed, that fixes the problem (tested on cp3045). Forcing a puppet run on upload nodes
[11:52:57] MaxSem: https://maps.wikimedia.org/geoline?getgeojson=1&ids=Q649 seems to be working fine for me now, want to double-check?
[11:54:25] works!
[11:54:42] thanks everybody, I'll create an incident report later
[11:54:49] awesome, I'm going back to my lunch. Feel free to call if needed
[11:57:22] back to my vacation
[12:02:11] oops
[12:02:15] sorry MaxSem :(
[12:32:54] * gehel is back from the morning off... Thanks Max!
[12:33:05] and everyone else!
[13:26:39] yay fixed before I even looked
[13:26:44] sorry about that :(
[13:27:04] we kind of assumed that if the leaflet interface worked everything was working
[13:28:05] re: the rest of the fe vcl and such, the thinking was that if it doesn't actually harm maps requests, better to avoid tons of pointless conditionals.
[13:30:15] bblack: https://gerrit.wikimedia.org/r/#/c/356583/ goes directly against that line of thinking hehe :)
[13:30:21] if maps isn't paramless, we should probably avoid the double-slash-stripping, too. I think slashes can have other meanings beyond the ?
[13:31:41] ema: it does, but whatever. We can afford another if-condition or two.
[13:32:34] bblack: yeah, in particular I had a WTF moment while looking at the VCL during the issue before and seeing the redirect-to-commons part without a host conditional
[13:33:11] started mumbling "how the... what the..." and then found out that we only call synth 666/7 for upload above :)
[13:33:53] :)
[13:35:23] ema: any bad impacts in the other direction yet? my worst fear was mailbox lag getting notably worse
[13:35:41] bblack: that didn't happen so far
[13:36:38] we did have a bunch of eqiad hosts mbox lagging shortly after the merge yesterday, but nothing out of the ordinary I'd say
[13:50:58] ema: ok so re: maps, I'll plan to switch DNS later today closer to the 24h mark if nothing else pops up as a concern
[13:51:23] and then maybe we'll wait for monday to go further with decom of the hosts and removing all the extraneous puppetization, etc
[13:51:38] sounds good
[13:51:57] bblack: meanwhile, CR amended to also avoid the double-slash-stripping on maps
[13:53:04] ema: virtual +1 (I need to go grab my yubikey before I can do +1s, and I'm still a coffee away from that)
[13:53:42] :)
[13:54:30] actually coffee is a great idea
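The amended CR extends the same Host-gating to slash collapsing. Again only a rough sketch, with an assumed hostname and regex:

```
sub cluster_fe_recv {
    if (req.http.Host == "upload.wikimedia.org") {
        // Sketch: collapse duplicate slashes for upload only, since repeated
        // slashes may carry meaning in maps URLs even outside the query string.
        set req.url = regsuball(req.url, "/{2,}", "/");
    }
}
```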
[14:03:19] sorry people I didn't mention it this morning, but cp3032 had some eth0 troubles
[14:03:48] I depooled it and powercycled it, but didn't re-add it to serving traffic because I wanted to wait for you
[14:03:53] there is a task for it
[14:04:52] oh, thanks!
[14:04:59] T166758
[14:04:59] T166758: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758
[14:06:19] yeah!
[14:06:28] sorry, I should have pinged you earlier on
[14:08:09] fascinating way to explode (see cp3032:/var/log/kern.log)
[14:09:08] Jun 1 04:24:41 cp3032 kernel: [3091924.470794] WARNING: CPU: 2 PID: 0 at /home/zumbi/linux-4.9.13/net/sched/sch_generic.c:316 dev_watchdog+0x220/0x230
[14:09:20] that sounds like the firmware crashing to me
[14:10:31] hmmm interesting
[14:10:57] we've got a lot of uptime + packets on that particular hardware (the host + the nic), so it's odd to see a new error
[14:11:15] I'd guess either "new kernel" or "our new fq scheduler stuff" (esp given the sch_generic reference)
[14:12:28] or simply dying hardware. usually it's the stupid hardware :-)
[14:14:26] yeah usually :)
[14:15:36] we can give it a second go if it looks error-free post-reboot
[14:16:07] if just that host fails again it's probably hardware. if the same error pops up elsewhere eventually, we've probably got a rare new software bug due to new kernels and/or new queueing/scheduling setup
[14:32:55] Traffic, MediaWiki-Cache, MediaWiki-JobQueue, Operations, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3307646 (BBlack) Yeah that was the plan, for XKey to help here by consolidating that down to a single HTCP / PURGE pe...
[14:45:57] other random upcoming stuff: we should probably move to nginx-1.13.1 sometime soon-ish. it has a few fixups that might be nice to get going with soon:
[14:46:49] 1. TCP_NODELAY set for TLS conns, which is nginx's partial mitigation for the same issue that caused us to do a source patch to OpenSSL to raise the buffer size from 4K to 8K. Apparently with the current state of OpenSSL APIs, there is no complete fix possible without patching OpenSSL, but the nodelay may give us a small boost anyways.
[14:47:09] no idea about eventbus, but hooking up to kafka should be fairly easy
[14:47:27] 2. A few different HTTP2 correctness fixes and one actual bugfix (but I don't think the bug is likely, and the bug may have been introduced in the earlier fixes anyways)
[14:47:44] not sure if kafkatee + pipe would scale
[14:48:30] 3. Basic config-level support for enabling TLSv1.3, which we need deployed ahead of when we switch to a 1.3-capable OpenSSL so that our config works across that upgrade barrier (after this we can even pre-configure ssl_ciphersuite for 1.3's eventual deploy)
[14:49:36] also upstream debian has moved their master branch to 1.13.x, which is kind of a big structural change for them (tracking latest-nginx in master instead of tracking stable there and doing latest in their experimental branch)
[14:54:10] it will probably move to unstable in a couple of weeks
[14:54:13] post-stretch release
[14:54:16] fwiw :)
[14:54:22] cool
[14:54:32] Traffic, MediaWiki-Cache, MediaWiki-JobQueue, Operations, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3307770 (daniel) @BBlack I have looked into XKey before, and have been wanting to work on this for a while (see T1524...
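On the XKey idea in T124418 above: the point is that backends tag cacheable responses with an xkey header (for example one tag per page), and a single purge then invalidates every URL variant carrying that tag, instead of one HTCP/PURGE per URL. A rough sketch using vmod-xkey from varnish-modules; the ACL, header names and tag format are assumptions, not anything deployed:

```
vcl 4.0;

import xkey;

acl purge_allowed {
    "127.0.0.1";    // placeholder: restrict purges to trusted hosts
}

sub vcl_recv {
    // Backends would tag responses with e.g. "xkey: enwiki:page:12345";
    // vmod-xkey indexes that response header automatically on insert.
    if (req.method == "PURGE") {
        if (client.ip !~ purge_allowed) {
            return (synth(403, "Forbidden"));
        }
        // softpurge expires the tagged objects but keeps them for grace
        set req.http.n-purged = xkey.softpurge(req.http.xkey);
        return (synth(200, "Purged " + req.http.n-purged + " objects"));
    }
}
```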
[14:56:29] if you're considering upgrades
[14:56:37] think of stretch too :)
[14:56:46] not sure when/how it fits in your plans
[14:56:50] but just keep it in mind :)
[14:57:00] the release is imminent and we're fairly confident about it thus far
[15:02:19] yeah I really want to try that on the dns nodes first. we're due for some new ones very soon now in ulsfo+asia anyways
[15:29:28] the pybal table fell asleep
[15:41:16] Traffic, Operations, ops-ulsfo, Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3307984 (RobH)
[16:07:02] netops, Operations, cloud-services-team, ops-codfw: codfw: labtestvirt2002 switch port configuration - https://phabricator.wikimedia.org/T166564#3308052 (RobH) Open→Resolved a: RobH done!
[20:14:44] Traffic, Analytics, Analytics-Cluster, Operations, User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3309109 (Ottomata)
[21:09:59] bblack: got a moment to chat about https://phabricator.wikimedia.org/T133178 (www.wikimedia.org vs. wikimedia.org)?
[21:22:29] gwicke: not at present, sorry. I did read it yesterday. Sounds complicated for what it is :)
[21:23:22] www makes the most sense as the canonical home of a non-existent wiki
[21:23:49] and the existing wikimedia.org->www.wikimedia.org redirect makes sense on that level too
[21:24:04] yeah, the issue is that the API is not for an existing wiki
[21:24:18] so if we ever create a wiki there again, we'd be in trouble
[21:24:24] the confounding factor is the existence of some cross/meta-wiki dataset of RBs that lives at wikimedia.org?
[21:24:50] yes, without www on purpose
[21:24:52] couldn't we just put that elsewhere, since nothing really belongs at https://wikimedia.org/ ?
[21:25:02] we could revive the rest.wikimedia.org hostname for that purpose, for instance
[21:25:08] but, people find it hard to discover, and they often start by typing www
[21:25:24] anyways, I'm running out the door
[21:25:37] I'm not sure manual discovery of APIs by typing in random URLs is a great use-case for anything :)
[21:25:49] we have existing clients, so just moving would be a lengthy process
[21:26:07] anyway, let's chat later!
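On the www vs. bare-domain question: the edge behaviour being debated, a redirect from wikimedia.org to www.wikimedia.org that still leaves the cross-wiki REST data reachable on the bare domain, could be sketched in VCL roughly as below. This is purely illustrative; the /api/rest_v1/ carve-out and the hostnames are assumptions about the layout, not the production config:

```
sub vcl_recv {
    // Redirect the bare domain to www, but leave the cross-wiki REST API
    // (which deliberately lives at wikimedia.org without www) alone.
    if (req.http.Host == "wikimedia.org" && req.url !~ "^/api/rest_v1/") {
        return (synth(301));
    }
}

sub vcl_synth {
    if (resp.status == 301 && req.http.Host == "wikimedia.org") {
        set resp.reason = "Moved Permanently";
        set resp.http.Location = "https://www.wikimedia.org" + req.url;
        return (deliver);
    }
}
```

The carve-out is exactly the sticking point in the conversation: existing REST clients depend on the bare domain, so neither a blanket redirect nor a quick move to rest.wikimedia.org would be painless.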