[07:50:20] morning
[07:50:43] ema: do you have any experience querying ES api to consume logstash.w.o data?
[08:12:55] vgutierrez: hey!
[08:14:13] vgutierrez: I did once give it a go, yes
[08:14:23] curl "https://$LDAP_USER:$LDAP_PASS@logstash.wikimedia.org/elasticsearch/_search?q=logger_name:varnishslowlog request-Host:en.wikipedia.org"
[08:14:37] but yeah I haven't gone further than this :)
[08:20:15] yup... also _msearch endpoint is handy
[08:20:58] I'm going to adapt my python script to consume that API instead of log files to classify UAs
[08:33:43] nice!
[09:44:46] nice changes ema :D
[09:45:31] let's see if pcc agrees with you!
[09:46:32] seems so: https://puppet-compiler.wmflabs.org/compiler02/11107/
[09:46:53] yup
[12:36:24] 10Traffic, 10Operations: Enable numa_networking on all caches - https://phabricator.wikimedia.org/T193865#4181953 (10ema)
[12:36:32] 10Traffic, 10Operations: Enable numa_networking on all caches - https://phabricator.wikimedia.org/T193865#4181964 (10ema) p:05Triage>03Normal
[13:02:40] mmh interesting
[13:02:52] I've flipped numa_networking: on for canary/misc: https://gerrit.wikimedia.org/r/#/c/430896/
[13:03:04] no changes on cp2006 though
[13:04:09] I see in manifests/realm.pp that the flag is ignored `if size($facts['numa']['nodes']) <= 1`, but cp2006 has 2 numa nodes
[13:09:21] ema: because the numa_networking hieradata setting *only* works in per-host hieradata
[13:09:52] (because it's picked up as a global in manifests/realm.pp, and that gets dealt with before node definitions that pull in roles, etc)
[13:10:33] so much for "declarative" and ordering not mattering, etc :P
[13:11:03] mmh I see discovery::app_routes also being accessed in manifests/realm.pp, and yet it's defined in hieradata/common/discovery.yaml
[13:12:07] oh but that does not depend on roles, I see
[13:13:17] right, I guess I should amend my statement above, it's not that numa_networking doesn't work anywhere but host-level hieradata, it's just that it doesn't work in the most-useful place in hieradata, which would be role-based stuff
[13:13:59] hiera_lookup does the right thing tho, mildly infuriating
[13:14:04] $ ./utils/hiera_lookup --fqdn=cp2006.codfw.wmnet --roles=cache::misc numa_networking
[13:14:07] true
[13:14:13] $ ./utils/hiera_lookup --fqdn=cp3032.esams.wmnet --roles=cache::text numa_networking
[13:14:18] anyways, in the long term we should kill that hieradata / global and have numa_networking be a feature turned on by roles when they need it.
[13:14:31] but for now, we're in transition mode from experimentation on a large important role :)
[13:15:27] ema: right, the hiera() lookup would work from anywhere with correct role-stuff. but the templates/manifests that actually vary their behavior on $numa_networking don't have hiera() calls. They use the global $::numa_networking, which is set in manifests/realm.pp from a hiera() call before roles are defined.
[13:15:52] we could "fix" that, but then hiera() calls would exist in places they shouldn't by our coding standards
[13:41:15] mostly out of curiosity, I've tried to move the call to tlsproxy::instance
[13:41:26] but that's not a valid place for hiera calls, right
[13:41:33] (though we do have a bunch already)
[13:58:02] ema: hmmm ensure => absent on a systemd::service is the same as stopped AND absent?
[14:01:05] I think only present|absent are valid values actually
[14:01:17] now that does not answer your question, I know
[14:02:44] hmmm as long as it stops the service before removing it.. :)
[14:03:49] but yeah I'm pretty sure that's the case
[14:04:19] nice, I'll merge this later: https://gerrit.wikimedia.org/r/#/c/430911/
[14:04:29] to let our ELK rest for the weekend :)
[14:05:37] but IMHO it's the best way to do this kind of data collection
[14:10:23] bblack: BTW, do we need to do something else regarding the summit? or are reservations handled for the whole group?
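The API-based UA classification mentioned above could look roughly like this in Python; a minimal sketch, assuming made-up function names and a toy classifier (the actual script isn't shown in the channel), with the real request authenticating via LDAP credentials as in the curl example:

```python
import re

def search_body(logger, host, size=1000):
    """Build an Elasticsearch _search request body for logstash events.
    Field names mirror the curl example above; everything else is illustrative."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"match": {"logger_name": logger}},
                    {"match": {"request-Host": host}},
                ]
            }
        },
    }

def classify_ua(ua):
    """Crude User-Agent bucketing for slowlog analysis (toy rules, not the real script)."""
    if re.search(r"bot|crawler|spider", ua, re.I):
        return "bot"
    if "Mobile" in ua:
        return "mobile"
    return "desktop"

# The real script would then POST this against the authenticated endpoint, e.g.:
#   requests.get("https://logstash.wikimedia.org/elasticsearch/_search",
#                auth=(LDAP_USER, LDAP_PASS), json=search_body("varnishslowlog", "en.wikipedia.org"))
```

For batching several queries in one round trip, the `_msearch` endpoint mentioned above takes newline-delimited header/body pairs instead of a single JSON document.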
[14:13:51] there's usually a travel form for that once the date/destination is final
[14:14:25] and once it's been decided who's staying a day late/early to provide coverage when some of us are flying
[14:23:36] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182294 (10Cmjohnson) @Vgutierrez I did what bblack suggested and switched the cables to the opposite card. Let's see if the magic works
[14:29:29] [ 5.387737] bnx2x 0000:04:00.0 enp4s0f0: renamed from eth0
[14:29:29] [ 5.436227] bnx2x 0000:04:00.1 enp4s0f1: renamed from eth1
[14:29:30] [ 5.504292] bnx2x 0000:05:00.1 enp5s0f1: renamed from eth3
[14:29:30] [ 5.556052] bnx2x 0000:05:00.0 enp5s0f0: renamed from eth2
[14:29:48] bblack, ema: interface naming on the new lvs1013-lvs1016 boxes
[14:32:20] yeah, re: travel, travel should send us some emails about booking
[14:32:35] probably not today, just approvals were today. early next week?
[14:32:41] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182305 (10Vgutierrez) Awesome, I just confirmed the new interface naming for lvs1016: * eth0 -> enp4s0f0 * eth1 -> enp4s0f1 * eth2 -> enp5s0f0 * eth3 -> enp5s0f1
[14:33:01] bblack: ok, just asking cause yesterday it seemed pretty urgent :)
[14:33:04] vgutierrez: interface naming looks sane/expected
[14:33:41] bblack: yup, we don't have ensXXXX because Dell BIOS doesn't report PCI slots (unlike the HP BIOS in the lvs instances @ codfw)
[14:33:51] should double-check via lldp or something that port<->switch mapping is as expected I guess
[14:33:56] yep
[14:34:05] I'll definitely do that later
[14:34:29] I think the "s" in ensX means system (or "slot", but the point is they're onboard)
[14:34:44] whereas enpXsYfZ is for pci cards (which all of these ports are on)
[14:34:53] hmmm nope, onboard are enoXXX
[14:35:05] oh, right, I'm confused
[14:35:27] so with enpXsYfZ, p must be the pci bus?
[14:35:38] yup
[14:35:48] bus-slot-fru?
[14:35:56] who knows :)
[14:36:18] can I log in there yet? I'm curious if the two pci busses are numa-separated too
[14:36:30] not yet
[14:36:42] oh new_install works though right?
[14:36:45] peeking for a sec
[14:37:10] right, that should work
[14:37:24] p[bus]s[slot]f[function]d[dev] that's the naming :)
[14:38:36] not numa-split
[14:38:42] root@lvs1016:~# cat /sys/class/net/*/device/numa_node
[14:38:42] 0
[14:38:42] 0
[14:38:42] 0
[14:38:42] 0
[14:39:05] although numa-split might have been more-interesting/efficient eventually, but same-as-before is a good thing too :)
[14:41:03] ok I'm back off the host. confirmed from lldpd the switch mapping is all correct too
[14:41:25] p4f0 -> D, p4f1 -> A, p5f0 -> B, p5f1 -> C
[14:42:17] awesome progress on a long-stalled project. we've been trying to replace lvs1001-6 for like 2 years now or something and failing :)
[14:43:22] awesome
[14:43:34] thx for the lldpd stuff :D
[14:43:57] ema: could you unblock the varnishtlsinspector change before going off for the weekend? :)
[14:46:55] <3
[14:47:00] looks good!
[14:47:43] the `enable => true` thing was tricky, you were asking systemd::service to enable the service and ensure its absence :)
[14:49:06] systemd::service should probably say something in that case? 'cowardly refusing to enable a removed service'?
[14:49:29] "go home, you're drunk!"
[14:49:43] or that, yes
[14:49:45] bblack: what's the reason to skip one instance on one subnet on interfaces.yaml?
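The naming being puzzled out above follows systemd's predictable-interface-name scheme: `enp<bus>s<slot>f<function>` for PCI ports (so `enp4s0f0` = PCI bus 4, slot 0, function 0, matching the `0000:04:00.0` in the dmesg lines), `eno<N>` for firmware-indexed onboard ports. A small sketch of a parser for it, with hypothetical function names:

```python
import re

# enp<bus>s<slot>f<function>; the f part is omitted for single-function devices.
PCI_NAME = re.compile(r"^enp(\d+)s(\d+)(?:f(\d+))?$")

def parse_pci_ifname(name):
    """Return (bus, slot, function) for an enpXsYfZ-style name, or None
    for onboard (enoX) / hotplug-slot (ensX) / legacy (ethX) names."""
    m = PCI_NAME.match(name)
    if not m:
        return None
    bus, slot, fn = m.groups()
    return int(bus), int(slot), int(fn or 0)
```

This also lines up with the later NUMA check: all four names parse to buses 4 and 5, and `/sys/class/net/*/device/numa_node` reported node 0 for both buses.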
[14:50:28] public1-a-eqiad: lvs1001-lvs1003 and public1-b-eqiad: lvs1004-lvs1006
[14:52:02] because that's their home-row (they're not split across rows like the new ones, there's 3x primaries in b and 3x secondaries in a)
[14:52:15] the interfaces.yaml data is everything but the primary host IP/interface
[14:52:26] oh right
[14:52:37] primary host IP on lvs1001-1006 is the public one and not the private
[14:52:42] that was confusing me
[14:52:43] thx
[14:53:29] apparently the lvs2xxx were set up that way too (3x per row in 2x rows), but we're replacing those this quarter anyways with 4 hosts, we can spread rows better this time around
[14:54:26] bblack: BTW, do we know what's a suitable txqueuelen for the new lvs1013-1016?
[14:55:12] oh did I never merge the patch that standardized that away?
[14:55:37] I think it must've been part of some other patch I ended up abandoning
[14:55:49] hmmm
[14:56:58] vgutierrez: in any case, 10K for now like current codfw
[14:57:11] ack
[14:58:22] ideally we shouldn't even be using huge txqlens, but to make up for it we probably need to use a non-default tx qdisc instead, but it's a tricky thing to go test and do, I'm not sure how ipvs-routing and qdisc interact.
[14:58:48] for now, we know the existing setup isn't causing us huge issues anyways
[15:31:13] bblack: https://gerrit.wikimedia.org/r/#/c/430927/ --> let me know if I'm missing something :D
[15:36:41] bblack: I've cooked something to move numa_networking out of realm.pp that seems to work https://gerrit.wikimedia.org/r/#/c/430902/
[15:37:04] I'm not sure how puppet-kosher that is, but at least jenkins seems pleased
[15:37:38] actually.. I just added an explicit bgp = no.. just in case :)
[15:50:38] ema: yeah but then the next target won't be tlsproxy, e.g. it will be the ipvs hosts, or other software :)
[15:53:44] anyways, no matter what it's not "kosher"
[15:54:46] really, the basics of numa_networking should be default-on at the interface-rps level, but we have to use hiera to get through the initial transition, then we can just force it on.
[15:55:55] (oh and of course the change should not be merged before renaming the setting for the individual hosts that have it on now!)
[15:56:04] (now I'm off for real) :)
[15:58:52] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182492 (10ayounsi) >>! In T184293#4182305, @Vgutierrez wrote: > Awesome, I just confirmed the new interface naming for lvs1016: > * eth0 -> enp4s0f0 > * eth1 -> enp4s0f1 > *...
[15:59:10] vgutierrez: minor -1 on the regexes for bgp:no and low-traffic, but I think just fix those up to make lvs1016 separate from existing regex as a lone host and it's good to go.
[16:05:16] bblack: great
[16:09:08] bblack: hmmm instead of 1 host regex maybe it would be better to add it to lvs1016.yaml :)
[16:09:19] part of the rationale there is we're going to bring up lvs1016 in a temporary role at first, just to emergency-patch the lvs1003 situation
[16:09:27] later when we have all 4, we'll reconfigure them all differently
[16:09:37] yeah better lvs1016.yaml for bgp:no
[16:10:05] hmm and for profile::pybal::primary: false
[16:10:16] yeah
[16:10:22] the regex file is evil and dangerous anyways
[16:11:13] (because the first entry in the whole file that matches is the only one that takes effect. if someone adds "lvs*: foo::new::feature: false" at the top, all those lvs entries at the bottom become suddenly-inactive :P
[16:11:17] )
[16:14:45] 10Wikimedia-Apache-configuration, 10cloud-services-team, 10wikitech.wikimedia.org, 10Patch-For-Review, 10Regression: https://wikitech.wikimedia.org/view/ no longer redirects to /wiki - https://phabricator.wikimedia.org/T193848#4181487 (10Krinkle)
[16:21:32] 10Wikimedia-Apache-configuration, 10cloud-services-team, 10wikitech.wikimedia.org, 10Patch-For-Review, 10Regression: https://wikitech.wikimedia.org/view/ no longer redirects to /wiki - https://phabricator.wikimedia.org/T193848#4181487 (10Legoktm) Did wikitech used to have $wgActionPaths enabled?
[17:05:43] hmm I missed something regarding BGP config
[17:05:44] bgp-peer-address = (unspecified)
[17:10:12] bblack, paravoid, we lost the 4 onboard ports on cr1-eqsin, a quick look seems to indicate a similar problem as before, I also see a drop of traffic, we should probably depool the site
[17:12:12] :_(
[17:13:54] bblack: https://gerrit.wikimedia.org/r/#/c/430940/
[17:17:47] XioNoX: ack
[17:17:52] XioNoX: I guess that you need to do something BGP-wise to set up lvs1016, right?
[17:18:15] vgutierrez: let's leave it for monday, no point taking risks any further on it today
[17:18:22] bblack: sure
[17:18:32] bblack: bgp = no will stay till monday
[17:18:39] bblack: I just wanted to know next steps :D
[17:21:25] btw, lvs1016 is already accessible via SSH
[17:28:10] bblack: BTW
[17:28:12] Error: /Stage[main]/Profile::Lvs/Profile::Lvs::Interface_tweaks[enp4s0f0]/Interface::Rps[enp4s0f0]/Exec[ethtool_rss_combined_channels_enp4s0f0]/returns: change from notrun to 0 failed: ethtool -L enp4s0f0 combined 16 returned 1 instead of one of [0]
[17:30:28] bblack, paravoid, "mrlic: Built-in port license is not installed. Deactivating builtin-ports accordingly."
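The regex-file footgun described above ("the first entry in the whole file that matches is the only one that takes effect") can be sketched as a toy first-match-wins lookup; the entries and keys here are hypothetical, modeled on the chat description rather than hiera's actual regex backend code:

```python
import re

def regex_lookup(entries, fqdn, key):
    """entries: ordered (pattern, data) pairs, like entries in the regex file.
    The first entry whose pattern matches the host is used; later matching
    entries are ignored entirely, even if only they define the key."""
    for pattern, data in entries:
        if re.search(pattern, fqdn):
            return data.get(key)
    return None

entries = [
    ("^lvs", {"foo::new::feature": False}),            # careless broad entry at the top...
    (r"^lvs1016\.", {"profile::pybal::primary": False  # ...shadows this specific one
                     }),
]
```

With the broad `^lvs` entry first, looking up `profile::pybal::primary` for lvs1016 yields nothing, even though a specific entry exists further down; hence the preference above for per-host `lvs1016.yaml` over one-host regexes.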
[17:31:32] but config says the license is there, so it might "only" be a software bug
[17:37:47] 10Traffic, 10Operations: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4182796 (10ayounsi)
[17:38:20] opened https://phabricator.wikimedia.org/T193897
[17:47:43] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4182828 (10Vgutierrez) ``` root@lvs1016:~# ethtool -l enp4s0f0 Channel parameters for enp4s0f0: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 15 Current hardware settin...
[17:51:03] 10netops, 10Operations: Implement BGP graceful shutdown - https://phabricator.wikimedia.org/T190323#4182848 (10ayounsi) 05Open>03Resolved
[17:51:37] 10Traffic, 10netops, 10Operations: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4182849 (10ayounsi)
[18:05:45] 10Traffic, 10Operations, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#4182875 (10BBlack)
[18:16:07] XioNoX: we should get "I <3 Juniper" t-shirts for the whole team :)
[19:49:59] vgutierrez: I downtimed lvs1016 through tuesday just in case. I'm gonna reboot it and see if that clears up the puppet->rps failure setting hardware queue counts on the first eth card...
[20:12:00] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4183141 (10BBlack) The key to the ethtool difference is this in the lspci stuff: ` Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-` vs ` Capabilities: [a0] MSI-X: Enable+ C...
[22:41:32] 10Traffic, 10Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#4183398 (10RobH)
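Tying the two threads above together: the puppet exec failed because `ethtool -L enp4s0f0 combined 16` asked for more combined channels than this bnx2x NIC's pre-set maximum of 15 (shown in the T184293 paste). A sketch of parsing `ethtool -l` output and clamping the request to the hardware maximum; this is an illustration, not what interface-rps actually does:

```python
def max_combined(ethtool_l_output):
    """Parse the 'Pre-set maximums' Combined value from `ethtool -l` output."""
    in_max = False
    for line in ethtool_l_output.splitlines():
        line = line.strip()
        if line.startswith("Pre-set maximums"):
            in_max = True
        elif line.startswith("Current hardware settings"):
            in_max = False
        elif in_max and line.startswith("Combined:"):
            return int(line.split(":")[1])
    return None

# Sample output shaped like the lvs1016 paste in T184293.
SAMPLE = """\
Channel parameters for enp4s0f0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       15
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       15
"""

# Clamping the desired queue count to the hardware limit would have run
# `ethtool -L enp4s0f0 combined 15` instead of failing on 16.
wanted = min(16, max_combined(SAMPLE))
```

The 16 came from the CPU-count-driven default; the NIC only exposes 15 usable queues here (per the later phab comment, the MSI-X vector count differs between the two cards).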