[01:30:31] 10netops, 10Operations: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317#4192939 (10ayounsi) The last change to be applied: ```lang=diff + route-filter 0.0.0.0/0 prefix-length-range /25-/32; - route-filter 0.0.0.0/0 prefix-length-range /27-/32; ``` Will cause 135 invalid pref...
[02:08:13] 10netops, 10Operations: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317#4192977 (10ayounsi) 05Open>03Resolved All done!
[09:59:42] well... I fixed the MSI-X / rps issue on lvs1016 \o/
[10:02:41] yeee! :)
[10:03:04] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4193457 (10Vgutierrez) So MSI-X limit can be changed on the NIC BIOS, it was set to 16 for enp4s0f0, after setting it to 32 **and power cycling** the server, lspci showed the...
[10:04:50] MSI-X on the NIC BIOS was limited to 17... one ethernet was honoring the BIOS limit and the other one was ignoring it
[10:05:15] I set it to 32 in both NICs and *after* power cycling the server, the issue was solved
[10:08:05] so lvs1016 is ready.. I need to talk with XioNoX to let it announce BGP routes but that's it :D
[10:11:19] that's awesome
[10:15:16] so I've added a dashboard that shows the current overall true hitrate compared to last week https://grafana.wikimedia.org/dashboard/db/varnish-caching-last-week-comparison?refresh=15m&orgId=1&from=now-1h&to=now&var-cluster=text&var-site=esams&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[10:16:29] that's quite useful together with the varnish restart annotations to keep an eye on how badly your manual restarts (and those done by cron!) are affecting the hitrate
[10:17:18] those annotations are pretty cool :D
[10:17:26] if you hover over the red arrow at the bottom of the annotation with your mouse pointer you see which host's backend has been restarted and which cluster it belongs to
[10:17:43] :clap:
[10:43:31] <_joe_> ema: how did you import those annotations?
[10:43:55] <_joe_> it's pretty neat
[10:44:48] _joe_: yes! I'm currently documenting the feature :)
[10:45:24] <_joe_> ok
[10:45:55] <_joe_> we met with the grafana team @kubecon, they told us they're using our dashboards as a showcase from time to time
[10:46:15] <_joe_> and, they want us to upgrade to grafana 5 :P
[10:46:28] nice
[10:47:21] ema: that's another prometheus query right?
[10:47:35] volans: correct
[10:47:56] I think most of the annotation will not end up there; for flexible and high-load annotations I would suggest elasticsearch as a backend
[10:48:18] for things like deploys and such
[10:48:49] there was also some related discussion on T175708
[10:48:49] T175708: Add annotations per URL tested in WebPagetest - https://phabricator.wikimedia.org/T175708
[11:08:45] volans: what do you mean with "most of the annotation will not end up there"?
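For reference, a week-over-week comparison like the dashboard linked at 10:15 can be expressed as a single Prometheus query using the `offset` modifier, and the restart annotations are, as noted above, just another Prometheus query. The sketch below shows the general shape against the Prometheus HTTP API; the metric name, labels and endpoint URL are illustrative assumptions, not the exact expressions behind those panels.

```lang=python
# Minimal sketch: compare the current Varnish hitrate with the same expression
# evaluated one week earlier, via the Prometheus HTTP API.
# ASSUMPTIONS: the metric name varnish_requests_total, its labels and the
# endpoint below are placeholders, not the real production expressions.
import requests

PROM_QUERY_API = "http://prometheus.example.org/api/v1/query"  # hypothetical

def hitrate_expr(offset: str = "") -> str:
    # offset is e.g. "offset 1w"; it has to sit inside each range selector.
    rng = f"[5m] {offset}".rstrip()
    return (
        f'sum(rate(varnish_requests_total{{cache_status="hit"}}{rng}))'
        f" / sum(rate(varnish_requests_total{rng}))"
    )

def instant(query: str) -> float:
    """Run an instant query and return the first value of the result vector."""
    resp = requests.get(PROM_QUERY_API, params={"query": query}, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

now = instant(hitrate_expr())
week_ago = instant(hitrate_expr("offset 1w"))
print(f"hitrate now {now:.3f}, a week ago {week_ago:.3f}, delta {now - week_ago:+.3f}")
```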
[11:09:15] that to be able to use prometheus as an annotation source you need to want to annotate with something you already have in there
[11:09:42] and I don't think that things like deployments and such should be saved in prometheus as metrics, they are clearly not
[11:09:51] ah I see what you mean
[11:09:51] in this case it's neat that you already had the data there
[11:11:36] we shouldn't add prometheus metrics to for things that are not a metric, that makes sense
[11:12:10] s/to //
[11:12:22] yeah
[14:11:26] bblack: so, reviewing T147202 you mentioned the possibility of running a CentralNotice campaign to warn users and ask their $sysadmin to upgrade their evil boxes
[14:11:26] T147202: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202
[14:12:48] almost 60% of the AES128-SHA traffic is under that scenario, so maybe it's worth the effort
[14:15:12] 10Traffic, 10DBA, 10Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4194193 (10Rduran) hpenc looks interesting, so maybe we can keep it in mind for future improvements.
[14:16:45] 10Traffic, 10DBA, 10Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4194198 (10jcrespo) Yes, I was not suggesting to do it now, just document the suggestion for the future- or maybe they can even set it up for us in parallel. Changing the algorithm, assuming o...
[14:19:06] vgutierrez: the tradeoff is our simplistic page replacement is easier for us to manage. CN won't work at all for the legit ancient UAs, so we have to do that anyways. The ? is whether it's worth the effort to do the nicer CN variant for the non-legacy UAs (which involves other teams and wiki admin and finding a pathway to trigger it via varnish headers, etc)
[14:20:45] vgutierrez: I'm around if you want to do BGP stuff
[14:22:25] XioNoX: great :D
[14:24:48] 10Traffic, 10DBA, 10Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4194211 (10Vgutierrez) in stretch chacha20 is available as "chacha20" and in jessie as "chacha20-poly1305", BTW for big enough block size (16384 bytes), chacha20 performs better than rc4 on on...
[14:25:12] XioNoX: what do you need from me?
[14:26:43] vgutierrez: the IP pybal is listening on
[14:27:47] bgp-local-ips = [ '10.64.49.16' ]
[14:29:18] bblack: regarding our /sec-warning page, we need to reach somebody to help us translate the new message, or just go with English
[14:29:40] IIRC the PS3 traffic was coming from JP
[14:30:00] let me run an old GeoIP lookup on the captured IPs and let's see what we get
[14:30:44] vgutierrez: is pybal configured and ready to go? (aka will it start advertising prefixes as soon as I add the router config?)
[14:31:04] nope
[14:31:10] it has BGP disabled :)
[14:31:29] but otherwise it's ready to go
[14:31:46] right now it's pointing to cr2-eqiad
[14:31:58] cool, I'll add the router config on cr1-eqiad
[14:32:00] ah
[14:32:06] then cr2 :)
[14:32:07] I can switch the router
[14:32:34] just let me know
[14:32:35] doesn't really matter, which other LVS is it master/backup with?
[14:32:59] right now it's being set up as a third LVS for 1003 and 1006
[14:35:57] 10Traffic, 10DBA, 10Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4194233 (10Rduran) Thank you both! I'm using "chacha20" right now and it seems to work just fine (I'm using buster, but stretch is also on 1.1.0). Does jessie need to be supported too?
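To make the T156462 discussion above a bit more concrete, here is a rough sender-side sketch of the kind of pipeline being considered: tar the data, encrypt it with the "chacha20" cipher name quoted above (stretch's openssl 1.1.0), and push it over the LAN. The destination port, the use of netcat as transport and the env-var passphrase are assumptions for illustration, not the framework's actual design, and the key handling is not production-grade.

```lang=python
# Rough sketch of a sender for a chacha20-encrypted LAN transfer.
# ASSUMPTIONS: netcat transport, port 4444 and env-var passphrase handling
# are illustrative only; a real framework would manage keys properly.
import os
import subprocess

def send_directory(path: str, dest_host: str, port: int = 4444) -> None:
    """tar a directory, encrypt the stream with openssl chacha20, pipe to nc."""
    env = dict(os.environ, TRANSFER_PASS="demo-passphrase-only")
    tar = subprocess.Popen(["tar", "-c", "-C", path, "."], stdout=subprocess.PIPE)
    enc = subprocess.Popen(
        ["openssl", "enc", "-chacha20", "-pass", "env:TRANSFER_PASS"],
        stdin=tar.stdout, stdout=subprocess.PIPE, env=env,
    )
    # Depending on the netcat flavour, you may need "-q 0" or "-N" so nc exits at EOF.
    nc = subprocess.Popen(["nc", dest_host, str(port)], stdin=enc.stdout)
    # Close our copies of the pipe ends so EOF propagates down the chain.
    tar.stdout.close()
    enc.stdout.close()
    for proc in (nc, enc, tar):
        if proc.wait() != 0:
            raise RuntimeError(f"{proc.args[0]} exited with {proc.returncode}")

# The receiving side would mirror it, e.g.:
#   nc -l 4444 | openssl enc -d -chacha20 -pass env:TRANSFER_PASS | tar -x -C /dest
#   (or "nc -l -p 4444", depending on the netcat flavour)
```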
[14:36:10] vgutierrez: neighbor added on cr2
[14:36:32] 10Traffic, 10DBA, 10Operations: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4194234 (10jcrespo) No, stick to stretch, that is ok- that is the target.
[14:36:55] XioNoX: once it's up and running and confirmed that all is sane, we're going to switch it in as primary (and make it the static default), with the legacy 1003/1006 as the backups.
[14:37:05] let me know when you enable BGP so I can verify prefixes, etc...
[14:37:16] bblack: sounds good
[14:37:18] going to disable puppet and do it manually
[14:39:50] XioNoX: restarting pybal with BGP enabled..
[14:40:23] May 09 14:40:03 lvs1016 pybal[18603]: [bgp.BGPFactory@0x7f3c8f76f638] INFO: BGP session established for ASN 64600 peer 208.80.154.197
[14:40:26] looking good here
[14:40:43] established, receiving the same number of prefixes as 1006
[14:40:56] MED of 100
[14:41:03] (like 1006)
[14:41:21] XioNoX, bblack: BTW, in that scenario we will have 1 route with MED 0 and 2 routes with MED 100, so I guess the two secondary lvs instances are going to play active-active?
[14:41:29] https://www.irccloud.com/pastebin/dQow3Mm4/
[14:41:48] vgutierrez: no, we're going to have to transition that, which is tricky
[14:42:08] hmm some puppet magic is needed then :)
[14:42:09] (we don't do A/A from router->lvs ever yet, although it's a future possibility with the router hashing ECMP)
[14:42:45] the way we'll have to handle the transition is some variant of this:
[14:43:16] 1) puppetize the switch of primary from lvs1003 to lvs1016 (which effectively pushes the MED changes so that lvs1003 becomes 0 and lvs1016 becomes 100)
[14:43:25] 2) Merge that with puppet disabled on them.
[14:43:39] lvs1003 --> MED 100 and lvs1016 --> MED 0 :)
[14:44:22] right, backwards above. either way, we have the same basic issue.
[14:44:37] which is we can't have two equal meds advertised at the same time
[14:44:49] so restarting my proposed steps list:
[14:44:55] XioNoX: I'm disabling BGP on lvs1016 BTW
[14:45:48] vgutierrez: for how long? (aka should I disable the router side or leave it for now?)
[14:46:14] 1) puppetize the switch of primary from lvs1003 to lvs1016, which effectively changes MEDs like: lvs1003:0->100, lvs1006:100->100, lvs1016:100->0
[14:46:26] 2) Merge that with puppet disabled on all 3
[14:48:50] ok (3) is really hard to have a sane solution for heh
[14:48:59] bblack: hmm IMHO it makes sense to switch primary as you said + disable BGP on lvs1006
[14:49:00] if it helps I can also add static routes on the routers for the time of the transition
[14:49:13] XioNoX: I think we'll have to use static to transition
[14:49:45] there are two layers of problems: the final dual med=0 can be solved via decomming one of the backups (before or after), so not a big deal.
[14:50:10] but the bigger problem is not having any moment where the old primary and new primary are showing the same MED during the deployment/switch.
[14:50:21] there's no sane way to step through that which doesn't cause an interruption.
[14:50:22] bblack: with puppet disabled we can define one as med=0 and the other one temp. as med=1
[14:50:41] as long as it's < 100 it's going to assume the primary role
[14:50:56] oh that makes sense, duh, use a middling value. It will still be multiple switches of traffic though I think.
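The MED juggling above comes down to standard BGP best-path behaviour: among otherwise-equal routes, the lowest MED wins, and two advertisers tied on the lowest MED only share traffic if the router is doing multipath/ECMP. A toy model of just that tie-break step, to make the "use a middling value" point concrete; the Route type and values are illustrative, and the earlier best-path tie-breakers (localpref, AS-path length, origin) are assumed equal:

```lang=python
# Toy model of the MED tie-break only; localpref, AS-path length, origin etc.
# are assumed equal, as they are for the three pybal peers discussed above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    lvs: str   # advertising pybal instance
    med: int   # MED announced for the service prefix

def preferred(routes: list[Route]) -> list[Route]:
    """Return every route tied on the lowest MED; more than one entry means
    the routers could split traffic (ECMP / cr1-vs-cr2 split decisions)."""
    best = min(r.med for r in routes)
    return [r for r in routes if r.med == best]

# Mid-transition state: one clear winner, no tie.
print(preferred([Route("lvs1003", 1), Route("lvs1006", 100), Route("lvs1016", 101)]))

# The situation being avoided: old and new primary both advertising MED 0.
print(preferred([Route("lvs1003", 0), Route("lvs1006", 100), Route("lvs1016", 0)]))
```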
[14:51:20] (because eventually 1016 gets puppetized to 100 with a restart)
[14:51:25] hmmmm
[14:51:30] nope, no restart involved
[14:51:34] puppet won't restart pybal
[14:51:41] well yes, but we do have to do it eventually
[14:52:18] ok, how about this:
[14:52:30] yup, at some point we should switch lvs1016 from med=1 to med=0 for consistency's sake
[14:54:07] 3) Manually set lvs1016 to med=101 (lower prio than lvs1006, avoids splitting traffic when 1003 goes offline), and restart pybal there
[14:54:46] 4) Manually set lvs1003 to med=1, and restart pybal there (blips traffic to 1006 and back again, but luckily not to 1006+16 due to the above)
[14:55:36] 5) Manually set lvs1016 to med=0, and restart pybal there (flips traffic to 1016, where it will remain). puppetization of 1016 can now happen at any time and is a no-op.
[14:56:26] 6) Switch static fallback routes on routers to lvs1016 now (for good)
[14:56:49] 7) Stop->decom lvs1003 to get it out of the way for good (can't ever start pybal again, and remove from router configs as a peer)
[14:57:26] 8) Sanity restored (I guess we never needed to disable puppet on lvs1006, its config never changed and it never needed a restart)
[14:57:40] so if we decom lvs1003 first, we should probably move lvs1016 to cr1, as lvs1003->cr1 and lvs1006->cr2
[14:57:50] XioNoX: I was about to propose that :D
[14:57:54] yeah
[14:58:23] we could make a different set of instructions that ends up decomming 1006 and leaving 1003 as backup, but it seems saner and less-confusing for now to still have 456 be the backups
[14:58:29] (instead of 345)
[14:59:13] another option is 1) add static routes (with higher priority than BGP) to direct traffic to lvs1016, 2) do whatever MED, etc. changes are needed 3) remove the static routes
[14:59:27] yeah that too
[14:59:45] in that scenario the full steps from my (3) onwards would be:
[15:00:02] 3) Set static route to lvs1016, higher prio than BGP
[15:00:55] oh hmmm, nope
[15:01:19] it can be done that way, but it doesn't end up saving us any trouble with blipping traffic around more than once or having very-short-term ECMP-splitting issues.
[15:01:43] (or cr1/2-split-decision issue, either way)
[15:03:08] router config has been moved from cr2 to cr1
[15:03:54] XioNoX: https://gerrit.wikimedia.org/r/432091
[15:05:34] eventually, I think we should look towards a solution where LVSes are A/A and the router splits on L3/4 hashing for ECMP.
[15:05:40] will take some planning and thinking, though
[15:06:04] wanna switch lvs1003 & lvs1016 now?
[15:07:06] Code Review - Error
[15:07:07] 500 Internal server error
[15:07:07] I guess I can't put emojis in code review comments :)
[15:07:16] hahahah
[15:07:21] nice
[15:08:01] vgutierrez: now's as good a time as any. push the cr1 thing first of course
[15:08:05] relocating to the coffee shop next door, will be back in less than 10 if you need me
[15:08:17] maybe wait for X to get back JIC
[15:08:43] XioNoX: ack :D
[15:09:55] so when XioNoX is back let's validate that lvs1016 speaks BGP with cr1-eqiad as expected and then let's switch :D
[15:10:46] I feel like I have a duty here to point out the scare-factor in this set of changes, in order to instill sufficient self-discipline.
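On the active/active idea floated at 15:05 (router splitting on L3/4 hashing for ECMP): hashing on the flow's 5-tuple, rather than spraying packets at random, means every packet of a given TCP connection keeps hitting the same LVS, which is what a stateful balancer needs. A minimal sketch of that kind of flow hashing; the hash function, field choice and next-hop names are illustrative, not what the routers actually compute:

```lang=python
# Minimal sketch of L3/4 flow hashing for ECMP: every packet of one TCP
# connection maps to the same next-hop, so the chosen LVS keeps seeing the
# whole flow. The hash function and next-hop names are illustrative only.
import hashlib

NEXT_HOPS = ["lvsA", "lvsB"]  # hypothetical active/active pair

def pick_next_hop(src_ip: str, dst_ip: str, sport: int, dport: int,
                  proto: int = 6) -> str:
    key = f"{src_ip}|{dst_ip}|{sport}|{dport}|{proto}".encode()
    index = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(NEXT_HOPS)
    return NEXT_HOPS[index]

# Same flow -> same LVS for every packet:
assert pick_next_hop("203.0.113.7", "10.2.2.11", 54321, 443) == \
       pick_next_hop("203.0.113.7", "10.2.2.11", 54321, 443)
# Different flows spread across both next-hops, without breaking established
# connections the way an uncoordinated/random split would.
print(pick_next_hop("203.0.113.7", "10.2.2.11", 54321, 443))
print(pick_next_hop("198.51.100.23", "10.2.2.11", 40000, 443))
```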
[15:11:14] if we break eqiad "low-traffic" LVS group's routing at any point, even for just a few seconds, it will break virtually everything that happens in our infra aside from edge content cache hits :P
[15:11:23] bblack: hey, I managed to reimage all our lvs without breaking anything :P
[15:11:27] (all MW fails, all services fail, etc)
[15:12:00] if we temporarily end up with an ECMP traffic split, it will effectively be an outage
[15:12:09] (and of course now I've jinxed it)
[15:12:58] and the blips in the above steps (the brief flip from 1003->1006->1003, then later the permanent switchover from 1003->1016) will all cause some hopefully-very-minor performance spikes probably, as some connections will fail and immediately reconnect or whatever.
[15:13:52] (so we should probably pause ~5 minutes between steps 4+5 above)
[15:20:54] it would be interesting as a possible future direction to explore, to try to chash-split at every layer and keep most traffic flows row-local in a core DC where possible.
[15:21:38] why would ecmp cause an outage?
[15:22:05] XioNoX: let me know when I can test lvs1016<-->cr1-eqiad :)
[15:22:10] vgutierrez: let me know when bgp is enabled on 1016, looking at cr1
[15:22:12] eh
[15:22:13] hahah
[15:22:36] XioNoX: because LVS pays attention to tcp states. if traffic is splitting randomly over 2xLVS, with the rest of our current setup, it will break things I'm pretty sure.
[15:23:08] XioNoX: May 09 15:22:55 lvs1016 pybal[10568]: [bgp.BGPFactory@0x7fe363046638] INFO: BGP session established for ASN 64600 peer 208.80.154.196
[15:23:12] XioNoX: looking good here
[15:23:32] yup, prefixes are there, with a med of 100
[15:23:38] lovely
[15:23:55] so let's stop bgp on lvs1016 and begin the switch as explained above
[15:24:53] (let me downtime lvs1003 and lvs1006 too)
[15:25:18] 1006 isn't doing anything
[15:25:29] should be able to stay online and puppet-enabled there and just not touch it
[15:25:36] right :)
[15:25:47] vgutierrez: ping if you need me to sanity check anything, or do anything :)
[15:26:13] really, none of them should fail in the icinga sense, until we get to the part about decomming lvs1003 without ever bringing its pybal back online in (7)
[15:28:21] I have shells open on all 3 to stare at things, but planning to be readonly
[15:28:24] FYI
[15:28:29] 1. Setting med=101 @ lvs1016 :)
[15:29:32] uh
[15:29:37] wut?
[15:29:56] 1+2 was disable puppet on 3+16 and merge the puppet change of primaries
[15:30:09] damn scrolling
[15:30:21] 3 is med=101 on lvs1016 + restart (but it's ok to mix them up so far!)
[15:30:24] puppet is disabled in lvs1016 though :)
[15:30:37] let me merge that for sanity's sake :D
[15:30:47] to repeat, simpler and with less rambling:
[15:31:03] 1. disable puppet on 3+16, merge puppet change of primary LVS from 3->16
[15:31:13] 2. set med=101 on 16, restart pybal
[15:31:36] 3. set med=1 on 3, restart pybal (blips traffic 3->6->3)
[15:31:39] 4. wait 5 minutes
[15:31:56] 5. set med=0 on 16, restart pybal (flips traffic 3->16 for good)
[15:32:25] [can enable+run puppet on 16 now or anytime later, should be a no-op]
[15:32:49] 6. flip router static fallback routes from 3->16
[15:33:14] 7. stop pybal on 1003, remove 1003 from router peer configs, etc... decom host without ever bringing this pybal back into active bgp again
[15:34:08] I think that's it (sorry numbers changed again!)
[15:36:59] bblack: let's enable BGP in lvs1016 as part of step 1 as well :)
[15:37:14] yeah that would definitely help :)
[15:38:10] 1 --> puppet disabled in lvs1003 and 1016, puppet change merged
[15:38:30] 2 --> med=101 on 1016, pybal restarted, done as well
[15:39:03] (sanity check)
[15:39:04] [BGP/170] 00:09:49, MED 101, localpref 100
[15:39:04] AS path: 64600 I, validation-state: unverified
[15:39:04] > to 10.64.49.16 via ae4.1020
[15:39:14] MED 101 as expected :D
[15:39:22] ok
[15:40:44] 3. med=1 @ lvs1003 and pybal restart
[15:41:24] and lvs1006 is showing some small connection counts which are now stable (will eventually go back to zero)
[15:41:27] again.. sanity check
[15:41:31] 10.2.2.11/32 *[BGP/170] 00:00:21, MED 1, localpref 100
[15:41:31] AS path: 64600 I, validation-state: unverified
[15:41:31] > to 208.80.154.57 via ae1.1001
[15:41:31] [BGP/170] 00:12:09, MED 101, localpref 100
[15:41:31] AS path: 64600 I, validation-state: unverified
[15:41:33] > to 10.64.49.16 via ae4.1020
[15:41:34] which is evidence of the 3->6->3 blip
[15:41:51] MED --> 1 from lvs1003
[15:42:04] hopefully in the time it took 3's pybal to restart bgp, many connections didn't notice what was going on because they were momentarily idle :)
[15:42:35] * vgutierrez time.sleep(300)
[15:42:39] so now we wait ~5 minutes for any minor fallout to clear itself
[15:42:42] :)
[15:47:22] * vgutierrez awake
[15:47:34] ok
[15:47:38] so.. med = 0 in lvs1016
[15:47:50] restarting pybal
[15:48:46] another sanity check
[15:48:46] 10.2.2.11/32 *[BGP/170] 00:00:05, MED 0, localpref 100
[15:48:46] AS path: 64600 I, validation-state: unverified
[15:48:47] > to 10.64.49.16 via ae4.1020
[15:48:56] MED 0 seen by cr1-eqiad
[15:50:23] yeah connections seem to be draining off of lvs1003 ipvs lists, and 1016 is picking up traffic
[15:50:31] yup
[15:50:48] XioNoX: could you please switch the static route from 3 to 16?
[15:52:25] (the normal fallback one, not a new high prio one)
[15:52:38] right :)
[15:52:48] done
[15:53:05] awesome
[15:53:18] reenabling puppet in lvs1016 and getting rid of the icinga downtime there
[15:54:22] puppet noop confirmed on lvs1016
[15:54:44] (it actually trimmed a couple of trailing spaces)
[15:55:04] so now just stop lvs1003's pybal and leave it downtimed and puppet-disabled too
[15:55:46] and then xionox can remove the lvs1003 BGP peerings from crX, and you can merge changes to remove lvs1003 from the set of low-traffic LBs and switch it back to role(spare::system)
[15:56:19] maybe uninstall the pybal package before re-puppeting in general, so it can't try to come back from the dead.
[15:56:34] hmmm I can even reimage it as stretch - spare
[15:56:48] right, probably should, to clear cruft
[15:56:57] pybal stopped @ lvs1003
[15:57:03] 208.80.154.57 removed from router's config
[15:57:11] puppet is disabled, so I'm disabling bgp manually on lvs1003
[15:57:14] (after confirming BGP was down)
[15:57:38] nice :D
[15:57:42] looks like nothing/nobody else noticed what we did, so I'd call that a success :)
[15:58:07] hmmm XioNoX could you check that you are not seeing any interface errors on lvs1016?
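The sanity checks above are eyeballed from the router's `show route` output; if one wanted to script that verification, the MED is easy enough to pull out of exactly those lines. A small sketch, written against the output format pasted above only (the sample text is the 15:48 check), purely as an illustration:

```lang=python
# Extract MED values from Junos "show route" output lines like the ones
# pasted above, and assert the expected primary. The sample text is copied
# from the 15:48 sanity check; the helper itself is just an illustration.
import re

output = """\
10.2.2.11/32 *[BGP/170] 00:00:05, MED 0, localpref 100
  AS path: 64600 I, validation-state: unverified
  > to 10.64.49.16 via ae4.1020
"""

def meds_by_next_hop(text: str) -> dict[str, int]:
    """Map 'to <ip> via <iface>' next-hops to the MED announced for them."""
    meds = {}
    pending = None
    for line in text.splitlines():
        med_match = re.search(r"MED (\d+)", line)
        if med_match:
            pending = int(med_match.group(1))
        hop_match = re.search(r"> to (\S+) via", line)
        if hop_match and pending is not None:
            meds[hop_match.group(1)] = pending
            pending = None
    return meds

print(meds_by_next_hop(output))                              # {'10.64.49.16': 0}
assert meds_by_next_hop(output).get("10.64.49.16") == 0      # lvs1016 is primary
```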
[15:58:18] like the ones you detected on lvs2004
[15:58:41] our strategy if anything goes wrong initially with lvs1016 here is to stop pybal on 16 and let lvs1006 take over
[16:00:08] vgutierrez: so far so good
[16:00:16] awesome, thx <3
[16:01:08] librenms will send an email if there are any
[16:03:00] SO
[16:03:04] SMOOTH
[16:09:12] bblack: https://gerrit.wikimedia.org/r/#/c/432116/ that should be it
[16:14:09] vgutierrez: +1 +nitpick, but the nitpick can wait, either way
[16:15:18] yup, I saw that but didn't want to mess with codfw; if we are replacing those as well, it makes sense :)
[16:17:47] bblack: lvs1004-1006 are staying, right?
[16:18:20] or what's the final picture in eqiad?
[16:18:57] the final picture in eqiad is lvs1013-16 are the only LVSes. and in codfw, 2001-6 will all be replaced by 2007-12 (still early in procurement process)
[16:19:09] ok
[16:19:12] err sorry, that's confusing
[16:19:17] the final picture in eqiad is lvs1013-16 are the only LVSes. and in codfw, 2001-6 will all be replaced by 2007-10 (still early in procurement process)
[16:19:56] both are moving from a 3+3 config to a 3+1 config (and the new 4 are one-per-row, the others were grouped 3 each in 2 rows, none in the other rows)
[16:21:02] the 3+1 layout for which-traffic-goes-where will be similar to the new 2+1 we have in ulsfo+eqsin. The primaries will each have one traffic class on them by default, and the single backup has definitions+capabilities as failover for any/all of them.
[16:21:31] nice :D
[16:22:12] the current lvs10(0[12456]|16) state is just a temporary hack, because it was a priority to move the primary "low-traffic" role to better hardware.
[16:23:06] it kills one big question-mark that pops up over and over when investigating tricky anomalies: whether the crappy low-end 1Gbps interface on lvs1003 and/or the various ethernet queues touching it were causing loss/delay during small spikes.
[16:23:48] (the high-traffic ones aren't nearly as swamped with pps or bps, so that's a lesser concern there)
[16:25:02] I've extended lvs1003's downtime for a week to avoid noise; tomorrow morning it will be reimaged anyway :)
[16:27:00] sounds good :)
[16:27:39] it's beer o'clock now, have a nice day!
[16:27:55] you too :)
[19:10:24] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4194981 (10ayounsi)
[21:47:44] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4195489 (10ayounsi)
[22:05:39] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4195555 (10ayounsi) Thanks for unblocking that. Let's aim to do the move on Thursday May 24th, morning east coast time. Those server types will suffer a few...
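As a closing illustration of the 3+1 layout described at 16:21 (each primary owns one traffic class by default, one shared backup can cover any of them), here is a tiny sketch of the failover logic. Which host ends up with which class is not stated above, so the class names' mapping below is purely hypothetical:

```lang=python
# Illustrative sketch of a 3+1 LVS layout: one primary per traffic class,
# a single shared backup able to take over any class. The class-to-host
# mapping here is hypothetical, not the real eqiad/codfw assignment.
TRAFFIC_CLASSES = {
    "high-traffic1": {"primary": "lvsA", "backup": "lvsD"},
    "high-traffic2": {"primary": "lvsB", "backup": "lvsD"},
    "low-traffic":   {"primary": "lvsC", "backup": "lvsD"},
}

def active_lvs(traffic_class: str, down: frozenset[str] = frozenset()) -> str:
    """Pick the LVS that should carry a class, given the set of downed hosts."""
    cfg = TRAFFIC_CLASSES[traffic_class]
    if cfg["primary"] not in down:
        return cfg["primary"]
    if cfg["backup"] not in down:
        return cfg["backup"]
    raise RuntimeError(f"no LVS left to serve {traffic_class}")

print(active_lvs("low-traffic"))                            # lvsC
print(active_lvs("low-traffic", down=frozenset({"lvsC"})))  # shared backup takes over
```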