[00:23:15] HTTPS, Traffic, Browser-Support-Firefox: Allow option to uncheck "Always use a secure connection when logged in" at login.wikimedia.org/wiki/Special:Preferences - https://phabricator.wikimedia.org/T71319#4016186 (TBolliger)
[00:40:45] ema: I ended up doing the standard discard-all-actives manually everywhere on all fe+be over some time. nothing went wrong.
[00:41:24] ema: it is notable that some (esp upload frontends?) have a lot of ancient VCLs that stay "warm" forever though. maybe a bug, maybe a symptom of way-too-long-lived client connections somehow (or nginx->varnish reused conns)?
[02:46:44] Traffic, DNS, Operations, Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4016446 (Prtksxna) >>! In T188362#4013204, @Aklapper wrote: > Curious: @Prtksxna, can you access that?) Nope. I don't see "Vis...
[05:24:57] Domains, Traffic, Operations, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (Prtksxna) >>! In T185282#4015968, @Bawolff wrote: > Speaking of which, if the code already exists, you should request t...
[10:43:14] Traffic, Operations: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#4017007 (fgiunchedi) Open→Resolved a: fgiunchedi I mentioned this task and problem to a friend working in SRE networking, we're now receiving about one tenth of the icmp traffic inbou...
[10:49:51] godog: \o/
[10:55:07] bblack: prep work (hopefully) done: https://gerrit.wikimedia.org/r/#/c/415835/ :D
[11:42:12] vgutierrez: looks awesome :)
[11:42:24] godog: that was easy :)
[11:47:44] bblack: yeah, so much fuss about this icmp thing and here comes godog and fixes it with a brief chat
[11:55:38] this is so cool
[11:55:40] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=68&fullscreen&edit&orgId=1&var-server=cp4021&var-datasource=ulsfo%20prometheus%2Fops
[11:55:48] added NUMA hits vs. misses per node
[11:56:13] with numa_networking:on there seems to be 0 misses on node 0
[11:57:45] see in particular cp4022, on which numa networking was enabled a few hours ago: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=68&fullscreen&edit&orgId=1&var-server=cp4022&var-datasource=ulsfo%20prometheus%2Fops
[11:59:19] yeah but all the hits turned into misses on node1 :)
[11:59:24] (for 4022 anyways)
[11:59:54] yeah I was looking at the half-full glass
[12:00:16] anyways, I don't know that their stats will be fully-reflective of benefits until they reboot anyways
[12:00:38] as the varnishd procs were started and allocated most of their ram (or fs cache as the case may be) under the previous regime
[12:01:48] right, we'll get a clearer picture after reboots for kernel upgrades
[12:02:27] I might push an auto-discard patch for misc today. I did all of A:cp yesterday manually (as in manually with vcl.list+awk) though and there were no issues fe or be.
[12:02:40] just still seems dumb to turn it on for text or upload heading into a weekend
[12:02:41] ok
[12:02:52] yeah let's limit it to misc :)
[12:03:24] the more-interesting result of that is just how many warm VCLs linger in places
[12:04:17] you mentioned that they stay warm "forever"
[12:04:31] can you quantify that? :)
[12:05:06] if it's way longer than any of our timeouts, then I'd think of a bug
[12:06:44] https://phabricator.wikimedia.org/P6775
[12:07:04] note the ones with -root- in the name predate the reload-vcl update
[12:07:28] they've all been discarded now, by me yesterday, but they're still warm (were before discard, too)
[12:07:51] I think the worst of them had a number like ~1300 before, so they may eventually be working their way towards going cold
[12:08:22] re "way longer than any of our timeouts", there are probably situations in which no timeout applies, which might particularly be the case with long-lived nginx->varnish persistent conns?
[12:09:18] I don't know for sure why all those VCLs are held up in "warm"
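[editor's note: a rough Python equivalent of the "vcl.list+awk" discard described above, added for clarity. This is a hedged sketch, not the actual reload-vcl script: it assumes Varnish 4-style vcl.list output (status, temperature, busy count, name) and simply discards every VCL that isn't the active one. As the log notes, discarded VCLs that still hold references stay listed, and warm, until those references drain.]

    import subprocess

    def discard_inactive_vcls(extra_args=()):
        """Discard all non-active VCLs, roughly what the vcl.list+awk
        one-liner did. extra_args could target a specific instance,
        e.g. ("-n", "frontend") if that's how the instance is named."""
        cmd = ["varnishadm"] + list(extra_args)
        out = subprocess.check_output(cmd + ["vcl.list"]).decode()
        for line in out.splitlines():
            fields = line.split()
            if len(fields) < 4 or fields[0] == "active":
                continue  # skip malformed lines and the active VCL
            subprocess.call(cmd + ["vcl.discard", fields[-1]])

    if __name__ == "__main__":
        discard_inactive_vcls()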
[12:17:56] I could try with the particularly nasty live case of cp4023, seeing if restarting nginx under a depool has any effect
[12:18:03] it would at least narrow the scope of possible boogs
[12:18:40] trying that...
[12:19:52] err wait, don't even have to depool, just do the nginx upgrade thing
[12:20:20] (and wait for old workers to drain off and shut down)
[12:22:39] would old workers actually die off after a while in case of client->nginx long-lasting persistent conns?
[12:22:47] but really, those nginx workers were only ~6h old anyways from the numa stuff, now that I think about it
[12:23:28] I've seen nginx workers hang around for quite a while on upload before, but usually they eventually die off. I don't think I've seen them stick around on the scale of ~1h or more.
[12:24:04] a few minutes later, there are 14/48 old nginx workers remaining
[12:24:13] now 12
[12:27:45] ema, regarding releasing pybal 1.15, I think that we could include the BGP logging + expose BGP status over prometheus, that will allow us to merge the icinga check
[12:27:51] of course on Monday
[12:30:26] patch for review :)
[12:31:13] vgutierrez: was there a reason why you didn't also include the tcp ports alongside the ips in the bgp metrics?
[12:31:18] not always available or something?
[12:32:39] indeed
[12:33:05] that + ema mentioning that the client-side port could mess a little bit with the indexes in prometheus
[12:33:09] vgutierrez: I think we should cut a 1.15 with everything currently in master, unless there's a valid reason to avoid doing so
[12:33:23] ok
[12:33:28] I was playing it safe...
[12:33:45] I don't want to break anything xD
[12:33:54] master is meant for non-breaking changes :)
[12:33:56] see README
[12:36:06] the warm VCL refcounts didn't change (at all) as a result of nginx workers all recycling on cp4023, so there goes that.
[12:36:41] and I don't think we have the kind of connection rates (much less persistence!) on the direct port 80 conns to account for those numbers in general
[12:37:06] so it's hard to imagine varnish is holding these over any kind of legitimate client-side persistence from its pov
[12:42:22] pybal's idleconnection?
[12:42:52] surely those die after a while?
[12:43:01] but yeah I donno
[12:43:05] vgutierrez: since bgp_session_state_count has local and peer address labels, it should never go above 1, correct?
[12:43:12] anything else would be a bug in pybal (which may exist in fact)
[12:44:45] I can see the pybal connections to port 80 on cp4023, there's only one estab and several in timewait, seems likely they recycle pretty routinely
[12:46:01] mark: nope
[12:46:07] actually the port 80 pybal idleconns seem to recycle about every 3 seconds, so maybe they're not working in an ideal way anyways
[12:46:11] but definitely not holding up vcls
[12:46:16] ok
[12:46:30] it never goes above 1 once a session achieves the ESTABLISHED state
[12:46:48] before that... well :)
[12:46:52] it could be 2
[12:46:55] yup :)
[12:47:30] or even more if some peer is pretty aggressive regarding TCP connections to achieve a BGP session
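[editor's note: an illustrative sketch of the metric under discussion, not pybal's actual implementation. It assumes the prometheus_client library; the metric name comes from the log but the label names are made up. The point discussed above: with local and peer addresses as labels, and TCP ports deliberately left out for the cardinality reason ema raised, a healthy peering should report at most one ESTABLISHED session, briefly 2 (or more) only while active and passive connections race.]

    from prometheus_client import Gauge

    # Hypothetical labels mirroring pybal's bgp_session_state_count.
    bgp_session_state_count = Gauge(
        "bgp_session_state_count",
        "Number of BGP FSMs per state for each peering",
        ["local_ip", "peer_ip", "state"],
    )

    def export_fsm_states(fsms, local_ip, peer_ip):
        """Count FSMs per state for one peering and export the totals."""
        counts = {}
        for fsm in fsms:
            counts[fsm.state] = counts.get(fsm.state, 0) + 1
        for state, n in counts.items():
            bgp_session_state_count.labels(local_ip, peer_ip, state).set(n)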
[12:47:49] BTW, this already helped me discover a strange behaviour in pybal-test2001
[12:48:09] opening a bug is on my TODO, but I wanted to check it with you first
[12:49:48] mark: https://phabricator.wikimedia.org/P6776
[12:50:24] what about it?
[12:50:33] this happens every time on pybal-test2001: if you restart pybal fast enough, the quagga instance initiates a TCP connection to pybal:179 and that session achieves the ESTABLISHED state
[12:50:38] after that happens
[12:50:49] you get that output in pybal
[12:50:54] with one FSM in state: ACTIVE
[12:51:31] but a traffic capture doesn't show any new TCP connection to 10.192.16.140:179
[12:53:06] so yes
[12:53:12] the incoming (passive) side gets established
[12:53:18] but the active protocol instance also exists
[12:53:20] and keeps trying to connect
[12:53:25] fails, and keeps reconnecting
[12:53:31] it says it keeps trying
[12:53:42] but it doesn't even a attempt a 3way handhsake
[12:54:08] (anesthesia induces typos apparently)
[12:54:19] def connectRetryEvent(self, protocol):
[12:54:19]     """Called by FSM when we should reattempt to connect."""
[12:54:19]     self.connect()
[12:54:38] i believe some routers don't even accept a connection from a peer they already have a session with
[12:54:40] maybe quagga is the same
[12:54:43] but that's just me speculating
[12:55:02] and that would not trigger the collision code in pybal I guess
[12:55:06] right.. but a tcpdump should show pybal calling to quagga, right?
[12:55:15] yes, at least one packet
[12:55:40] I'll repeat the tests and get pcaps this time
[12:55:48] cause I'm talking from memory right now
[12:56:18] self.report("(Re)connect to %s" % self.peerAddr, logging.INFO)
[12:56:18] if self.fsm.state != ST_ESTABLISHED:
[12:56:18]     reactor.connectTCP(self.peerAddr, PORT, self)
[12:56:22] well partially I am too
[12:56:30] and I wrote this code 10 years ago so my memory has faded a bit ;-p
[12:56:33] but anyway
[12:56:38] this is the top of the connect method
[12:56:49] and it doesn't check whether the tcp connection is already established
[12:56:49] yep
[12:57:07] so perhaps the connectRetryEvent fires even though it is established
[12:57:19] and twisted decides it shouldn't connect
[12:57:21] then again
[12:57:25] oh I see
[12:57:25] why wouldn't twisted just open a new connection
[12:57:31] yes that's what it would do I guess
[12:57:32] hmm
[12:57:54] also the DEBUG messages indicate at least some connection gets closed
[12:58:02] ok
[12:58:04] i'll look into it more
[12:58:28] let me get the pcaps first, I don't want you to lose time over this
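[editor's note: a simplified, hypothetical illustration of the failure mode being narrowed down here, not real pybal code. The quoted connect() only consults its own FSM's state, so a leftover active-side FSM, whose session was actually won by the passive (incoming) connection, keeps firing connectRetryEvent forever without ever reaching ESTABLISHED. The established_elsewhere flag is invented for the sketch; a real fix would have to consult the other protocol instance for the same peering.]

    ST_ACTIVE = "ACTIVE"
    ST_ESTABLISHED = "ESTABLISHED"

    class PeeringSketch(object):
        def __init__(self, peer_addr):
            self.peerAddr = peer_addr
            self.state = ST_ACTIVE             # leftover active-side FSM
            self.established_elsewhere = True  # passive connection won the race

        def connectRetryEvent(self):
            # Upstream behaviour: unconditionally self.connect().
            # Hypothetical guard: stand down if another protocol instance
            # already carries an ESTABLISHED session for this peer.
            if self.established_elsewhere:
                return
            self.connect()

        def connect(self):
            # Mirrors the quoted code: only checks *this* FSM's state.
            if self.state != ST_ESTABLISHED:
                print("(Re)connect to %s" % self.peerAddr)
                # real pybal: reactor.connectTCP(self.peerAddr, 179, self)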
[12:59:17] ema: I tried digging in varnish source about what holds refs on aging VCLs. The answer is exactly as I expected: a few different things, in complicated ways, all over the place, with code that looks ripe for leaking references, and which I will never understand.
[12:59:54] it's notable that busyobjects are one of the things that hold such references, though. I wasn't expecting that.
[13:00:22] (because they're related to the backend they're being created for a fetch through, and backends are part of VCLs)
[13:04:12] (getting the pcap now... waiting to reach IDLE state on the second FSM)
[13:05:24] vgutierrez: anyway, i think in general what's going wrong here is that there can be multiple FSMs in existence for a single peering, all with their own timers potentially firing reconnect events etc
[13:05:50] that can really mess things up unless treated VERY carefully
[13:08:57] ema: ahah: https://github.com/varnishcache/varnish-cache/issues/2228
[13:09:52] mark: yep
[13:09:58] # Hand over the FSM
[13:09:58] protocol.fsm = self.fsm
[13:09:58] protocol.fsm.protocol = protocol
[13:09:58] # Create a new fsm for internal use for now
[13:09:58] self.fsm = BGPFactory.FSM(self)
[13:09:58] self.fsm.state = protocol.fsm.state
[13:10:02] that's probably part of the problem
[13:12:10] i feel like ripping out all this stuff and making it either entirely active or entirely passive ;)
[13:12:32] ema: so in general, varnish just has this unresolved problem. even if no other bugs, if we have tons of idle threads and we swap VCLs while some of them are active, and they later go inactive (these threads not needed for any live requests for a long time), the reference never gets dropped.
[13:13:27] ema: it might reduce the scope of the problem if we didn't have such a high minimum thread count, but at the end of the day it's either an upstream bugfix or routine restarts (which we don't do for frontends, which is why they eventually pile up there) to completely eliminate it.
[13:14:21] mark: yep... that was mentioned by bblack
[13:14:51] taking into account that we control both sides, he suggested changing the routers' behaviour to be always passive or always active
[13:14:53] yeah it's not an awful idea for our use-case here of pybal<->router, to just be active-only and say to configure the router side as passive-only.
[13:15:21] (or even if the router can't be configured that way, I guess ignoring its SYN is sufficient and you arrive at a sane state)
[13:15:50] yes
[13:15:55] and even with the existing code that would solve the issue
[13:15:58] just not the bug ;)
[13:16:06] but yes, you can configure routers that way
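[editor's note: a minimal Twisted sketch of the "entirely active" option floated above, under stated assumptions and not a pybal patch. If pybal only ever initiates connections, and the router is passive-only (or its SYNs are simply ignored or refused), only one FSM per peering can exist and the duplicate-FSM race above cannot arise.]

    from twisted.internet import protocol, reactor

    class RefuseInboundBGP(protocol.Protocol):
        """Drop any inbound BGP connection immediately, so the only
        session is the one we initiate ourselves (active-only)."""

        def connectionMade(self):
            self.transport.loseConnection()

    class RefuseInboundFactory(protocol.Factory):
        protocol = RefuseInboundBGP

    # Only needed if something must listen on 179 at all; otherwise
    # simply not listening has the same effect:
    # reactor.listenTCP(179, RefuseInboundFactory())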
[13:26:39] vgutierrez: re latest patch, how about we add the peer address to log lines from (only) the BGP class?
[13:29:04] I mean, protocol instances are only made when connections exist so presumably that info is available
[13:29:10] and even if not, then we skip it
[13:30:55] and maybe we can create debug lines containing instance addresses when making e.g. the FSMs etc so we can cross-reference them from log output
[13:31:19] right now we just know when two or more separate instances log something, but what peering session they're tied to remains a guess
[13:33:37] hmmm we have two peering sessions per ASN right?
[13:33:57] one for active connections and another for passive
[13:34:37] so if we log ip:port for one side, we'll know the peering session
[13:34:59] keep in mind also, in case it affects any of this or the debug output issues: I'm not sure which branch the code is on (if any yet), but one of the upcoming feature-reqs is for pybal to have BGP sessions with multiple distinct peers as well (as in 1x pybal advertises its stuff directly to 2x distinct routers).
[13:37:31] bblack: that's already written and in master
[13:37:34] i did that a while ago
[13:37:41] and we're doing this on top of master
[13:37:45] ok cool
[13:37:51] that's probably already in production
[13:38:34] vgutierrez: the remote one then, as the local side should be the same for all peerings
[13:38:45] and is fixed by config anyway, so not that interesting
[13:38:48] mark: no, it's not in our production releases of 1.14.x yet, just in master
[13:38:56] ok, could be
[13:38:57] (ditto for per-service MED)
[13:40:19] yeah, per-service MED is also done
[13:41:17] and all this is also why we should add the peer address to log lines; before, pybal only ever had one peer, so a passive and an active connection at most
[13:41:22] you could mostly work that out
[13:41:25] with multiple peers... :)
[13:41:36] yeah
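[editor's note: a sketch of the logging convention agreed on here, using hypothetical helper names rather than the actual pybal patch. It prefixes BGP-class log lines with the remote peer's ip:port (available from Twisted's transport.getPeer() once a connection exists) and, per the suggestion above, logs instance addresses at creation time so later messages can be cross-referenced.]

    import logging

    logger = logging.getLogger("bgp")

    def peer_prefix(transport):
        """Best-effort '[ip:port] ' prefix; peerings without a live
        connection yet get a placeholder instead of breaking the log call."""
        try:
            peer = transport.getPeer()
            return "[%s:%d] " % (peer.host, peer.port)
        except Exception:
            return "[peer unknown] "

    def report(transport, msg, level=logging.INFO):
        logger.log(level, peer_prefix(transport) + msg)

    def log_fsm_created(transport, fsm):
        # Cross-referencing aid: tie the FSM's instance address to a peering.
        report(transport, "created FSM %#x" % id(fsm), logging.DEBUG)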
[13:42:19] once multi-peer + per-service MED is in, we can go all out on re-designing the puppet-level layout of things to be a bunch more redundant and not have so many idle LVSes.
[13:44:00] https://phabricator.wikimedia.org/T165764 has the rough sketch of how, for the per-service MED stuff
[13:44:04] yeah that's why i did it, considering we're buying a bunch of lvs servers
[13:44:28] having them all talk to both routers just makes sure LVS-redundancy and router-redundancy are independent (losing something from one set doesn't negate any redundancy in the other)
[13:45:08] yeah all the new LVS order counts are modeled on that. 4x lvs for core sites, 3x lvs for edges (so there's always one idle spare still).
[13:46:48] in theory we could reduce those to 3x and 2x, but it's still nice to have at least one spare doing no traffic, and have no combining of normally-separate traffic in a 1-server-dead scenario.
[13:46:55] makes maintenance less risky if nothing else.
[13:47:14] yes
[13:47:31] i don't think we need to cheap out on the most critical boxes in our network handling ALL our traffic ;)
[13:47:39] :)
[13:48:10] yeah the reduction is just to remove pointless excess spares under the new model
[13:48:24] (where they're all potentially redundant for each other without reconfig)
[13:58:27] bblack: so the varnish ticket was closed, but the problem still there? :)
[13:59:18] yeah :)
[13:59:30] back in april it was closed with "Timing this issue out until I hear about this on the other support channel."
[13:59:37] then someone commented again about reporting it in Nov, still closed
[14:00:05] >> Response on ticket was "yes, we are aware of the issue, with no eta on a fix"
[15:49:04] vgutierrez: did you get that pcap?
[15:49:10] (not urgent, i have training starting now)
[17:21:32] Wikimedia-Apache-configuration, Operations, User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4018471 (MoritzMuehlenhoff) p: Triage→Normal
[17:21:45] bblack: what do you think we should set as max_core_rtt for labs hosts?
[17:22:23] puppet is currently broken on my test cache server in labs because we don't have any default for the setting
[17:24:37] not that the setting itself makes much sense there anyways, as we don't do request routing labs->codfw for instance
[17:29:35] bblack: https://gerrit.wikimedia.org/r/415900 ?
[17:44:54] no idea, sure, 0 :)
[17:44:57] I donno :)
[17:45:40] CVE ID : CVE-2017-5660 CVE-2017-7671
[17:45:40] Several vulnerabilities were discovered in Apache Traffic Server
[17:45:50] good thing we're not using it yet! :)
[17:46:01] ;)
[18:54:02] Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4018837 (faidon)
[18:55:32] Traffic, ORES, Operations, Scoring-platform-team, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#4018842 (greg) stalled→Resolved a: BBlack >>! In T179156#3782508, @BBlack wrote: > No, we never made an incident r...
[18:58:42] Traffic, DNS, Operations, Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4018856 (Dzahn) done! https://transparency.wikimedia.org (as always) https://transparency.wikimedia.org/private (now removed...
[18:59:19] Traffic, DNS, Operations, Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4018859 (Dzahn) Open→Resolved
[19:36:33] Traffic, DNS, Operations, Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4018972 (APalmer_WMF) Thank you so much, @Dzahn!
[20:15:39] Traffic, DNS, Operations, WMF-Communications, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4019085 (greg)
[20:16:01] Traffic, DNS, Operations, Release-Engineering-Team, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4019086 (greg)