[08:41:40] 10Wikimedia-Apache-configuration, 06Operations, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2375417 (10Joe) The bug I mentioned earlier was for the `AH01070` and has been...
[12:32:23] 10Traffic, 10Analytics, 06Operations, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2375979 (10BBlack) I merged the above, which just un-sets Set-Cookie, but we may want/need to look at this deeper and disabling the setting of the cookies in the first...
[12:34:01] 10Traffic, 10Analytics, 06Operations, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2375991 (10BBlack) (also, all the same probably applies to maps.wm.o tile requests (which is almost all requests there, but not the leaflet/css/js fetches?), which coul...
[13:20:26] 10Traffic, 10DBA, 06Labs, 06Operations, 10Tool-Labs: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376097 (10jcrespo)
[13:56:48] bblack: the new varnishkafka version seems to have fixed the problem that we had with the data consistency check in analytics, but something else came up, namely requests logged by vk with timestamp "-" (that should be the default value if vk doesn't find the tag). It is happening in misc and for very few requests (mostly etherpad or something like /mediawiki/1.26/mediawiki-1.26.3.tar.gz). I suspect that when some timeout occurs between varnish and a client, the SLT_Timestamp "Resp" tag is not logged. Would that be possible?
[13:58:23] elukey: not sure
[13:59:13] elukey: one of the VCL-level changes observed, though, was that in varnish3 all response traffic went through vcl_deliver, whereas now synthetic-error cases skip vcl_deliver and only use vcl_synth.
[13:59:42] elukey: so what you're observing may be a mirror of that pattern. perhaps SLT_Timestamp "Resp" only happens in the vcl_deliver-like path
[14:04:28] bblack: thanks! Will try to run varnishlog and hope to get a valid log that will prove my point, but it might take a while. We are going to exclude these requests from our consistency check but we'll keep counting them to avoid masking any issue
[14:12:49] bblack: while working on T137114 we found that a few machines in misc are running vhtcpd (and I understand they shouldn't)
[14:12:50] T137114: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114
[14:12:57] should we stop it / remove it?
[14:19:25] ema: we should in the long run. it's leftover unpuppetized stuff from switching the machines' roles without rebooting, I think. but if it helps testing, leave it in until you're done testing with it.
[14:20:11] misc probably should have a vhtcpd on a separate multicast address in the long run, but so far we don't have any cache_misc backends that send their own PURGE traffic flows either, so it's kinda pointless for now in practice.
[14:20:46] well, on one hand, having vhtcpd running has been useful to discover the varnishlog4 issue :)
[14:21:23] on the other hand that specific problem is fixed now, so I wouldn't think we need to keep vhtcpd running really
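A minimal sketch of the varnishlog check elukey mentions wanting to run above, assuming Varnish 4's VSL query language is available and that a bare record spec in a query matches on the tag's presence; the exact query is an educated guess at how the missing "Resp" timestamp would show up, not a confirmed recipe:

```sh
# Group records per request and keep only transactions that never log a
# "Resp" timestamp entry. A bare record spec in a VSL query matches if the
# record exists, so "not" should select transactions missing the tag.
# Print the URL, response status, and whatever timestamps were logged.
varnishlog -g request -q 'not Timestamp:Resp' \
    -i ReqURL -i RespStatus -i Timestamp
```

Left running on a cache_misc host, this would catch one of the suspected timeout cases live instead of reconstructing it after the fact from the vk output.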
[14:55:33] 10Traffic, 10DBA, 06Labs, 06Operations, 10Tool-Labs: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376258 (10Antigng_) My bot was using /w/index.php?action=raw to fetch the content of each page/redirect at zhwiki, then it will do some simple search/replace/templa...
[14:58:03] bblack, ema: When adding a new server to an LVS pool, is it weighted at 0 by default? I'm trying to get ready to send traffic to our new maps servers...
[14:58:38] gehel: the weight is defaulted to some normal weight when initially added via puppet, but the "pooled" parameter is initially set to "no", and then you manually pool it in with confctl.
[14:59:21] bblack: ok, so I will be able to pool them slowly and check that nothing is breaking. Thanks!
[14:59:43] right
[15:00:04] our pattern with other such switches has been to slowly pool in the new ones, then slowly depool the old ones
[15:00:24] once all the old ones are dynamic-depooled, then remove them from the puppetized config for the service and they'll get pulled completely from the runtime lists
[15:00:34] bblack: That's what I was planning to do.
[15:01:24] gehel: https://wikitech.wikimedia.org/wiki/Depooling_servers for the confctl commands
[15:01:34] We also have new maps servers that will be ready at some point in eqiad. My thinking is that we should use those in an active-active way. Does it make sense?
[15:02:00] ema: thanks! I was just reading that page :)
[15:02:05] those are depooling examples; set/pooled=yes for pooling, of course :)
[15:03:30] gehel: we'll want to set up a separate kartotherian.svc.eqiad.wmnet for the eqiad ones, and then update the cache_maps applayer stuff to list both, but choose one (codfw currently), as we do for e.g. restbase in cache_text
[15:03:42] true active/active isn't yet supported by the cache clusters, but will be....
[15:04:32] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/cache/text.yaml#L57
[15:04:38] ^ restbase example
[15:04:55] vs current maps config:
[15:04:57] bblack: ok, so we do the prep work for active/active... And that should also enable us to easily switch to eqiad if codfw/kartotherian goes down
[15:04:57] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/cache/maps.yaml#L23
[15:05:03] right
[15:05:32] after we're sure they're both working ok, we should probably switch to eqiad as the normal source, just so it's aligned with every other service on that.
[15:06:10] yep, makes sense
[15:06:48] 2 last questions: to set the weight, is it the obvious --action set/weight=1234 ?
[15:07:18] yep!
[15:07:56] for once something obvious!
[15:09:39] bblack: after how long do we want icinga to complain if zerofetch's successful-run file is stale?
[15:09:53] (https://phabricator.wikimedia.org/T132835#2211924)
[15:10:03] looking at https://config-master.wikimedia.org/conftool/codfw/ (and at the LVS config) I see that we have maps and maps-https. Do we still allow non HTTPS traffic for maps? Or is it just to expose a 307?
[15:10:37] gehel: I think that's just for redirects to https
[15:10:58] ema, bblack: thanks for your help!
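Putting the pooling workflow just discussed into concrete commands, a hedged sketch: the hostnames, tag values, and weights below are hypothetical, and the --tags/--action form is inferred from the set/weight usage gehel quotes above, so check the Depooling_servers wikitech page for the authoritative syntax:

```sh
# Pool a new maps backend (hypothetical host/tags), then step its weight
# up gradually while watching for errors:
confctl --tags dc=codfw,cluster=cache_maps,service=kartotherian \
    --action set/pooled=yes maps2001.codfw.wmnet
confctl --tags dc=codfw,cluster=cache_maps,service=kartotherian \
    --action set/weight=10 maps2001.codfw.wmnet

# Later, dynamic-depool an old host (again a hypothetical name) before
# removing it from the puppetized service config:
confctl --tags dc=codfw,cluster=cache_maps,service=kartotherian \
    --action set/pooled=no maps-test2001.codfw.wmnet
```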
[15:15:23] 10Traffic, 06Discovery, 06Maps, 06Operations, 13Patch-For-Review: Send traffic to new maps200? servers - https://phabricator.wikimedia.org/T137620#2376339 (10Gehel) Actions required to enable traffic to new servers: # validate new servers are running fine # merge and deploy https://gerrit.wikimedia.org/...
[15:17:55] 10Traffic, 10DBA, 06Labs, 06Operations, 10Tool-Labs: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376361 (10Joe) @Antigng_ you might not have seen anything go wrong, but your bot was accounting for 50% of the uncached requests to our backends or more. It's a cle...
[15:19:11] ema: it runs every 15 minutes, but it's not hugely critical if there's an outage for a while either (we just lose new updates to carrier networks put in by humans, but things still work)...
[15:19:32] ema: so probably something like warn after 4H, crit if it's 1D?
[15:19:54] bblack: sounds good!
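A sketch of how those thresholds could translate into the icinga check, assuming the stock check_file_age plugin from monitoring-plugins; the path to zerofetch's successful-run stamp file is hypothetical (4 hours = 14400 s to warn, 1 day = 86400 s to go critical):

```sh
# Warn if the success stamp is older than 4h, go critical after 1d.
# The stamp file path below is illustrative, not zerofetch's real one.
/usr/lib/nagios/plugins/check_file_age \
    -w 14400 -c 86400 \
    -f /var/netmapper/zerofetch.success
```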
[16:12:56] 10netops, 06Operations, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10akosiaris) https://gerrit.wikimedia.org/r/291819 and friends should be setting the grounds for fixing this mess finally
[16:43:36] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2376676 (10faidon) @brion, any news?
[16:45:35] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2376681 (10brion) Seems ok lately, haven't noticed any problems last week.
[16:46:45] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2343755 (10yuvipanda) Through the proxy or the public IP?
[17:01:48] 10netops, 06Labs, 06Operations: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2376702 (10brion) Either. You can have the IP back, I guess, doesn't seem to make any difference.
[21:11:42] 10Traffic, 06Operations: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2377539 (10ori)
[21:14:16] 10Traffic, 06Operations: Parametrization of VCL is inconsistent - https://phabricator.wikimedia.org/T137747#2377557 (10ori)
[21:30:45] ori: yeah, welcome to my hell. the answers aren't easy, and some of this is backtracking. mostly it's on hold over "get through varnish4 transition first"
[21:31:41] I'll comment in the ticket, though
[21:39:05] yeah, I know the history and all the work you've done to improve matters, I hope I didn't come across as patronizing
[21:51:15] actually I take that back, I'm not going to try to compose a response right now, it's just too difficult :)
[21:52:32] I tried, but couldn't even get through my first bullet point without getting mired in researching all the stupid reasons things are currently the way they are
[21:55:44] ori: maybe if there's something more-meta I can say about the whole topic it's this:
[21:58:03] the underlying problem is ridiculously complex and still has lots of overlapping dimensions that things vary on. some of the structure for that is more-complex than it currently needs to be, because we've removed some dimensions and then not yet refactored for their loss, too. There's also even more-fundamental structural problems than you're noting, like why are things arbitrarily split into modules/varnish, modules/role/cache/, and templates/varnish/ in the way they currently are, which makes little sense.
[21:58:50] generally the whole thing is complex enough that it tends to defy any kind of rigorous cleanup plan that's mapped out from start to finish in the abstract. too many unforeseen details get in the way.
[21:59:27] the way to attack it is to actually attempt refactor commits that attack at least one aspect of the problem and improve the situation, and then iterate.
[22:01:33] so I'm not saying the problems you describe aren't real, I'm just saying, a 4-point plan to line everything up along certain seemingly-ideal principles is unlikely to work. you'll probably have to amend that plan 20 times along the way.
[22:02:09] at an even deeper level, the problem is that our tools suck.
[22:03:19] puppet has stupid limitations and inconsistencies. hiera, big ruby template-code blocks, puppet parser functions, etc... are all just bandaids trying to make the tooling suck less, but they're not necessarily a cleanliness win.
[22:04:36] for hiera in particular, one of the big reasons for it is labs/beta compat without sprinkling realm-conditionals all over the place, too. and then we really don't use beta to test the caches/nginx anyways, making all that pointless weight around our necks anyways.
[22:05:25] re: 4-point plan -- I thought of it the same way, and did not suppose things were that simple. Still, I noticed a few things when working on https://gerrit.wikimedia.org/r/#/c/294171/. I started by noting them in passing in the commit message, but that seemed hand-wavy and unhelpful, so I filed an issue, because sometimes it's useful to have someone who's not deeply familiar with the code give that sort of feedback
[22:05:45] it is :)
[22:06:33] ori: so back on that change - analytics doesn't look at upload traffic at all?
[22:08:39] as for our tools sucking, what depresses me is that we've made them worse. Pre-Hiera, we had a working set of Puppet patterns that were pretty clear and the overall trend was convergence. The way we use Hiera is awful, and our attempts to make the integration with Puppet smoother were (in my opinion) rushed and short-sighted
[22:08:42] (also, geoip is already restricted to the text cluster, I shouldn't have mentioned it in the upload cookies at all. enable_geoiplookup is poorly named and scoped; at this point it should probably be something like "vcl_uses_libgeoip", or just an optional per-cluster parameter injection used only on text to inject that)
[22:09:28] geoip stuff is due for heavy refactor soon anyways, on text's way to varnish4
[22:09:59] the request logs go to hadoop, but they don't factor upload reqs in page view and unique devices counts
[22:10:08] so they don't need the additional headers
[22:10:32] you mean they don't need X-Analytics at all?
[22:11:46] (and does the same argument apply to cache_misc and cache_maps? I think the reason this stuff spread out of the text cluster in the first place was so they could log the same standard analytics everywhere)
[22:14:33] I'm not sure about misc / maps, but I'm pretty confident they don't need X-Analytics at all on upload reqs. I'm second-guessing myself now so I'll check with them
[22:15:05] anyways, enable_geoip really is just for the systemd unit at this point. only text has geoip code (which used to be on mobile+bits too).
[22:15:34] and enable_set_cookie seems strange to me, but it's hard to word why
[22:16:36] in practice it's controlling the CP= and WMF-Last-Access= cookies. Arguably CP= is just another type of analytics-related cookie. we might well decide to use vcl-based cookies for some unrelated but important reason on cache_upload later, too.
[22:17:04] because it doesn't capture the intent, and because the name makes the normal case seem like the exception
[22:17:43] we want to know whether to intervene and strip Set-Cookie, not whether to intervene and allow it
[22:18:04] but we don't really need to strip it in the upload case. we know the origin isn't setting cookies by design. we just need to not set these varnish ones.
[22:18:05] I gave it that name to make it consistent with enable_geoip
[22:18:48] that's true, but there are multiple code paths in VCL that inject cookies, and presumably there will be more in the future
[22:19:05] yeah, but the future ones might be things that upload does want
[22:19:06] so I guess it's a question of whether you want to ensure we never set a cookie, but run the risk of masking a bug
[22:19:38] or refuse to guess, but run the risk of a pointless cookie going unnoticed for a substantial length of time
[22:19:42] I could go either way
[22:20:23] I guess what I'm saying is that we haven't decided, or aren't really trying to declare, that cache_upload's nature is that it should never have cookies.
[22:20:37] we're just noting that the only two cookies currently used with it are pointless there and should be eliminated
[22:21:28] well, it's a bit stronger than that. The difference between one cookie and two isn't substantial, but the difference between one cookie and none is, because some proxies refuse to cache anything with a cookie header
[22:21:57] yeah, but that doesn't take away that we might have a future functional reason to add a cookie there
[22:22:13] so it isn't merely that these two are useless -- it's that upload is in the advantageous position of not needing any cookies, so the ante for introducing one is higher
[22:22:15] at which point that patch will need to rename that option, I guess?
[22:22:28] also, what proxies?
[22:22:43] I was guessing you were worrying about the browser cache not liking set-cookie
[22:23:20] let me dig up the source for the proxy claim
[22:23:33] what I mean is, what outside of us can proxy us? it's all TLS'd.
[22:23:44] there are some that can with fake roots and org control of clients, but still.
[22:23:51] oh, durr
[22:23:57] yeah, that's a brain-fart on my part.
[22:26:21] yeah I dunno, I have to run out now. my parting thoughts are to leave geoip out of it (for now, but it could stand to be cleaned up another way), and make the rest all about flags for how much analytics we turn on for a cluster, and consider CP= an analytics feature too.
[22:27:19] makes sense
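A quick way to spot-check the outcome of the T137609 "cookieless upload" change discussed above would be to confirm that upload responses carry no Set-Cookie header at all; a minimal sketch, with the sample URL purely illustrative:

```sh
# Fetch response headers only; any surviving cookie (e.g. CP= or
# WMF-Last-Access=) would show up in the grep output. An exit status
# of 1 from grep (no match) is the desired result.
curl -sI 'https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg' \
    | grep -i '^set-cookie'
```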