[02:24:32] 10Traffic, 10MediaWiki-ResourceLoader, 10Operations, 10Performance-Team: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479 (10Krinkle) 05Open>03stalled a:05Krinkle>03None @ema @BBlack This outcome of this task should be for the mtail/varnishrls...
[02:24:45] 10Traffic, 10Operations, 10Performance-Team: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479 (10Krinkle)
[13:07:26] 10Traffic, 10DNS, 10Operations, 10Operations-Software-Development, 10Patch-For-Review: DNS repo: add CI checks for obvious configuration errors - https://phabricator.wikimedia.org/T182028 (10Volans)
[14:37:14] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/478680 ... is a cookie name/val picked already?
[14:38:06] <_joe_> bblack: PHP_ENGINE
[14:38:13] <_joe_> but we can change it...
[14:39:25] <_joe_> bblack: so we're separating the cache based on the request header, right
[14:39:35] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) AH a task! I missed that. Making now.
[14:40:14] <_joe_> ofc if for some reason the backend will respond with php7 for a request that should've been served by HHVM, we're out of luck
[14:40:38] <_joe_> but tbh erring more on the side of caution would be useless imho
[14:48:30] we can't really change things at that point anyways
[14:48:44] it's the whole "vary has to be on attributes of a request" thing
[14:49:39] PHP_ENGINE=php7
[14:50:32] _joe_: what sets the cookie, and do users already have it set in some cases?
[14:51:50] hmmm I guess they must not, or there must be some other backend-side protection preventing it from working, or we'd already be polluting caches
[14:52:50] <_joe_> bblack: no one should have it set, no
[14:54:09] <_joe_> bblack: the idea is to inject the cookie to users who don't have it. At first it can be the appservers via a beta feature
[14:55:20] <_joe_> during the transition I was thinking of "extracting" it either client-side (with javascript) or at the edge (so writing some more lines of temporary vcl logic)
[14:56:37] "extracting" meaning what?
[14:56:45] removing it after all requests are php7?
[14:56:56] <_joe_> argh sorry heh
[14:57:34] yeah, we should probably remove it slowly
[14:57:51] e.g. some random low-probability cookie unsetter that ramps up over time in the caches
[14:58:08] ok
[14:58:13] <_joe_> I would make the cookie last 1 week
[14:58:22] <_joe_> the way people usually do with a/b tests
[14:58:33] hmmm yeah that works too, then you can just stop setting it
[14:58:53] and they'll drain out of the cookied state pseudorandomly over a week then, presumably
[14:58:56] <_joe_> "extracting" means randomly assigning the users to the two pools (PHP_ENGINE=php7/PHP_ENGINE=unicorns)
[14:59:02] <_joe_> yes
[14:59:42] right, we've done some hacky edge things for that before, e.g. "if client IP ends in 3" or whatever
[15:00:21] <_joe_> I was thinking something like "if no cookie, pick a random number between 0 and 1"
[15:01:11] it's hard to get that number right, since it's per-request not per-(session/user/device/whatever)
[15:02:22] and even at a tiny value, the result is that all users will eventually become cookied if the cookie lasts forever. With a 1 week cookie you could math it out whether a certain rate leads to eventual all-php7, but only if you knew the min/max reqrate of clients or something.
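A quick sketch of the math being alluded to (p, r, and T are illustrative symbols, not from the discussion): if each uncookied request independently picks up the cookie with probability p, then a client making r requests over a cookie lifetime T ends up cookied with probability roughly

```latex
% probability a client is cookied, given per-request cookie
% probability p, per-client request rate r, cookie lifetime T:
P(\text{cookied}) \approx 1 - (1 - p)^{rT}
```

With a permanent cookie, rT grows without bound and P tends to 1 for any p > 0, i.e. everyone eventually becomes cookied; with a one-week T the steady-state cookied fraction depends on per-client request rates, which is why it can only be mathed out if client reqrates are known.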
[15:08:11] _joe_: in any case, I guess the very first ones will be manual, and the next ones will be a beta checkbox that sets it per client.
[15:08:25] <_joe_> yes
[15:08:26] we have time I think to work out a sane way to ramp in some fraction of all clients
[15:09:00] <_joe_> yes
[15:09:07] but I think the last few times we've had some version of this conversation for various purposes, roughly splitting users on some bits of their client IPs is about as close as we can get in practice.
[15:09:33] <_joe_> bblack: I don't clearly remember how we did it for hhvm, but I think we extracted the cookie on the backend
[15:09:35] (random drive-by comment!) I think the traditional way is that you set a cookie to everyone, with a number from, say, 0-10 if previously unset, and then you just pick what those buckets mean internally
[15:09:41] <_joe_> which is suboptimal ofc
[15:09:44] (and it still spreads more than point-in-time client IP distributions would suggest, but users hopping IPs doesn't happen enough to seriously screw things up)
[15:09:46] e.g. 0-1 (and later 0-3?) are php7, the rest are hhvm
[15:10:04] that may have implications for user privacy, though
[15:10:16] <_joe_> paravoid: that's a bit less clean and yes, that
[15:10:50] 10 buckets isn't that much, maybe that's OK for a time-limited experiment
[15:10:53] yeah, it's very easy for this to devolve to "oh if we just ignore all the privacy reasons we don't do this stuff for general analytics, we can...." :)
[15:11:16] <_joe_> anyways, sorry, in a meeting :D
[15:12:27] the client IP hack is a variant of the 10 buckets idea, just without a cookie, and with the implicit understanding that it's imperfect and has a slow spreading effect to it (e.g. if you're setting the feature cookie for 1/10th of IPs, users who hop IPs more randomly and frequently are more likely than you'd think to pick up the cookie eventually)
[15:13:12] if you do this client IP-based, why set a cookie?
[15:13:17] but clients hop IPs less than you'd think, I think (heh), and if the cookie has a fixed 1-week life it will mitigate that to some degree as well.
[15:13:31] <_joe_> well the cookie is a convenient way to let all the layers know what they need to do
[15:13:46] <_joe_> it's also something you can mangle from the application
[15:13:49] right, the cookie records the decision and gives us the indicator to vary the caches, etc
[15:14:05] (and slowly drains out of vary slotting when we stop setting it)
[15:15:30] and at least gives us some notion of reasonable stability (e.g. in the case a user is on some mobile network and constantly moving and changing IPs, they're not going to constantly flip-flop, they're just more likely than average to eventually get the cookie and keep it for a week+)
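A rough VCL sketch of the "if no cookie, pick a random number" edge logic discussed above. The X-Seven header name is borrowed from the vary:x-seven mention later in the log; the 10% fraction, the Max-Age, and the subroutine placement are illustrative guesses, not the actual patch:

```vcl
import std;

sub vcl_recv {
    if (req.http.Cookie ~ "PHP_ENGINE=php7") {
        # returning client, already assigned to the php7 pool
        set req.http.X-Seven = "php7";
    } elsif (req.http.Cookie !~ "PHP_ENGINE=") {
        # uncookied client: random assignment (bblack's caveat: this is
        # per-request, not per-client, so clients accumulate into the
        # cookied pool over time rather than splitting cleanly)
        if (std.random(0, 1) < 0.1) {
            set req.http.X-Seven = "php7";
        }
    }
}

sub vcl_deliver {
    if (req.http.X-Seven == "php7" && req.http.Cookie !~ "PHP_ENGINE=") {
        # record the decision client-side for one week (604800s)
        set resp.http.Set-Cookie = "PHP_ENGINE=php7; Max-Age=604800; Path=/";
    }
}
```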
[15:37:21] on the topic of periodic IP changes: https://labs.ripe.net/Members/ramakrishna_padmanabhan/reasons-dynamic-addresses-change
[15:37:32] (my IP at home changes every day for example)
[15:43:48] ema: can you check over https://gerrit.wikimedia.org/r/c/operations/puppet/+/478680 for whether it makes sense for the task at hand?
[15:45:57] _joe_: separately, I wanted to pick your brain about the TTL-editing we do on discovery DNS during planned switchovers...
[15:46:13] <_joe_> bblack: sure
[15:46:33] <_joe_> bblack: we could also just wipe the caches btw
[15:46:58] _joe_: how necessary/useful is it, or are we just taking advantage of an optimization because it's there?
[15:47:19] <_joe_> bblack: well depends on the switchover
[15:47:36] <_joe_> for mediawiki, we don't want to have to wait 5 minutes of read-only if the switchover is planned
[15:47:51] the context is I'm contemplating getting rid of dynamically-varying TTLs from gdnsd's dynamic stuff, because it turns out it's one of those tricky things that's very limiting/complex for future feature needs
[15:48:07] <_joe_> bblack: we could just reduce the ttl then directly
[15:48:11] <_joe_> and wipe the caches
[15:48:19] <_joe_> when we need it
[15:48:35] or just reduce the TTL more than a TTL before, if it's planned. then no need to wipe caches.
[15:48:55] or just set it to a minimal value always, if it's perf-acceptable.
[15:49:10] <_joe_> I think 10 / 15 seconds is acceptable
[15:49:51] there's a lot of ways the feature adds needless general complexity, but the killer interaction is with DNSSEC
[15:50:03] I mean, we could do DNSSEC and keep varying TTLs, it just makes things trickier than it should have to be.
[15:51:31] (because the signatures cover the TTL value too, so it would mean we can't pre-generate them if they're volatile at runtime (vs e.g. zone reloads))
[15:53:11] or we could hack DNSSEC in the protocol-vs-recommendation sense and output the types of TTLs that validating caches do, but ewww
[15:54:14] but yeah ok, doesn't sound like our use of it is strictly necessary
[15:54:38] I don't think anybody's is really, but people will invent uses for anything that's there to use.
[15:54:55] https://xkcd.com/1172/
[16:07:55] bblack: I'm trying to understand whether setting up vary slotting in the common vcl instead of the backend-most backend vcl only can cause troubles of any sort?
[16:08:42] ema: probably! :)
[16:08:49] it's at least pointless, I think?
[16:09:24] at the time I think I was just keeping the code together in one file, but really, the cookie->header could be FE-only, too
[16:12:41] yeah, cookie->header fe-only and vary-slotting be-only is a good idea I think
[16:12:54] ok
[16:14:17] also there's 11-vary-cookie.vtc which can help to test whether the patch works as advertised :)
[16:16:23] yeah I almost never edit tests because I still don't have a good workflow for running them and don't want to make one :)
[16:16:45] in any case, we only read the cookie, and we do it before the evaluate_cookie mangling
[16:17:11] :)
[16:17:30] updated patch
[16:17:50] looking
[16:18:12] if you want to amend it with a working test, that does have value :)
[16:21:53] ah, now I have to write two tests because of the fe/be split
[16:21:54] on it
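Since the contents of 11-vary-cookie.vtc aren't shown in the log, here is a minimal varnishtest sketch of the frontend half (the cookie -> X-Seven translation); the header name and VCL body are assumptions, not a copy of the real test:

```vtc
varnishtest "PHP_ENGINE cookie is turned into an X-Seven request header"

server s1 {
    rxreq
    # the backend-facing request should carry the slotting header
    expect req.http.X-Seven == "php7"
    txresp
} -start

varnish v1 -vcl+backend {
    sub vcl_recv {
        if (req.http.Cookie ~ "PHP_ENGINE=php7") {
            set req.http.X-Seven = "php7";
        }
    }
} -start

client c1 {
    txreq -hdr "Cookie: PHP_ENGINE=php7"
    rxresp
    expect resp.status == 200
} -run
```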
[16:22:43] 10netops, 10Operations, 10ops-codfw: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10Papaul) The switch is connected to port 48 on scs-a1-codfw
[16:23:10] 10netops, 10Operations, 10ops-codfw: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10Papaul)
[16:28:51] 10netops, 10Operations, 10ops-codfw: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10Papaul)
[16:31:12] bblack: we might want to do the slotting only at the backend-most layer?
[16:31:36] ema: ?
[16:32:21] 10netops, 10Operations, 10ops-codfw: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10Papaul)
[16:33:20] bblack: for clarity, it could perhaps help to only do the slotting at the backend-most layer to avoid having the same object in eg a ulsfo backend with vary:x-seven, and no vary in eqiad?
[16:33:59] oh I see
[16:34:05] yes, possibly!
[16:43:02] bblack: amending the patch with tests and the x-next-is-cache thing
[16:43:39] ok
[16:48:04] bblack: done!
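A guess at the shape of the "x-next-is-cache thing": gate the Vary slotting on whether the next hop is another cache, so only the backend-most layer (the one talking to the applayer) adds the slot. This assumes X-Next-Is-Cache is a request header set when the chosen backend is another cache layer, which the name implies but the log doesn't spell out:

```vcl
sub vcl_backend_response {
    # only slot objects by PHP engine at the backend-most cache layer;
    # upper layers then inherit the Vary header from the response itself
    if (!bereq.http.X-Next-Is-Cache) {
        if (beresp.http.Vary) {
            set beresp.http.Vary = beresp.http.Vary + ",X-Seven";
        } else {
            set beresp.http.Vary = "X-Seven";
        }
    }
}
```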
[17:17:03] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Anomie) > server I note that with X-Wikimedia-Debug it seems you have to specify a backend, so this wouldn't be terribly useful there eit...
[17:34:18] so, when I added grafana-beta.wikimedia.org, IIRC I just puppet-merged the text.yaml hiera change and let it roll out to the varnishes as puppet ran. but I'd like to switch the backend of grafana.wikimedia.org to the new host relatively quickly. is this a job for cumin? something else?
[17:39:02] cdanis: yeah, that's indeed a job for cumin. You can force a puppet run on the eqiad/codfw caches with `cumin -b1 'A:cp-text_eqiad or A:cp-text_codfw' 'run-puppet-agent -q'`
[17:39:32] cool, thanks! any idea how long i should expect from start to finish?
[17:40:43] a couple minutes?
[17:41:20] ok that's not bad, thank you
[17:42:40] can try it now with no real pending changes. most of the timeline for a parallel puppet run isn't in the deployment of that one diff anyways, it's the actual execution of the agent in general and its communications, etc.
[17:43:12] oh I didn't see ema had -b1 on there above, that would be much slower
[17:43:25] without the b1, a couple minutes
[17:43:44] ah yes, -b1 not really necessary in this case!
[17:43:58] cdanis: the "filter by" feature on grafana-beta does not seem to be working for me? If I click on the Tags dropdown and type in 'traffic', the traffic label does not get singled out
[17:44:16] one moment
[17:44:22] i just made its db read-only so that might be the issue
[17:45:12] ... wow what the
[17:45:51] also if I click on one of the 'traffic' labels in the 'Recent' list I get no matching dashboards
[17:46:22] cdanis: can you reproduce too?
[17:46:26] yes :(
[17:46:28] :(
[17:46:45] filtering from the tags list on the side works, for the tags you can select in that UI element that lacks a scrollbar :(
[17:47:34] this is really unfortunate
[17:48:08] the (similar but different) UI works in https://grafana-beta.wikimedia.org/dashboards
[17:48:29] yes, that one works!
[17:48:55] I am happy to file a bug with upstream
[17:49:03] do you consider this to block the upgrade?
[17:50:30] nah, that's not a blocker IMHO. Let's file a bug upstream though
[17:50:38] will do
[17:50:46] can you repro it in play.grafana.org?
[17:51:33] yes, if you click on "Tags" and type in b-a-n-a-n-a, you still get all labels
[17:52:15] s/labels/tags/
[17:53:05] yeah
[17:59:32] ok
[17:59:44] I'm filing a bug upstream
[17:59:47] even made them a screen recording
[18:03:27] lol, when clicking a tag name: Request URL: https://play.grafana.org/api/search?query=&starred=false&tag=undefined
[18:05:58] lol
[18:09:40] https://github.com/grafana/grafana/issues/14437
[18:09:58] (yes, i made the screen recording on windows; i already knew how to do screen recording on my gaming machine)
[18:11:18] clicking on the right works though :D
[18:11:23] haha yes!
[18:11:31] and the UI at the /dashboards URL also works!
[18:12:46] anyway ty ema for doing better quality control than myself or than upstream ;)
[18:43:36] ohhh
[18:43:40] 5.4.1 says "Dashboard Search: Fixed filtering by tag issues."
[18:43:50] okay I guess I will attempt pulling that in to reprepro.
[19:20:43] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Krinkle) >>! In T210484#4811185, @Anomie wrote: >> server > > I note that with X-Wikimedia-Debug it seems you have to specify a backend,...
[19:24:08] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10ArielGlenn)
[19:37:01] 10netops, 10Operations, 10ops-codfw: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi)
[19:59:46] ema: 5.4.1 fixes the "click on tag names from the dashboard list" issue, but does not fix the "typing in tags filter dropdown box"
[19:59:54] updating upstream bug and proceeding as discussed
[21:07:25] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) a:03Gilles
[21:28:14] 10Traffic, 10Operations, 10ops-ulsfo: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327 (10RobH)
[21:28:17] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10RobH) 05Open>03Resolved
[21:30:12] 10Traffic, 10Operations, 10decommission, 10ops-ulsfo: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592 (10RobH)
[22:37:25] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10ayounsi) Talked to Faidon last week, we agreed that a mechanism to ignore AS paths learned from the route servers would be a useful thing to have and not only a hotfix for this issue....
[22:38:00] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10ayounsi) a:03ayounsi
[22:56:54] 10Traffic, 10netops, 10Operations: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi)
[23:05:48] 10Traffic, 10netops, 10Operations: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi) Added some context in the task description. In addition, 185.15.58.0/24 is currently reserved as a 2nd anycast range (since T98006 I think), which goes against the idea of segregating the whole...
[23:31:34] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10faidon) - It's been a while, but I believe an import statement in the neighbor block overrides the parent one in its entirety, and does not supplement it, so we'd have to repeat the wh...
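To illustrate the Junos behavior faidon describes in that last comment (the policy names and the neighbor address are made up for the example; 2001:db8::1 is a documentation address):

```junos
protocols {
    bgp {
        group IX-route-servers {
            import [ rs-in ];    /* group-level import policy chain */
            neighbor 2001:db8::1 {
                /* a neighbor-level import replaces the group-level
                   chain entirely, it does not supplement it, so rs-in
                   must be repeated here if it is still wanted */
                import [ depref-v6 rs-in ];
            }
        }
    }
}
```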