[01:28:30] 10Traffic, 10Continuous-Integration-Infrastructure, 06Operations, 07Regression: Favicon broken on doc.wikimedia.org and integration.wikimedia.org (HTTP 500) - https://phabricator.wikimedia.org/T147814#2704346 (10BBlack) The response lacks `Content-Length` because it's sent with `Transfer-Encoding: chunked`... [11:30:38] ema: so I've been thinking about how to simplify all the backends stuff in VCL, continuing some earlier thread about this... [11:31:39] 1) It's safe to assume we only use "dynamic" (etcd) directors for varnish<->varnish, not applayer backends. splitting things up so that vslp and dynamic directors only applies to the varnish<->varnish case, and using simpler VCL for the applayer case, probably makes sense... [11:32:30] (and gets rid of the boolean "dynamic" - all varnish directors are dynamic, and non-varnish ones aren't) [11:33:24] 2) Our current way of doing dynamic is to have the go-templated directors.vcl define the backends from the etcd list, and then reload VCL, but that's not optimal for a handful of reasons.... [11:34:00] a) Reloading VCL constantly is awful in general, we really spam those reloads, and I'm pretty sure they're leaky in at least some cases, etc.... [11:34:25] b) obviously, the templating mess [11:35:34] c) removing and re-adding backends completely (as in undefined or defined + VCL reload) is not the same thing as setting backend health, especially for vslp (or in varnish5, its replacement "shard"), because that initializes a new hashcircle instead of doing the normal chash fallback [11:35:59] ... it works out fine in the common case luckily, but doesn't allow for rampup/warmup to function as they should [11:37:40] so, it would be better if the varnish backends list was static (no VCL reloads on depools, comes from hieradata like before), and etcd simply toggled backend health [11:38:27] which could do today by having it execute: "varnishadm -n whatever backend.set_health be_cp1099_eqiad_wmnet sick" for depool, and "varnishadm -n whatever backend.set_health be_cp1099_eqiad_wmnet auto" for pool [11:38:58] (+ some quibbles to sort out about syncing that state across VCL reloads) [11:39:38] (meaning confd will have to at least periodically poll "varnishadm backend.list" to see if a VCL reload has re-set some "sick" back to "auto"?) [11:39:51] I'm not even sure without testing whether that happens or not [11:40:36] in an ideal/future state, we simply create a vmod that polls etcd directly and toggles sick/auto for us and handles the details [11:43:16] (or implement as a new kind of "probe", but then I don't think varnish currently even handles 2+ probes per backend, either) [11:44:35] anyways, lots of little steps between here and there. the first one in refactoring-order is getting rid of the pointless 2layer middle-class in the inheritance tree for cleanup [11:44:50] (patch incoming now, will check with the compiler if it comes out sane) [11:45:03] https://gerrit.wikimedia.org/r/#/c/315236/ [11:58:02] nice [11:58:44] yes the etcd vmod seems great for the future [11:59:00] merging that now, compiler says noop [11:59:16] alright [11:59:48] I wouldn't be shocked if puppetmaster race conditions = a few temporary catalog fails [12:34:28] 10Traffic, 06Operations: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2705399 (10BBlack) [12:35:48] 10Traffic, 06Operations: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2705413 (10BBlack) [12:35:52] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 3 others: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457#2705415 (10BBlack) [12:37:03] 10Traffic, 06Operations: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2705418 (10BBlack) [12:38:10] 10Traffic, 06Operations: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705444 (10BBlack) [12:39:38] 10Traffic, 06Operations: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2705399 (10BBlack) p:05Triage>03Normal [12:40:21] I took some test-stabs at refactoring backend definitions locally, but really we need to finish cleanup on cache_misc backends first, so, tasks arranged around that [12:40:39] (logstash, wdqs, rcstream, pybal_config) [12:42:33] OK, I'm trying to run pcc against the _text WIP patchset but I get jenkinsapi.custom_exceptions.TimeOut [12:42:36] grrr [12:43:27] oh and beta I guess, it uses IPs :P [12:45:19] ugh something changed in debian-unstable's default vim config, and now it mostly seems to never use either of the X paste buffers (default, clipboard) [12:45:51] (as in, when I select text in vim with the mouse, it's not available for either kind of pasting, because vim is trying to do something else magical and override that) [12:46:00] ditto for pastes into vim [12:47:11] 10Traffic, 06Operations: Use hostnames (not IPs) in deployment-prep varnish app_directors - https://phabricator.wikimedia.org/T147848#2705466 (10BBlack) [12:48:51] oh the joy of running unstable :) [12:54:46] bblack: hi! When you have time, would you mind to let me know if https://gerrit.wikimedia.org/r/#/c/314662 is worth it or not? [12:55:33] elukey: well a few general thoughts on that kind of thing: [12:56:48] 1. I'm a fan of Wextra (and really, lots of things beyond that, but they get trickier to configure or use), but Werror is probably a bad idea; sometimes it will trip you up where you have to live with a benign warning, or a new compiler version causes new warnings, etc... it's best left off IMHO, unless the project is huge and it's a policy thing and compiler versions are specified, etc... [12:57:37] 2. Be very careful in "fixing" them. sometimes what seems like an obvious trivial warning fix does fix the warning, but produces logically-incorrect code with new bugs.... [12:57:42] I haven't dug into all the specific changes yet [12:57:55] sure sure you are completely right [12:58:29] in fact I had to insert an horrible assignment to fix unused parameters [12:58:42] but without -Werror it wouldn't be a problem [12:59:27] there's usually a "right" way to fix things such that sane compilers work without warnings if they're not too new [12:59:52] but sometimes the "right" way ends up being a fairly deep and invasive change, and the trivial fixup makes things worse in the name of cleaning up the warning output [13:00:05] yep yep [13:00:34] overall I think that the fixes (except the assignment) are "good" (but I have tested vk briefly up to know) [13:00:36] you could probably make some kind of distallation of "best C practices" that if followed avoid most warnings even in modern compilers [13:01:09] the problem comes in that usually those practices would have to be followed in all code in the entire stack to work right, including standard C library headers, which often fail at it :P [13:01:18] :D [13:01:20] and somewhere there's an interface boundary where things are borked [13:01:37] (even if not the standard C library, some other dependency or other code you don't want to or can't touch) [13:04:02] so maybe I can rework a bit this code review to 1) remove -Werror and 2) remove the horrible assignment, leaving with a warning for unused parameter [13:04:47] then the next step is to add basic tests to the format_parse function, refactoring it a bit since it is a big monster atm [13:05:21] looking at what's there: there seems to be some confusion around the varlen stuff with size_t vs int vs ssize_t [13:05:40] it's still ssize_t in the function argument, but s/int/size_t/ elsewhere, and then the check is converted from ==-1 to ! [13:05:53] ssize_t is signed (like int), size_t is unsigned [13:06:27] but if you're coming from a difference, a negative result *is* possible [13:06:43] and if you cast that to size_t, it's just wrapping it to an unreasonable value I think [13:07:01] e.g. where the new code has: (size_t)(t-ptr) [13:07:34] oh yes I missed the ssize_t in the parameter, snap :( [13:07:36] if you really want to have a size_t-clean (always >= 0) value elsewhere, but the initial difference could go negative, you can get there in two steps [13:07:41] e.g.: [13:07:52] ssize_t foo_signed = t - ptr; [13:08:08] if (foo_signed < 0) { throw_some_fatal_error_or_whatever; } [13:08:16] size_t foo = (size_t)foo_signed [13:08:47] or if you think earlier constraints make t-ptr<0 impossible, there's probably a way to rearrange the math to make it more explicit [13:08:53] so this would catch horrible states rather than simply casting [13:09:53] if they're actually pointers, there's also (in C99 anyways) ptrdiff_t to be more-explicit as well (which is probably usually the same thing as ssize_t) [13:11:58] re: ==-1 to ! - I also switched the varlen base case for tag_add, since afaics the -1 is used only in there [13:12:11] var ? varlen : 0, [13:12:57] as discussed a while back :in the long run, we should probably still look at switching this to a varnishncsa wrapper in python anyways [13:13:11] so that we automagically keep up with upstream fixups to the VSM interfaces, etc [13:13:43] (just to put some context on how much we care about long-term quality stuff) [13:15:34] sure sure, even if I still think that vk is a good software that needs only a bit of care :) [14:27:17] 10Traffic, 06Operations: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705444 (10mark) Well... LVS itself relies on pybal_config of course... :) [14:32:34] alright I've ported the pure-VCL part of cache_text to v4 https://gerrit.wikimedia.org/r/#/c/314716/, noop on v3 and fully untested on v4 but it compiles on both frontend and backend [14:32:45] now for the inline-C... [14:33:30] I've tried some quick replacements to see how much has changed (eg: const struct sess* sp -> const struct vrt_ctx *ctx) [14:33:32] ship it [14:33:40] :) [14:33:56] then a few functions have changed, such as VRT_GetHdr, and those are easy to port [14:34:30] I'm not really sure how to proceed with the functions calling other vmods, they don't seem to work anymore [14:34:36] vgc.c:5289:9: error: ‘Vmod_Func_header’ undeclared (first use in this function) Vmod_Func_header.append(ctx, HDR_RESP, "\013Set-Cookie:", [14:35:59] the docs are, let's say, terse regarding inline-C https://www.varnish-cache.org/docs/trunk/users-guide/vcl-inline-c.html [14:39:24] ema: the inline C is confined to geoip at this point, right? or are there other cases I'm forgetting? [14:39:59] Vmod_Func_header() was coming from the vmod_header module, that might all change on v4 anyways [14:41:26] we could ignore it for now for initial offline testing, and write a vmod to replace it [14:41:40] bblack: there's normalize_path as well but that's alright to port [14:41:46] translating what's there to a vmod isn't fundamentally difficult, and would be cleaner on the interface boundaries to the calling VCL [14:42:05] ah yes, normalize_path.... [14:42:20] in the long run, that should be both more-flexible and a vmod, but if it works it works for now [14:42:55] yeah, normalize_path is really just const struct sess* sp -> const struct vrt_ctx *ctx [14:43:01] and surely there must be alternative geoip implementations out there by now? [14:43:48] there are like 20 of them or something, but they all have either deficiencies or mismatches to our use-case on the interfacing [14:44:01] we could pick the best/closest and maybe patch it up [14:44:52] the most annoying part is that all the good geoip/maxminddb vmod names are "taken" now by stuff that's not good enough :) [14:45:09] ours originates from arthur bergman's @ wikia [14:45:52] or, well, it did [14:45:57] i don't recognize much of it by now ;) [14:46:55] yeah that was faidon's initial port of his stuff to inline-C [14:47:06] but I rewrote most of it in the end and modeled things a bit differently [14:55:22] 10Traffic, 06Operations: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705784 (10BBlack) I don't think it uses it directly. LVS/pybal talks directly to etcd, whereas this is just an HTTP view of the same data. [14:56:39] 10Traffic, 06Operations: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705785 (10BBlack) (well, beyond that, I think it's exposing the old text files even, not the new etcd data?). [14:57:57] 10Traffic, 06Operations: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705444 (10Joe) @bblack you are correct saying it isn't used by pybal anymore, but incorrect when saying it's exposing the old text files: the current files are generated from etcd. [14:59:24] 10Traffic, 06Operations: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2705793 (10BBlack) ah ok, the format threw me off, it makes sense now :) [15:06:30] oh right [15:06:33] i keep forgetting about etcd [17:05:27] what's the misc-web-lb ulsfo outage listed on the 5th? [17:05:32] I don't remember it offhand [17:08:22] I also don't [17:10:26] it was when ulsfo was drained [17:10:29] for the cr upgrades [17:10:36] so not an outage, just a random page [17:40:37] <_joe_> bblack: am I blocking you? [17:40:43] <_joe_> I didn't think I was [17:45:19] <_joe_> uhm no I see both patches have been merged, it was just "look at it" in general :) [17:55:12] _joe_: that was for run-no-puppet? yeah no blocker there, just pointing out its utility for others [17:56:20] <_joe_> yeah it is :) [17:57:18] there's an undocumented "feature" for more-complex commandlines and/or where you don't care about preserving an existing puppet-disable, too [17:57:25] where you can use it like: [17:57:43] puppet agent --disable whocares; run-no-puppet echo; blah blah blah; [17:58:11] and run-no-puppet in this case is just a barrier to ensure any agent run that had already started finishes before you continue [19:10:35] 10Traffic, 10Continuous-Integration-Infrastructure, 06Operations, 13Patch-For-Review, 07Regression: Favicon broken on doc.wikimedia.org and integration.wikimedia.org (HTTP 500) - https://phabricator.wikimedia.org/T147814#2706944 (10Krinkle) 05Open>03Resolved a:03Krinkle [20:22:40] 10Domains, 10Traffic: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707138 (10Zppix) [20:23:09] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707151 (10Zppix) [20:27:14] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707138 (10Krenair) I think so yes. Why is this #Domains? [20:29:42] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707157 (10jeremyb) see also {T88859}, cc @Slaporte [20:30:24] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707164 (10Zppix) @jeremyb its restricted [20:37:19] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707138 (10Dzahn) see T44085 ? [20:38:55] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707198 (10Zppix) [20:40:15] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707214 (10Dzahn) https://w.wiki/ -> https://meta.wikimedia.org/wiki/Special:UrlShortener though "Creating new short URLs is temporarily disabled. " [20:46:02] 10Domains, 10Traffic, 06Operations: Mediawiki short url - https://phabricator.wikimedia.org/T147887#2707233 (10jeremyb) >>! In T147887#2707164, @Zppix wrote: > @jeremyb its restricted yes, sorry, I hadn't noticed that. [22:37:53] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2707669 (10GWicke) @brion, this task looks pretty stalled by now. Do you see a chance for reviving it any time... [23:36:42] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2707752 (10brion) @GWicke Yes, I think we should revive it as part of a concerted effort between Editing/Parsi...