[00:16:09] HTTPS, OTRS, Security, operations: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1996417 (Dzahn) we should check one more time since today OTRS moved to mendelevium. i think we already fixed everything except the DNSSEC and DANE part before though, as Jan Zerebecki s...
[00:19:37] HTTPS, OTRS, operations: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1996427 (Dzahn) The OTRS upgrade has happened. I think this can follow soon.
[00:25:43] HTTPS, OTRS, Security, operations: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1996450 (Dzahn) https://www.ssllabs.com/ssltest/analyze.html?d=ticket.wikimedia.org grade A @DaBPunkt i would like to claim this is resolved. All the main things you listed when this w...
[00:26:01] HTTPS, operations: Add Forward Secrecy to all HTTPS sites - https://phabricator.wikimedia.org/T55259#1996453 (Dzahn)
[00:26:04] HTTPS, OTRS, Security, operations: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1996451 (Dzahn) Open>Resolved a: Dzahn
[08:38:19] Traffic, operations, Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1997024 (Joe) Ok I think I pinned down a case where this can happen: whenever we delete a directory, what happens is that etcd sends down the wire ``` {"action":"delete","node":{"key":"/testdir","dir"...
[08:39:12] Traffic, operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1996717 (Bianjiang) With T78676 and other related efforts (separating content into different APIs), 50 req/s global limit (limit by UserAgent?) is not enough even for regular incremental craw...
[08:51:12] Traffic, operations, Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1997048 (Joe) That happens more in general when we remove a node.
[09:07:44] Traffic, operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1996743 (GWicke) declined>Open
[09:23:01] Traffic, operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1996757 (GWicke) Reopening, reflecting the ongoing discussion.
[09:59:38] <_joe_> ok, bug reproduced
[09:59:41] <_joe_> https://integration.wikimedia.org/ci/job/tox-jessie/4404/console
[10:53:55] HTTPS, Research-and-Data, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#1997286 (Johan) I'll mention it in Tech News. This will go into production on F...
[11:27:14] Traffic, operations: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1997393 (MoritzMuehlenhoff) I looked into this for a bit, but couldn't really pin this down cp3042/cp3049 have had the same crash in RCU handling. This _might_ have been fixed by this commit wh...
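_joe_'s notes on T125397 above (08:38, 08:51) point at the trigger: when a directory or node is removed, etcd's v2 watch stream emits an event with "action":"delete" whose node carries no value, and pybal's etcd coroutine falls over on it. The sketch below only illustrates the event shape and the kind of branch such a watcher needs; it is not pybal's actual (Twisted-based) code, and the index numbers and server keys are made up for illustration.

```
# Roughly the event quoted on T125397: deleting a directory makes the etcd
# v2 watch stream emit an "action":"delete" event with no value on the node.
# Index numbers here are invented.
delete_event = {
    "action": "delete",
    "node": {"key": "/testdir", "dir": True,
             "modifiedIndex": 11, "createdIndex": 6},
    "prevNode": {"key": "/testdir", "dir": True,
                 "modifiedIndex": 6, "createdIndex": 6},
}

def apply_event(event, servers):
    """Apply one etcd v2 watch event to a {key: value} server table.

    A watcher with no branch for delete-style actions (or one that assumes
    node["value"] is always present) has nothing sensible to do here, which
    is the failure mode the task describes.
    """
    node = event["node"]
    key = node["key"]
    if event["action"] in ("delete", "expire", "compareAndDelete"):
        if node.get("dir"):
            # Directory removal: drop every server under the prefix.
            for k in [k for k in servers if k.startswith(key + "/")]:
                del servers[k]
        else:
            servers.pop(key, None)
    else:  # set / create / update / compareAndSwap
        servers[key] = node.get("value")
    return servers

servers = {"/testdir/mw2086.codfw.wmnet": '{"pooled": "yes"}'}
print(apply_event(delete_event, servers))  # -> {}
```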
[11:34:52] Traffic, MediaWiki-Interface, operations, MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1997420 (Joe)
[13:16:23] Traffic, MediaWiki-Interface, operations, MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1997604 (Krenair)
[13:50:12] Traffic, operations, Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1997739 (ema) libvmod-tbf packaged and [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/varnish/libvmod-tbf | pushed to gerrit ]]. Some additional work wa...
[13:55:47] <_joe_> ema: you and bblack should really take a look at https://phabricator.wikimedia.org/T124356
[13:57:32] Traffic, operations, Patch-For-Review, Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1997767 (Joe) I just wrote some unit tests and they confirm my suspicion: we do not manage DELETE in etcd within pybal. The tests can be seen in the (still failing) PS
[14:02:36] _joe_: I took a look at the issue
[14:03:02] I'd wait for bblack to show up though before acting
[14:06:11] <_joe_> yeah +1
[14:07:09] I'm here -ish
[14:09:34] <_joe_> bblack: no rush :)
[14:12:39] the patches look sane -ish :)
[14:13:18] <_joe_> which ones?
[14:14:23] <_joe_> I guess the pybal ones
[14:18:37] yeah the pybal ones
[14:18:40] Traffic, MediaWiki-Interface, operations, MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1997851 (BBlack) Can we get some more-spe...
[14:18:53] the fact that parser caches have apparently been corrupting for a week or two and I'm just now hearing about it is awesome too, though!
[14:20:38] > This page is full of old bullshit that isn't true anymore. Don't believe a word of it.
[14:20:53] (trying to find out more about parser caches)
[14:21:15] ema: honestly I've never dug deep enough to fully understand the parser cache to my liking
[14:21:35] bblack: in one sentence, what is it? :)
[14:21:51] but the bottom line is that when MW generates the HTML output of a page, it stores it in some persistent cache that it can invalidate, which is all totally separate/before we get to varnish stuff
[14:22:08] oooooh
[14:22:35] and it can do complex code checks on invalidating *that* on minor code changes that affect output and stuff
[14:23:17] and also, generally when they purge something from the parser cache, it also triggers a varnish purge of the same object
[14:23:31] at least, some mechanisms that do one do the other, that I'm aware of
[14:24:14] but that's not the only part of MW that triggers varnish purges, or is it?
[14:24:19] right
[14:24:21] it's not
[14:25:01] I think there are multiple pathways to triggering varnish purges and/or invalidating a parser cache object, and some of them are linked up to each other so that both happen, but I don't know that all are.
[14:26:22] from an "applayer as a black box" perspective though, in a situation like this, question #1 for them is "is MW done ever outputting bad output from this bug?" (which implies they've killed any such entries in the parser cache too, or some code will invalidate them as it reads them at least, or whatever)
[14:26:36] and then question #2 is "ok, but varnish is still outputting the bad pages?"
[14:27:10] and then "how do we ban them on some unique attributes, given we can't just 'wipe the cache plzthx'?"
[14:27:49] right :)
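For readers who, like ema above, are meeting the parser cache for the first time: bblack's summary (14:21-14:25) is that MediaWiki stores the rendered HTML of a page in a persistent, invalidatable cache that sits entirely behind varnish, and that parser-cache purges are at least sometimes wired up to varnish purges of the same object. The toy sketch below only illustrates the "validate/reject on read" pattern that bblack brings up again just below at 14:35/14:36; it is not MediaWiki's ParserCache API, and every name in it is invented.

```
class RenderCache:
    """Toy model of a parser-cache-like store: rendered HTML keyed by page
    title, validated when it is read back. Invented names throughout; this
    is not MediaWiki's ParserCache API."""

    def __init__(self):
        self._store = {}  # title -> (html, metadata)

    def save(self, title, html, meta):
        self._store[title] = (html, meta)

    def get(self, title, is_still_valid):
        """Return cached HTML, or None if missing or rejected on read."""
        entry = self._store.get(title)
        if entry is None:
            return None
        html, meta = entry
        if not is_still_valid(meta):
            # Reject (effectively invalidate) on fetch -- the pattern of the
            # MobileFrontend hook discussed below, which of course never runs
            # when the request is a varnish hit.
            del self._store[title]
            return None
        return html

# Example: reject anything rendered by a code version known to be buggy,
# forcing a re-render on the next fetch.
cache = RenderCache()
cache.save("Main_Page", "<p>hello</p>", {"rendered_by": "1.27-wmf.11"})
print(cache.get("Main_Page",
                lambda meta: meta["rendered_by"] != "1.27-wmf.11"))  # None
```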
[14:29:18] one thing that would be nice (and this isn't the first time it's crossed my mind, should make a ticket) is to have MW generate some code-version headers
[14:29:37] ah cool
[14:29:58] so we could invalidate all contents produced by a specific version at once I guess
[14:29:59] maybe like: X-MW-Version: c=1.27-wmf.11,o=1.27-wmf.12
[14:30:19] where c= and o= are the versions that generated the parsercache text and that actually output the final text to varnish
[14:30:46] yeah given how we roll through versioned releases, invalidating everything generated by just version = current-2 may not be the end of the world.
[14:30:57] if we had that data in a header to ban on
[14:35:45] ema: so at the MediaWiki level they can deal with it like so apparently: https://gerrit.wikimedia.org/r/#/c/266399/2/includes/MobileFrontend.hooks.php
[14:36:14] (they get the chance to reject (effectively invalidate) the parser cache when something fetches it, based on complex conditionals)
[14:36:26] but obviously that never gets run if the object is a hit in varnish
[14:40:27] ema: in other news, I prepped up some patches yesterday to finish up mobile decom and maps provisioning, etc
[14:40:30] https://gerrit.wikimedia.org/r/#/q/status:open+topic:cache-mobile-decom,n,z
[14:40:40] https://gerrit.wikimedia.org/r/#/q/status:open+topic:cache-maps,n,z
[14:41:43] the mobile-decom LVS patches are the first ones to go out of all of that
[14:42:19] in *theory*, they Just Work, as the net change in the set of lvs configured listening IPs is a no-op, and the pybal config change will wait for our manual pybal restarts
[14:43:06] but in practice, we probably want to (a) compiler-check them then (b) disable puppet on all lvs* and cache_text before merging and then (c) carefully test it on one text node and one backup LVS first, etc...
[14:43:43] some manual review of "this doesn't look like someone made a stupid typo" is in order before all of that if you have time too :)
[14:48:40] well now that I look at the patch series structure, I should add (d) disable puppet on the mobile caches during some of this too, and probably the 2x mobile patches could've gone together in one, but whatever
[14:49:08] (not that the mobile caches do anything functional now, but the 2/2 LVS patch will probably break their confd config stuff)
[14:49:48] in any case, only the first patch is really ready to go, we should probably wait on the pybal delete fix before others JIC
[14:50:32] nice! That's a lot of patches :)
[15:00:03] <_joe_> bblack: my tests could crash the etcd routine on the codfw low-traffic lvs servers
[15:00:07] <_joe_> is that ok?
[15:02:22] it's ok to me, none of those are public-facing AFAIK
[15:02:28] but what do I know? :)
[15:02:55] (by that I mean, maybe I don't understand dependencies of some natively-x-dc services these days, and it does matter)
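The X-MW-Version header proposed at 14:29 does not exist; the point of it would be that varnish could ban cached objects by the MediaWiki version that produced them, instead of wiping the whole cache. A hedged sketch of what acting on such a header could look like, assuming the header were actually stored on cached objects and that varnishadm is reachable on each cache host (which runs more than one varnish instance, per the systemd discussion later on); this is illustration, not an existing tool.

```
import subprocess

# Hypothetical header from the 14:29 idea:
#   X-MW-Version: c=1.27-wmf.11,o=1.27-wmf.12
# If varnish stored it on cached objects, stale output could be banned by
# the version that produced it rather than by wiping the cache.

def ban_mw_version(version_regex, instance=None):
    """Issue `ban obj.http.X-MW-Version ~ <regex>` through varnishadm.

    instance selects a named varnish instance (cache hosts run more than
    one); None means the default instance. Instance names are assumptions.
    """
    cmd = ["varnishadm"]
    if instance:
        cmd += ["-n", instance]
    cmd += ["ban", "obj.http.X-MW-Version", "~", version_regex]
    subprocess.check_call(cmd)

# Example: invalidate everything whose parser-cache text came from wmf.11,
# on both the default and a "frontend" instance (name assumed):
# ban_mw_version(r"c=1\.27-wmf\.11")
# ban_mw_version(r"c=1\.27-wmf\.11", instance="frontend")
```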
[15:05:13] <_joe_> it should be ok
[15:05:25] <_joe_> just crashing the etcd routine is not the end of the world
[15:06:00] <_joe_> ok, it crashed on my test host
[15:06:09] <_joe_> and on the codfw lvss too
[15:07:10] <_joe_> I'll wait to be done with the tests before restarting them
[15:08:32] ok cool
[15:09:02] <_joe_> [pybal] INFO: [text] Removing server mw2086.codfw.wmnet (no longer found in new configuration)
[15:09:05] <_joe_> it works :)
[15:09:52] \o/
[15:10:20] <_joe_> and pooled=inactive works too
[15:10:22] <_joe_> :)
[15:10:58] so pooled=inactive basically means "do the same thing as if it were deleted from the list in puppet, but with an undo-able confctl update to live data only"
[15:11:30] <_joe_> yup
[15:11:30] the big difference in practice aside from puppet being that pybal won't poll that backend for healthchecks either
[15:11:44] <_joe_> and won't count it towards the depool threshold
[15:11:46] (it still polls pooled=no even though they're not configured in ipvsadm)
[15:11:49] oh yeah that too
[15:12:04] <_joe_> it's practically removed from the equation
[15:13:32] _joe_: before you move on rolling up a new package and all that
[15:13:58] I never even filed a bug yesterday, but there's the whole thing about how removing an entire service via puppet doesn't remove its nodes from etcd
[15:14:09] but now that I typed that, I guess it's a different package anyways heh
[15:14:15] <_joe_> yes, that has nothing to do with pybal :)
[15:14:47] <_joe_> in that case, I think I advised to do two commits: first remove the nodes, then the service, for now
[15:19:22] <_joe_> bblack: I'm going to merge these patches and prepare a package, probably
[15:19:38] <_joe_> I'm unsure I'll have the time to install it around today though
[15:20:31] Traffic, operations, Patch-For-Review, Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1998073 (Joe) Problem reproduced and it's properly fixed when https://gerrit.wikimedia.org/r/268361 is applied.
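Restating the pooled-state semantics from the 15:10-15:12 exchange as a small model: which servers end up configured in ipvsadm, which pybal keeps health-checking, and which count toward the depool threshold. This is only a summary of the conversation, not pybal's or conftool's implementation, and the hostnames other than mw2086 are invented.

```
# Toy restatement of the pooled-state semantics described above.
# Health-check results are ignored here; pybal handles those for real.

def classify(servers):
    """servers: dict of hostname -> pooled state ('yes' | 'no' | 'inactive')."""
    in_ipvsadm = [s for s, p in servers.items() if p == "yes"]
    # pooled=no is removed from ipvsadm but still polled (15:11:46);
    # pooled=inactive is not polled at all (15:11:30).
    health_checked = [s for s, p in servers.items() if p in ("yes", "no")]
    # Only pooled=inactive is explicitly excluded from the depool threshold
    # in the chat (15:11:44): "practically removed from the equation".
    threshold_pool = [s for s, p in servers.items() if p != "inactive"]
    return in_ipvsadm, health_checked, threshold_pool

servers = {
    "mw2085.codfw.wmnet": "yes",       # normal: in ipvsadm, health-checked
    "mw2086.codfw.wmnet": "no",        # depooled, but pybal still polls it
    "mw2087.codfw.wmnet": "inactive",  # as if deleted from puppet, undo-able
}
print(classify(servers))
```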
[15:22:40] _joe_: ok
[15:22:57] in any case, at least we know the trigger for sure, we know what to avoid and/or workaround if need be
[15:24:18] (for instance, we're not deleting nodes for the kernel reboots, so I could still start those today anyways)
[15:25:35] <_joe_> ok
[15:32:20] next up on that front is figuring out the best current way to automate it to the degree possible
[15:32:39] we have "pool" and "depool" now, and we have an icinga downtime script on neon too
[15:33:07] it's fairly trivial to script up a loop on neodymium that uses salt to downtime in icinga, pause, depool, pause, reboot
[15:33:35] I don't know that I have an awesome way to repool after automagically, unless I script up something for one-shot-on-bootup for now
[15:34:40] what I do have so far is a list of nodes that need the reboot, which has been interleaved/shuffled to get nodes from same cluster and/or dc spaced out from each other a bit, so that we can rip through the list reasonably-fast
[15:35:26] but reasonably-fast should be inclusive of "make sure we're not backlogging a bunch of un-repooled nodes in our wake" :)
[15:36:00] <_joe_> to magically repool, one could create a systemd unit that depends on both varnish instances
[15:36:15] <_joe_> and that just fires "repool" when both have been activated
[15:36:38] in the general case we probably don't want that usually active, because on a crash (or especially if it was down a few hours) we don't want autorepool
[15:36:41] but in this case we do
[15:36:58] <_joe_> ok
[15:37:00] but that could always be a "touch /etc/repool-me" and have the unit remove it when it detects it to do so
[15:37:07] <_joe_> so we don't want to repool in general
[15:37:08] <_joe_> and yes
[15:37:14] <_joe_> that too :)
[15:37:49] yeah I guess the reboots aren't ultimately that time-critical, I could spend time on that unit today first
[15:38:48] <_joe_> ouch, my unit tests fail on our CI because jessie has an old-ass version of python's mock package
[15:38:52] <_joe_> sigh
[15:40:32] :)
[15:48:18] bblack: https://phabricator.wikimedia.org/P2566
[15:57:10] it's too bad the diff comes out so ugly, the real net change is pretty minimal if the stanzas were in the same order lol
[16:01:39] :)
[16:03:16] Traffic, operations, Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1998203 (Aklapper) This task has been "Unbreak now" priority since it was created and has seen no updates for nearly two months. [[ https://www.mediawiki.org/w...
[16:17:51] Traffic, operations, Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1998273 (Joe) @Aklapper the bug is solved in the code but needs to be deployed to production, which will happen very soon.
[16:18:14] <_joe_> AssertionError: Expected call: mock()
[16:18:17] <_joe_> Actual call: mock()
[16:18:27] <_joe_> you should love em python test frameworks...
[16:24:00] _joe_: I don't get it, isn't it the same call? Also, 404 OK?
[16:30:29] Traffic, MediaWiki-API, Services, operations, Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1998311 (GWicke) I have set up a basic latency and request rate dashboard using the Varnish metrics at https://grafana.wikimedia.org/dashb...
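The conditional-repool idea from 15:36-15:37 boils down to: a one-shot unit ordered after both varnish instances that repools only when a flag file was touched before the planned reboot, and that always consumes the flag so crashes and long outages never auto-repool. Below is a minimal sketch of the logic such a unit could run; the flag path follows the chat, the varnish unit names and the `pool` invocation are assumptions, and this is not the traffic-pool unit drafted later in the log.

```
#!/usr/bin/env python
"""Sketch of the conditional repool-on-boot idea from 15:36-15:37.

Intended to be run by a one-shot systemd unit ordered after both varnish
instances. It repools only if a flag file was touched before the planned
reboot, and always consumes the flag, so a crash or a host that was down
for hours never auto-repools. Unit names and the `pool` command are assumed.
"""
import os
import subprocess
import sys

FLAG = "/etc/repool-me"  # path follows the chat (15:37:00)

def both_varnish_instances_active():
    # Unit names assumed; the unit's After=/Requires= ordering would normally
    # guarantee this, so the check is just belt and braces.
    units = ("varnish.service", "varnish-frontend.service")
    return all(
        subprocess.call(["systemctl", "is-active", "--quiet", u]) == 0
        for u in units)

def main():
    if not os.path.exists(FLAG):
        return 0          # unplanned boot: stay depooled
    os.unlink(FLAG)       # consume the flag first, keeping this one-shot
    if not both_varnish_instances_active():
        return 1
    return subprocess.call(["pool"])  # the "pool" script mentioned at 15:32

if __name__ == "__main__":
    sys.exit(main())
```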
[17:16:34] Traffic, MediaWiki-Interface, operations, MW-1.27-release, and 5 others: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#1998513 (Danny_B) NEW
[17:16:42] HTTPS, OTRS, Security, operations: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1998519 (DaBPunkt) >>! In T91504#1996450, @Dzahn wrote: > https://www.ssllabs.com/ssltest/analyze.html?d=ticket.wikimedia.org grade A > > @DaBPunkt i would like to claim this is resolv...
[17:17:47] Traffic, MediaWiki-Interface, operations, MW-1.27-release, and 5 others: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#1998513 (Danny_B)
[17:18:47] Traffic, MediaWiki-Interface, operations, MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1954680 (Danny_B) I splitted the cache pu...
[17:23:15] https://gerrit.wikimedia.org/r/#/c/268420/ -> traffic-pool for systemd
[17:23:26] first draft anyways, need to test and play
[17:25:16] Traffic, operations, Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1998562 (ema) libvmod-vslp packaged and [[https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/varnish/libvmod-vslp|pushed to gerrit]].
[19:44:17] HTTPS, Research-and-Data, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#1999439 (Sadads) @Johan Thats the goal. It might be afterwards
[20:27:55] Traffic, Deployment-Systems, Performance-Team, operations, Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#1999652 (Krinkle)
[21:08:46] Traffic, MediaWiki-API, Services, operations, Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1999899 (GWicke) Code behind the current metrics is at https://github.com/wikimedia/operations-puppet/blob/c62c102e8/modules/varnish/files...
[21:12:44] Traffic, Deployment-Systems, Performance-Team, operations, Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#1999935 (Krinkle) Deployment strategy: 1. [x] [mediawiki/core] Change MediaWiki...