[09:32:16] netops, Operations: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3982552 (dkg) hm, it might be nice to have access to that space in `/srv`, but i don't think it's necessary right now. it looks like some extra space was already freed up earlier, and i've freed up mo...
[09:35:49] netops, Operations: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3982552 (elukey) I would spend a bit of time trying to move /var/lib/postgresql to /srv to avoid the recurrence of this issue; a root partition so small is not meant to keep database data in my opinion :)
[10:50:58] Traffic, Operations: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789604 (Vgutierrez) For future reference, nginx now (since 1.13.1) works around this issue by setting TCP_NODELAY before doing the handshake: https://trac.nginx.org/nginx/ticket/413#comment:8 OpenSSL removed acces...
[12:38:09] Traffic, Operations, Performance, Performance-Team (Radar): missing H2 coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#3985229 (BBlack) We actually do use the same cert for both, so we don't need the secondary certs bit. Remaining b...
[12:44:26] vgutierrez: I see you've been digging into our nginx layer issues, nice! :)
[12:44:45] netops, Operations: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3982552 (akosiaris) What exactly is postgresql doing on that machine without it being puppetized? There's a very strict rule against this and it is very clearly spelled out in L3.
[12:45:24] vgutierrez: have you had a chance to take a look through our current set of custom patchwork?
[12:46:00] I can sort of run down here, for your later perusal, some short thoughts/explanations on them:
[12:47:04] 0100-dynamic-tls-records.patch - this is an import of Cloudflare's dynamic TLS record size stuff. Tuning is currently guesswork at best, with no proven benefit, but we suspect it helps. It would be nice to do some better tuning of this and/or adopt better approaches in the future.
[12:47:49] 0500-ssl-curve.patch - this doesn't affect the connection, but is there to allow us to log stats on negotiated ECDHE curves (e.g. x25519 vs prime256v1 in UA adoption)
[12:49:13] 0600-stapling-multi-file.patch - this allows us to use nginx manual stapling_file with nginx's support of multiple certs (ECDSA+RSA). This is kind of a complicated topic...
[12:50:09] 0660-version-too-low.patch - this was put in place quite a while ago, because we were getting nginx log spam from SSLv3 client connection attempts. I don't actually know if it's necessary anymore, or maybe there's a simpler way to prevent the spam, or maybe SSLv3 connection attempts are rarer now than they used to be anyway?
[12:50:54] 0700-do-wait-shutdown.patch - was an experiment to see if we could reduce RST rate in case this was the cause, but the experiment was a failure and we should probably withdraw it in our next build.
[12:53:17] ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/nginx/+/refs/heads/master/debian/patches/ )
[12:54:00] rewinding back to assumptions/basics on the complex topic of the stapling-multi-file patch:
[12:55:30] we deploy parallel ECDSA+RSA certs for a couple of reasons: we can't dump legacy RSA yet (UA compat), and ECDSA is more-efficient for the vast majority that can already use it. Also, due mostly to certain popular versions of IE, supporting ECDSA certs at all increases our desirably-high percentages of forward-secrecy+AEAD.
[12:56:34] nginx supports those parallel certs in general now upstream (there was a time when even that was a custom patch for us), but only supports their built-in stapling for it. they don't (yet? ever?) support multiple certs + ssl_stapling_file -based external stapling, which this patch adds.
[12:57:15] we use ssl_stapling_file and an external stapling script because the built-in nginx automatic stapler has corner-case issues about handling startup and staple-refresh correctly.
[12:57:42] (it has gaps where it will fail to staple responses while waiting on initial upstream staple response, and it doesn't try to refresh early with overlap, etc)
[12:59:50] netops, Operations: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3985286 (faidon) Open>Resolved a:faidon I've deleted a 7.7G file and freed up some space. As for Postgres, it's for a temporary situation for a bit of a high-priority and unusual situation,...
[13:24:28] bblack: yup, moritzm pointed that out :)
[13:24:59] (re: T150561)
[13:24:59] T150561: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561
[13:26:00] yeah that one we patched libssl for :)
[13:26:35] the explanation on trac#413 though seems to leave some window of possibility that the NODELAY fix may still be less-optimal than just having a >4K buffer configured. We'll have to test and see!
[13:27:39] actually the TCP_NODELAY + your patch should improve the handshake timings
[13:31:30] yeah, probably did :)
[13:31:36] (we're already past that patch, I think)
[13:33:43] Traffic, Operations: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3985419 (ema)
[13:33:57] Traffic, Operations: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3985430 (ema) p:Triage>Low
[13:34:04] Installed: 1.13.6-2+wmf1~jessie1
[13:34:07] bblack: yup
[13:41:41] .win 15
[13:49:32] Traffic, Operations, Performance, Performance-Team (Radar): missing H2 coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#3985528 (Gilles) > legacy HTTP/1 UAs may suffer due to UA limits I believe that the connection limit UAs have fo...
[15:20:19] <_joe_> ema, bblack I would like to upgrade confctl on the cache hosts, but given its critical nature, I'd prefer to do it tomorrow morning/early afternoon at the latest
[15:30:20] ok
[15:34:13] _joe_: sounds good
[15:34:46] I guess there will be some interlock between deploying+testing that and the first cache_text v5s, since we use confctl to self-depool
[15:34:58] should be manageable! :)
[15:35:50] <_joe_> bblack: actually, it should all be zero-issues zero-downtime, as the only incompatible change doesn't touch the primitives you're using
[15:37:33] <_joe_> btw, now you can do magic things like
[15:37:35] <_joe_> sudo -i EDITOR=emacs confctl select 'dc=esams,name=cp3033.esams.wmnet' edit
[15:37:51] <_joe_> but I just saw a UI bug there :P
[15:38:01] <_joe_> luckily it's just a beta version!
[15:38:08] * bblack gets the rubbing alcohol out to cleanse the channel of emacs taint
[15:38:28] <_joe_> ahah
[15:38:34] <_joe_> easiest troll ever
[15:38:41] <_joe_> ok, gonna brew some coffee
[15:38:44] <_joe_> ttyl
[15:39:11] btw the varnish-{front,back}end-restart scripts can be simplified quite a bit now
[15:40:04] oh?
[15:40:50] you can do: depool nginx
[15:40:57] to depool a single service on that host
[15:40:58] (btw they should really have implicit no-run-puppet within them, right now they rely on people remembering to run them under run-no-puppet)
[15:41:50] (also, varnish-frontend-restart lacks the protection varnish-backend-restart has against operating on an already-depooled daemon...)
[15:42:08] and then there's the whole fallocate vs mkfs topic from last week :)
[15:43:09] yeah, I should have said that better, the confctl part can be simplified :D
[15:43:29] yeah except the protection bit...
[15:43:31] hmmm
[15:43:54] is there some way we could get depool flags to handle that case?
[15:44:20] <_joe_> ?
[15:44:33] depool -o nginx (or whatever -o should be, I'm not even sure what word would describe it properly)
[15:44:43] which would depool nginx and return exit status zero if nginx was pooled
[15:44:53] but do nothing and exit non-zero if nginx was already depooled
[15:45:19] https://github.com/wikimedia/puppet/blob/2fed306e634e3e0a8483bd78f834b8b717e6f369/modules/conftool/files/conftool-simple-command.sh
[15:45:24] in other words, we want the opposite of idempotence. we want success only if change is possible.
[15:45:54] 68 lines of bash sounds like a python conversion! :)
[15:46:04] <_joe_> bblack: almost, I was on the verge
[15:46:56] <_joe_> bblack: so you want a script that returns 0 if it changes something, non-zero otherwise?
[15:47:14] the underlying rationale for the non-idempotent version is that if it's already depooled it's probably for a reason, in which case (a) we probably don't want the script restarting a service someone's already working on, and (b) even if restart was ok, we don't want the script to then repool at the end a service it didn't depool.
[15:47:51] since we already set -e, just having the depool command exit non-zero if already-depooled would accomplish that
[15:47:52] bblack: so you want an is_pooled
[15:48:01] <_joe_> bblack: so you want the ability to say --prev-status pooled=yes
[15:48:15] maybe
[15:48:19] <_joe_> and error out if that's not the case?
[15:48:34] in an ideal world, it would be a transaction rather than a racy check, but even a racy check is much better than no check.
[15:48:56] <_joe_> I honestly think these things should be done using conftool as a library for a specialized python script
[15:49:03] yeah either an is_pooled command we could use to error out at the top of the script.
[15:49:31] or --prev-status check would work, or a flag that says "exit non-zero if it's already in the desired state"
[15:49:36] (anti-idempotence)
[15:51:23] this all reminds me of my long-running complaint that both puppet disables and confctl depools should have arrays of reasons
[15:52:22] so that a script can depool/repool (or puppet-disable->puppet-enable) for a given reason-string, and this just adds to the set of reasons and then later removes that reason, but the state remains effectively disabled or depooled if other reasons remain.
[15:53:53] in practice, that seems hard to manage without update races for etcd, though
[15:54:19] etcd3 has transactions right?
[15:54:50] yeah I guess if you completely changed the data type of pooled, you might not even need that.
[15:55:36] something more like being pooled is implied by "depooled=[]", and being depooled is anytime the array is non-empty like "depooled=[bblack-testing-something,varnish-backend-restart-script,hwfail-T99999]"
[15:55:37] T99999: Incorrect prior block durations reported - https://phabricator.wikimedia.org/T99999
[15:55:42] heh
[15:55:48] lol
[15:56:12] yeah, we also have mediawiki that has the 3-way state, yes/no/inactive
[15:56:19] <_joe_> bblack: that would be interesting to parse via confd
[15:56:21] <_joe_> :P
[15:56:33] <_joe_> volans: that's actually pybal, not mediawiki
[15:56:55] yeah, true
[15:57:01] <_joe_> we just reused it for dsh
[15:57:41] but really "inactive" could be modeled as yes+weight=0 for most things, assuming pybal+ipvs support it.
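A minimal sketch of the "anti-idempotent" depool discussed above, assuming a purely in-memory state object. `PoolState`, `depool_only_if_pooled` and `restart_with_depool` are hypothetical names for illustration only; none of this is conftool's or confctl's actual API. The lock makes the check-and-change a single step instead of a racy check.

```python
import threading


class PoolState:
    """Hypothetical in-memory pooled/depooled record (not conftool's API)."""

    def __init__(self, pooled=True):
        self._pooled = pooled
        self._lock = threading.Lock()

    def depool_only_if_pooled(self):
        """Return 0 if this call depooled the service, 1 if it was already depooled."""
        with self._lock:
            if not self._pooled:
                return 1
            self._pooled = False
            return 0

    def repool(self):
        with self._lock:
            self._pooled = True

    @property
    def pooled(self):
        with self._lock:
            return self._pooled


def restart_with_depool(state, restart):
    """Restart only if we did the depool ourselves, then repool.

    If restart() raises, the service stays depooled, similar to a
    shell script running under `set -e`.
    """
    if state.depool_only_if_pooled() != 0:
        return False  # depooled by someone else: leave it alone
    restart()
    state.repool()
    return True
```

With this shape, a restart script never restarts or repools a service that someone else had already taken out of rotation.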
[15:59:13] <_joe_> bblack: no, "inactive" is "not in pybal even logically"
[15:59:28] oh
[15:59:30] so the other way around :)
[15:59:35] <_joe_> yes
[16:00:09] assuming etcd doesn't support arrays natively, you could treat it as a string with the tooling I guess, but then you get into races about multiple tools doing get->localmod->set on the array
[16:00:10] <_joe_> bblack: when you sell someone on me and ema working for a quarter on pybal + conftool full-time, these things might happen
[16:00:14] vs atomic array ops in etcd
[16:00:21] <_joe_> you're the manager, make it happen :D
[16:00:25] lol
[16:02:01] :)
[16:19:32] mwahahaha
[16:20:08] Traffic, Operations, Patch-For-Review: Non zero rated LVS IPs - https://phabricator.wikimedia.org/T170518#3986229 (BBlack) stalled>declined In light of: https://blog.wikimedia.org/2018/02/16/partnerships-new-approach/ , we're not going to restructure public subnets around this, as that has lo...
[17:01:25] netops, Operations: cr1-eqsin faulty interfaces - https://phabricator.wikimedia.org/T187807#3986478 (ayounsi)
[17:22:35] Traffic, Operations: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3985419 (BBlack) This could potentially be a large contributor to memory pressure issues we run into elsewhere, as well (and the inconsistencies around these, which may have to do with average reloads rates vs rest...
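A minimal sketch of the "array of depool reasons" idea from the discussion above, assuming the list is stored as a JSON string and every update goes through a compare-and-swap retry loop, so that concurrent get->localmod->set cycles cannot silently overwrite each other. `TinyKV`, `add_reason`, `remove_reason`, `is_pooled` and the key name are all hypothetical and in-memory; this is not etcd's or conftool's API.

```python
import json
import threading


class TinyKV:
    """Hypothetical in-memory key/value store with compare-and-swap (not etcd's API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key, "[]")  # a missing key means "no reasons"

    def compare_and_swap(self, key, expected, new):
        """Write `new` only if the stored value still equals `expected`."""
        with self._lock:
            if self._data.get(key, "[]") != expected:
                return False
            self._data[key] = new
            return True


def _update_reasons(kv, key, change):
    """Retry get -> local modify -> CAS until no other writer got in between."""
    while True:
        old = kv.get(key)
        reasons = set(json.loads(old))
        change(reasons)
        if kv.compare_and_swap(key, old, json.dumps(sorted(reasons))):
            return reasons


def add_reason(kv, key, reason):
    return _update_reasons(kv, key, lambda r: r.add(reason))


def remove_reason(kv, key, reason):
    return _update_reasons(kv, key, lambda r: r.discard(reason))


def is_pooled(kv, key):
    """Pooled is implied by an empty reason list, as in "depooled=[]"."""
    return json.loads(kv.get(key)) == []
```

A script would call `add_reason(kv, key, "varnish-backend-restart-script")` before its work and `remove_reason` afterwards; the service only counts as pooled again once every other reason has been removed as well.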
[17:42:28] netops, Operations, ops-codfw: codfw: mgmt switch replacement in D4 - https://phabricator.wikimedia.org/T187816#3986725 (Papaul) p:Triage>Normal
[17:43:58] netops, Operations, ops-codfw: codfw: mgmt switch replacement in D4 - https://phabricator.wikimedia.org/T187816#3986741 (Papaul)
[18:47:13] netops, Operations: Rack/cable/configure mr1-eqiad - https://phabricator.wikimedia.org/T187820#3986943 (ayounsi) p:Triage>Normal
[22:33:50] https://news.ycombinator.com/item?id=16419039
[23:47:18] Wikimedia-Apache-configuration, Patch-For-Review: techblog.wikimedia.org should redirect to blog.wikimedia.org/c/technology - https://phabricator.wikimedia.org/T181878#3805204 (Dzahn) ``` [tin:~] $ apache-fast-test T181878 mwdebug1001 ... http://techblog.wikimedia.org * 301 Moved Permanently http://blog...
[23:48:37] Wikimedia-Apache-configuration, Patch-For-Review: techblog.wikimedia.org should redirect to blog.wikimedia.org/c/technology - https://phabricator.wikimedia.org/T181878#3987728 (Dzahn) Should be resolved once puppet has run on all appservers, minus caching at different levels. I tested, deployed and confi...
[23:50:13] Wikimedia-Apache-configuration, Patch-For-Review: techblog.wikimedia.org should redirect to blog.wikimedia.org/c/technology - https://phabricator.wikimedia.org/T181878#3987732 (Dzahn) [tin:~] $ curl -vvv https://techblog.wikimedia.org 2>/dev/null| grep moved

The document has moved