[04:12:46] Traffic, Deployments, Operations, Performance-Team, and 2 others: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#3728547 (Krinkle)
[04:13:04] Traffic, Operations, Performance-Team, Performance-Team-notice: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#3728550 (Krinkle)
[14:59:59] I will be like 2 minutes late!
[15:00:31] ok!
[16:00:01] wifi died 2min before the end
[16:05:17] :)
[16:06:09] bblack: when you've got a minute, this should do part of what varnishxcps does https://gerrit.wikimedia.org/r/#/c/388064/
[16:07:08] part of it because I haven't worked on the legacy split stats, do we still need those?
[16:10:07] yeah so, I donno? it's about preserving the legacy graph really, needs some thought
[16:10:25] the basic background is this:
[16:10:50] OpenSSL+nginx gives us a ciphersuite string and an optional EC= parameter
[16:11:31] so for example we might see "EC=undef C=AES128-SHA", or "EC=X25519 C=ECDHE-ECDSA-AES128-GCM-SHA256"
[16:11:52] the "EC" parameter is really there to split out the different algorithms that are all "ECDHE" in the C= string
[16:12:26] logically, the contents can be broken out into a bunch of distinct aspects of the TLS connection security:
[16:13:05] (oh and there's the TLS version in the X-Conn-Props too, right? TLSv1.2 vs 1.1 vs 1.0)
[16:14:48] TLS Version, Key Exchange, Authentication, and Cipher
[16:15:13] but there's no easy 1:1 mapping of those to the parameters in X-Connection-Properties
[16:15:24] sometimes KX/AUTH is implicit in the ciphersuite string, sometimes it isn't, etc
[16:16:10] our legacy xcps stats output was/is consumed by: https://grafana.wikimedia.org/dashboard/db/tls-ciphers
[16:16:28] it more-or-less does 1:1 mapping of X-C-P fields C=, EC=, etc...
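(editor's sketch) The breakdown described above, from the C= and EC= fields into key exchange, authentication and cipher, might look roughly like the following. This is a minimal illustration and not the code in the gerrit change: the function names are hypothetical, the input is assumed to be the space-separated form quoted in the chat, and only a few common OpenSSL name prefixes are handled. Names without a prefix (such as AES128-SHA) are treated as implicit RSA key exchange and RSA authentication.

```python
def parse_conn_props(value):
    """Split a string like 'EC=X25519 C=ECDHE-...' into a dict of fields."""
    fields = {}
    for token in value.split():
        key, sep, val = token.partition("=")
        if sep:
            fields[key] = val
    return fields


def split_ciphersuite(name):
    """Return (key_exchange, authentication, cipher) for an OpenSSL-style name.

    Simplification: only the ECDHE/DHE prefixes and ECDSA/RSA/DSS
    authentication are recognised; anything else falls back to implicit RSA.
    """
    parts = name.split("-")
    kx = auth = "RSA"  # implicit when the name has no KX/AUTH prefix
    if parts[0] in ("ECDHE", "DHE"):
        kx = parts.pop(0)
        if parts and parts[0] in ("ECDSA", "RSA", "DSS"):
            auth = parts.pop(0)
    return kx, auth, "-".join(parts)


def describe(value):
    """Combine both fields into one record per connection."""
    fields = parse_conn_props(value)
    kx, auth, cipher = split_ciphersuite(fields["C"])
    # EC= refines the generic "ECDHE" into the curve that was actually used.
    if kx == "ECDHE" and fields.get("EC", "undef") != "undef":
        kx = fields["EC"]
    return {"kx": kx, "auth": auth, "cipher": cipher}


print(describe("EC=undef C=AES128-SHA"))
print(describe("EC=X25519 C=ECDHE-ECDSA-AES128-GCM-SHA256"))
```

The TLS version is left out here because the chat does not show which field carries it.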
[16:16:57] so it gives us stats on whole ciphersuite strings, and it also gives us a breakdown of the EC= stats, but it doesn't correlate them
[16:17:42] it can tell us TLSv1.2 percentage, or EC=X25519-vs-prime256v1, but it doesn't have correlated stats to ask questions like "How many EC=X25519 used TLSv1.2?"
[16:18:26] and then also the "Ciphersuite" string was just copied from C=, and it encodes a lot of the other information....
[16:18:52] so it's hard to make queries like "How many used RSA authentication" (but possible)
[16:19:04] -----
[16:19:23] the new hierarchical stats break it down in the stat key hierarchy, and correlate it all
[16:19:28] Traffic, Operations, ops-ulsfo: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711643 (fgiunchedi) >>! In T179050#3726279, @MoritzMuehlenhoff wrote: >>>! In T179050#3726257, @BBlack wrote: >> +1 We may as well move to stretch here. For the bastion/installserver role it should be...
[16:19:51] so it logs stats of the form: tls.... = N
[16:20:27] and we can run queries like summing (tls.tlsv1_2.*.*.aes128-sha) and find all tlsv1.2+aes128-sha regardless of key exchange and authentication.
[16:20:32] or many other possibilities
[16:21:32] https://grafana-admin.wikimedia.org/dashboard/db/tls-ciphersuite-explorer?orgId=1 explores that new data
[16:21:48] Traffic, Operations: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3730005 (fgiunchedi) +1 for ms-fe, I'm assuming the rollout will happen with puppet disabled and then progressively re-enabled?
[16:22:40] I *think* (pretty sure?) that most of the graphs in the older https://grafana.wikimedia.org/dashboard/db/tls-ciphers could be re-constructed using the newer-style data
[16:23:02] at least, something fairly close to them, good enough for all functional purposes. and then still have the "explorer" for digging deep.
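(editor's sketch) The wildcard summing over hierarchical keys mentioned above can be illustrated in a few lines. Everything here is an assumption for illustration: the counts are made up, the helper name sum_matching is hypothetical, and the order of the two middle key components (key exchange, then authentication) is a guess, since the chat only says the wildcards cover "key exchange and authentication". Each `*` is matched against exactly one dot-separated component, which is how Graphite-style globs behave.

```python
from fnmatch import fnmatchcase

# Hypothetical counters keyed as tls.<version>.<kx>.<auth>.<cipher>
stats = {
    "tls.tlsv1_2.x25519.ecdsa.aes128-gcm-sha256": 700,
    "tls.tlsv1_2.rsa.rsa.aes128-sha": 20,
    "tls.tlsv1_2.prime256v1.rsa.aes128-sha": 5,
    "tls.tlsv1_0.rsa.rsa.aes128-sha": 9,
}


def sum_matching(stats, pattern):
    """Sum every counter whose key matches the pattern component by component."""
    want = pattern.split(".")
    total = 0
    for key, count in stats.items():
        have = key.split(".")
        if len(have) == len(want) and all(
            fnmatchcase(h, w) for h, w in zip(have, want)
        ):
            total += count
    return total


# All TLSv1.2 + aes128-sha, regardless of key exchange and authentication
print(sum_matching(stats, "tls.tlsv1_2.*.*.aes128-sha"))
# A correlated question the legacy stats could not answer directly:
# how many X25519 connections used TLSv1.2?
print(sum_matching(stats, "tls.tlsv1_2.x25519.*.*"))
```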
[16:24:01] but I don't know if we can seamlessly integrate the legacy data + new data into a single view in a replacement/update of "tls-ciphers" (in order to kill logging the old form of the data, but still be able to see/correlate older history)
[16:25:43] TL;DR-ing all of that rambling in a functional sense:
[16:26:53] ema: Yes, we still need the legacy split stats working exactly like they do today, so we don't break our histories in the tls-ciphers grafana. But, maybe putting a little work into updating/revamping/replacing "tls-ciphers" can find a way to integrate old+new data over time and then we could kill the old data inputs going forward?
[16:27:52] the easiest answer is probably just to make a new tls-ciphers (maybe just new panels on the "explorer" one) from the new data only, and leave the legacy data+dashboard behind, once we've got a few months' overlap.
[16:28:41] Traffic, Operations, ops-ulsfo: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3730021 (RobH) So this now reads 'bast4001' in the subject, but it is bast4002, just making sure no one changed that intentionally? (I setup the task as bast4002, so checking.)
[16:28:47] (we have about 6 weeks so far)
[16:33:23] or we can leave everything as is and wait till we have `tls{auth="ECDSA", key_exchange="prime256v1", version="TLSv1.2", cipher="aes128-gcm-sha256"} 42` in prometheus
[16:33:50] Traffic, Operations, ops-ulsfo: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711643 (Dzahn) +1 to 400**2** and stretch !
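(editor's sketch) With labelled samples like the prometheus example above, the same correlation questions become "sum by label" aggregations. The snippet below imitates that in plain Python on made-up data. Only the first sample comes from the chat; the other two and the helper name sum_by are hypothetical. In PromQL the first call would correspond to something like `sum by (version) (tls)`, assuming the metric really ends up being named `tls`.

```python
from collections import defaultdict

# (labels, value) pairs; only the first one is taken from the chat example
samples = [
    ({"auth": "ECDSA", "key_exchange": "prime256v1",
      "version": "TLSv1.2", "cipher": "aes128-gcm-sha256"}, 42),
    ({"auth": "RSA", "key_exchange": "RSA",
      "version": "TLSv1.2", "cipher": "aes128-sha"}, 7),
    ({"auth": "RSA", "key_exchange": "RSA",
      "version": "TLSv1.0", "cipher": "aes128-sha"}, 3),
]


def sum_by(samples, *labels):
    """Group samples by the given label names and sum their values."""
    totals = defaultdict(int)
    for labelset, value in samples:
        totals[tuple(labelset[name] for name in labels)] += value
    return dict(totals)


print(sum_by(samples, "version"))
print(sum_by(samples, "version", "auth"))
```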
[16:33:51] I still have yet to create or modify any grafana-on-prometheus dashboard heh
[16:33:54] and then revamp all dashboards and make them great with some very advanced prometheus queries
[16:34:03] :)
[16:34:04] so I end up lacking a lot of context on what is or isn't possible in the new prometheus world :)
[16:37:23] oh, everything is possible in the new prometheus world if you try to write a query, fail, and then ask godog
[16:37:54] https://prometheus.io/docs/querying/basics/
[16:37:54] https://prometheus.io/docs/querying/functions/
[16:38:17] the examples are also pretty nice and clear https://prometheus.io/docs/querying/examples/
[16:39:13] heheh and grafana's prometheus query editor recently got much nicer to work with, e.g. with autocompletion
[16:40:44] for exploratory purposes once the data is in prometheus you can also run queries through its native UI, sadly the production prometheus is only accessible via ssh-tunneling its port, haven't had the time to add ldap-apache-auth yet
[17:00:25] quickly skimming through outrageously slow applayer responses on text:
[17:00:34] $ varnishncsa -q 'Timestamp:Resp[2] > 10.0' -F '%{VSL:Timestamp:Resp}x %r'
[17:00:37] 1509641994.426458 120.511203 0.000041 POST http://sk.wikipedia.org/api/rest_v1/transform/wikitext/to/html HTTP/1.1
[17:00:40] 1509641998.210704 67.169273 0.048716 POST http://sk.wikipedia.org/api/rest_v1/transform/wikitext/to/html HTTP/1.1
[17:00:43] 1509641999.160411 91.026539 0.029747 POST http://sk.wikipedia.org/api/rest_v1/transform/wikitext/to/html HTTP/1.1
[17:06:45] mmh lots of those are returning in either 120 or 240s
[17:07:16] and it's basically only /api/rest_v1 doing this
[17:07:21] Traffic, Operations, ops-ulsfo: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3730152 (RobH) Ok, I'll reimage, I'm also doing the conversion to bastion profile. 
(Unless Brandon tells me otherwise, I'm also making all references on this new server be bast4002, since bast4001 will...
[17:07:32] Traffic, Operations, ops-ulsfo: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#3730153 (RobH)
[17:14:34] and, no Content-Length...
[17:14:59] - 1509642844.220266 86.592459 0.046567 POST http://sk.wikipedia.org/api/rest_v1/transform/wikitext/to/html HTTP/1.1 200
[17:15:19] (first field would be CL: `varnishncsa -q 'Timestamp:Resp[2] > 30.0' -F '%{Content-Length}o %{VSL:Timestamp:Resp}x %r %s'`)
[17:22:38] * ema stops insinuating things and calls it a day
[17:23:26] oh one last thing: that varnishncsa query does not work on v5
[17:23:26] Unknown log tag: Timestamp:Resp
[17:23:41] because why should it, right?
[17:31:25] :)
[17:31:45] so, your 120/240s + REST, this reminds me of something we ran into before and worked around?
[17:31:54] I think it sounds like the original cause of our extrachance patch?
[17:32:15] (wasn't REST tripping extrachance over and over with minutes between chances, due to some timeout behind it with parsoid?)
[17:48:36] ema, bblack: I built new openssl 1.1 packages for jessie-wikimedia and uploaded them to apt.wikimedia.org, smoketest against pinkunicorn went fine
[17:48:47] I'll update our 1.0.2 packages tomorrow
[17:59:10] netops, Operations, fundraising-tech-ops, ops-eqiad: connect second interface for each frack to opposite switch for each eqiad host - https://phabricator.wikimedia.org/T176975#3643120 (Cmjohnson) the 2nd interfaces are connected, updated the switch descriptions, I did not enable the ports.
[18:00:25] moritzm: awesome
[18:23:50] ema: I made some design changes and updates to https://grafana.wikimedia.org/dashboard/db/tls-ciphers + https://grafana.wikimedia.org/dashboard/db/tls-ciphersuite-explorer
[18:24:31] tl;dr is the former now has fewer panels doing the seemingly-useful summary stuff, the latter has all the deep-dive details. 
Both now source the legacy+new tls data.
[18:25:01] the summaries on tls-ciphers have both datasets overlaid for continuity (not perfect, but works?), the latter has the raw data from both datasets if you're trying to look across the history boundary.
[18:25:13] I think with this, we can cut off the old data
[18:25:32] (with the caveat that I'm still looking into one anomaly where old+new don't line up well, not sure which is "wrong")
[19:00:21] bblack: do you think we will see an increase of legacy cipher after turning up Singapore?
[19:08:33] Traffic, Operations, Phabricator, Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3730584 (Dispenser) @Keegan You've got less than a fortnight!
[19:27:04] XioNoX: not immediately. it's the same users, they're just connecting faster/better.
[19:27:46] but over the long term, I tend to believe in the notion that when we drop latency in the region, it will increase readers/editors in the region (basically, less frustration with slowness on our part).
[19:28:07] and over that long term, yeah, I think the client mix in asia is generally "worse", so some stats will skew in the not-ideal direction :)
[19:41:07] netops, Operations, fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3730671 (Jgreen) Note ubuntu/trusty config is fairly different, here's a writeup that worked, I just changed bond-mode to active-backup: https://paulmellor...
[23:04:25] ema, bblack, how can I find out varnish's timeout until it closes an inactive TCP session?