[00:01:42] Traffic, Operations: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3590184 (BBlack)
[00:16:04] https://help.zscaler.com/zia/zscaler-support-tls-1.2
[00:16:15] zscaler is similarly-retarded
[00:16:38] their current documentation there basically says they don't support any forward-secret ciphers at all :P
[00:17:34] leaving aside the whole nasty business of whether SSL intercept middleboxes/services should even exist.... if you're going to be in the business of mucking with TLS, you could at least track current trends and standards at least as well as $random_open_source_stacks
[00:18:58] "Zscaler does not support the following cipher suites due to security or compatibility issues: ..., ECDHE, ..."
[00:19:17] (and then they don't even mention anything DHE-related on either list)
[00:27:32] if spy proxies are the only significant sources of non-HSTS traffic, we should totally fucking cut them off. delaying security improvements due to professional security violators is way too ironic
[00:28:28] really cute how these guys just put RC4 and ECDHE in the same bin
[06:55:51] godog: hi!
[06:55:56] https://gerrit.wikimedia.org/r/#/c/376665/1/modules/role/files/prometheus/rules_ops.conf
[06:56:22] does this look OK? The queries as currently defined on grafana are:
[06:56:42] sum(rate(node_ipvs_backend_connections_active{instance=~"$server:.*"}[5m])) by(local_port,local_address)
[06:56:45] sum(rate(node_ipvs_backend_connections_active[5m])) by(instance)
[06:58:45] then if I understand the idea correctly we're gonna have to update the dashboards with something like:
[06:59:12] local_port_local_address:node_ipvs_backend_connections_active:rate5m{instance="$server:.*"}
[07:12:36] bblack: still planning on doing this? https://gerrit.wikimedia.org/r/#/c/345591/
[07:42:54] netops, Analytics, Operations, User-Elukey: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3590602 (elukey) The next step is to design and add the `analytics-in6` filter to cr1/cr2 eqiad, but I would wait for kafka1012-1022 to be decommissioned before that. Those h...
[08:17:07] ema: reviewed, yeah then the dashboard will need to be updated to what you mentioned
[08:28:43] ema: I'm having trouble making sense of the data we have in graphite for varnish (https://goo.gl/62vqCz)
[08:29:24] on that graph, the GET rate is way larger than the sum of all *xx responses rate... do you know what I am missing?
[08:37:30] gehel: interesting
[08:37:43] godog: thanks!
[08:37:50] yeah, very much so !
[08:38:20] I suddenly realized that traffic on maps is most probably way less than what I thought (confirmed by our analyst).
[08:39:34] gehel: the counts seem correct
[08:39:39] https://graphite.wikimedia.org/render/?width=586&height=282&_salt=1504859896.799&target=varnish.eqiad.backends.be_kartotherian_svc_eqiad_wmnet.2xx.count&target=varnish.eqiad.backends.be_kartotherian_svc_eqiad_wmnet.GET.count
[08:40:25] but what does the count represent then? I would expect a count to be a monotonic counter...
[08:43:48] those stats are from statsd IIRC, count there is reset at each aggregation interval, not monotonic
[08:44:44] ok, so count is a rate (events per aggregation period)
[08:45:04] looks like :)
[08:45:22] as I remember, the rate exposed by statsd should be the same, just normalized to per second
[08:45:36] yes those stats are generated by varnishreqstats and sent to statsd
[08:45:45] gehel: yeah I think so too
[08:45:54] Oh no, the events are probably not sent as +1 increments, so the rate might be something which does not make actual sense...
[08:46:19] and they look like this: varnish.eqiad.backends.be_kartotherian_svc_eqiad_wmnet.2xx:1|c
[08:46:26] gehel: there is some documentation re: that whole naming snafu https://wikitech.wikimedia.org/w/index.php?title=Graphite#Extended_properties
[08:47:04] going afk for a bit, bbl
[08:50:09] godog: Oh, thanks!
[08:51:03] godog: actually, I'm even more confused now... According to that doc, lower is "The lowest single increment in this interval" (idem for upper). So I would expect in this case to get lower=1 upper=1 (we should do single increments).
[08:51:13] Or is there an aggregation going on before statsd?
[08:51:53] And if we have an aggregation before statsd, then the rate should be used, the count will not reflect the reality we are traing to measure
[08:52:02] s/traing/trying/
[08:52:08] * gehel is probably still missing something
[08:56:33] gehel: afaik there's no aggregation before statsd no, upper and lower are not 1 e.g. for the metric that ema pasted above?
[08:57:17] nope, not at all
[08:57:34] https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1504861045.955&target=varnish.eqiad.backends.be_kartotherian_svc_eqiad_wmnet.GET.lower&target=varnish.eqiad.backends.be_kartotherian_svc_eqiad_wmnet.GET.upper
[08:59:19] that's the .GET not .2xx, not sure how the former gets sent
[08:59:36] gehel: sorry I can't debug this further now :(
[09:00:11] no problem, I'll keep poking at it if I have some time and I'll rely on Bearloga's data for now...
[09:16:39] netops, Operations, monitoring, User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3590739 (fgiunchedi) FTR: username is root and password is the management password. I checked the webui for ps1-a2-eqiad and the current/voltage readings are...
[09:37:48] gehel: varnish.eqiad.backends.be_kartotherian_svc_eqiad_wmnet.GET:59|ms
[09:38:02] this is what GET looks like, which should explain the situation I guess
[09:38:22] Oh, so it's a timer, not a counter! Makes more sense now
[09:38:38] thanks!
[09:38:44] so that's probably why "count" is the same?
[09:39:50] and I should be using sample_rate instead
[09:40:04] thanks a lot!
[09:41:00] np!
[13:45:33] ema: re maps a/a, it's really whenever the applayer is comfortable with it
[14:03:26] bblack: right, I've just seen the patch and was wondering if it's still something we plan on doing
[14:04:12] up to gehel :)
[14:04:41] ema, bblack: I added a comment on the patch
[14:05:19] I'd love to do that, but given the state of the maps team, it is not trivial to find some time to check that codfw is doing as well as it should before sending traffic there.
[14:05:35] I have a note to bring that to the table in our next standup
[14:05:59] gehel: nice, thanks!
[14:06:12] yeah... sorry for the delay :(
[14:08:02] Traffic, Operations: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3591573 (Suhadakashter)
[14:08:59] Traffic, Operations: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3590184 (Suhadakashter)
[14:09:28] Traffic, Operations: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3591614 (Reedy) duplicate>Open
[14:12:02] netops, Operations, monitoring, User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3591637 (fgiunchedi) I checked with @mark and the current readings are per-phase, since we're using "3 Wye" phase configuration and a server will consume from...
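Note on the 09:37-09:39 exchange above: the counter/timer distinction can be illustrated with a simplified model of a single statsd flush interval. This is only a sketch under stated assumptions, not the actual statsd or statsite code: the function names `parse_line` and `aggregate`, the 10-second flush interval, the shortened metric names, and the output key names are all hypothetical. It is meant to show why `.count` differs between `|c` and `|ms` metrics: for a counter it is the sum of increments in the interval, while for a timer it is the number of timing samples received, and `lower`/`upper` are the smallest and largest sample values.

```python
from collections import defaultdict


def parse_line(line):
    """Split a statsd-style line such as 'name:59|ms' into (name, value, type)."""
    name, rest = line.rsplit(":", 1)
    value, metric_type = rest.split("|")[:2]
    return name, float(value), metric_type


def aggregate(lines, flush_interval_s=10.0):
    """Aggregate one flush interval's worth of lines (simplified model)."""
    counters = defaultdict(float)
    timers = defaultdict(list)
    for line in lines:
        name, value, metric_type = parse_line(line)
        if metric_type == "c":
            counters[name] += value      # counters: add up the increments
        elif metric_type == "ms":
            timers[name].append(value)   # timers: keep every sample

    out = {}
    for name, total in counters.items():
        out[name + ".count"] = total                      # reset every interval
        out[name + ".rate"] = total / flush_interval_s    # normalized per second
    for name, samples in timers.items():
        out[name + ".count"] = len(samples)               # number of samples
        out[name + ".lower"] = min(samples)               # smallest timing seen
        out[name + ".upper"] = max(samples)               # largest timing seen
        out[name + ".mean"] = sum(samples) / len(samples)
    return out


if __name__ == "__main__":
    sample = ["be.2xx:1|c", "be.2xx:1|c",
              "be.GET:59|ms", "be.GET:12|ms", "be.GET:80|ms"]
    for key, value in sorted(aggregate(sample).items()):
        print(key, value)
```

In this model, two `2xx:1|c` lines give a 2xx count of 2, while three `GET:...|ms` lines give a GET count of 3 with lower and upper equal to the fastest and slowest timings rather than to increment sizes, which would be consistent with the lower/upper graph not showing 1.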
[14:12:47] http://frankdenneman.nl/2016/07/06/introduction-2016-numa-deep-dive-series/
[14:13:04] (from the famous dropbox traffic post)
[14:14:54] netops, Operations, monitoring, User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3591653 (fgiunchedi) The graphs for codfw and eqiad are also reported here (stacked) https://grafana.wikimedia.org/dashboard/db/site-power-usage
[15:20:57] I didn't realize before, but our modern jessie "top" has numa-awareness too
[15:21:11] the 1 hotkey I knew from before split out cpu core views
[15:21:20] Traffic, Operations, Community-Liaisons (Jul-Sep 2017), Patch-For-Review, User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591837 (Johan) A couple of discussions: [[ https://en.w...
[15:21:31] the "2" hotkey shows numa node lines at the top, and the "3" hotkey will do per-core stats limited to a single given node
[15:21:40] %Cpu(s): 1.8 us, 1.1 sy, 0.0 ni, 95.6 id, 0.1 wa, 0.0 hi, 1.4 si, 0.0 st
[15:21:43] %Node0 : 2.7 us, 1.7 sy, 0.0 ni, 92.8 id, 0.0 wa, 0.0 hi, 2.9 si, 0.0 st
[15:21:46] %Node1 : 0.7 us, 0.5 sy, 0.0 ni, 98.6 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
[15:21:59] nice! good to know
[15:22:14] so far nothing has fallen apart with cp4021 pooled into ulsfo's upload cluster
[15:22:22] will let that sit a bit before trying a text node too
[15:39:21] some basic numa stat comparisons on 4021 vs 4005, the numa hit ratio on node0 (where network/frontend is) goes from 86% hit to 99.2% hit. and numa foreign access drops by an order of magnitude
[15:39:27] so pretty much as expected I think
[15:40:09] they're very different hardware of course, but the numa foreign access and hit/miss ratios should still be comparable between the old nodes' "pretend it's UMA when it's not" and 4021's NUMA-isolated config.
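Note on the 15:39 NUMA comparison above: a per-node hit ratio like the 86% and 99.2% figures can be derived from the Linux per-node counters exposed in `/sys/devices/system/node/nodeN/numastat`. The following is a minimal sketch, assuming the ratio is defined as numa_hit / (numa_hit + numa_miss). The log does not say how the figures were actually computed, and the helper names and the sample numbers below are made up for illustration.

```python
def parse_numastat(text):
    """Parse 'key value' lines, as found in a per-node numastat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def hit_ratio(stats):
    """Fraction of allocations satisfied on the intended node (assumed definition)."""
    total = stats["numa_hit"] + stats["numa_miss"]
    return stats["numa_hit"] / total if total else float("nan")


def read_node(node=0):
    """Read the counters for one NUMA node from sysfs (Linux only)."""
    path = "/sys/devices/system/node/node%d/numastat" % node
    with open(path) as handle:
        return parse_numastat(handle.read())


if __name__ == "__main__":
    # Made-up sample values, not data from cp4021 or cp4005.
    sample = """numa_hit 992
numa_miss 8
numa_foreign 3
interleave_hit 0
local_node 990
other_node 10
"""
    stats = parse_numastat(sample)
    print("hit ratio: %.1f%%" % (100 * hit_ratio(stats)))
    print("foreign:", stats["numa_foreign"])
```

These counters are cumulative since boot, so comparing two hosts is more meaningful over deltas taken across the same time window than over raw totals.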
[15:51:12] Traffic, Diamond, Operations, monitoring, and 2 others: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3591908 (akosiaris) Open>Resolved Patches created and merged. Got a basic dashboard working at https://grafana.wikimedia.org/dashbo...
[16:04:43] hey look, Alex made a new grafana dashboard for DNS
[16:04:44] https://grafana.wikimedia.org/dashboard/db/dns-recursors?orgId=1
[16:05:02] (akosiaris closed T169600: Enable diamond PowerDNSRecursor collector on dnsrecursors as Resolved.)
[16:05:03] T169600: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600
[16:06:05] Traffic, Operations, Community-Liaisons (Jul-Sep 2017), Patch-For-Review, User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591939 (BBlack) Thanks! So far, I haven't heard of any...
[16:07:26] mutante: awesome
[16:07:58] I wonder what's up with the hydrogen/chromium numbers. I can see how they might have the given unreasonably-high query rates (some spammy software somewhere in eqiad), but then the hit/miss counters don't add up to those numbers...
[16:09:01] hmm hit/miss and query count don't add up right for the others, too
[16:09:09] I think one is in units of /sec and the other /min
[16:09:15] (queries/sec, hitmiss/min)
[16:09:36] err, other way around heh
[16:09:40] queries/min, hitmiss/sec
[16:11:14] questions over IPv6 is always 2.000 and never anything else, but also not 0
[16:23:09] probably monitoring probes but not live traffic?
[16:23:22] we define v6 resolver addrs, but we don't configure them for use anywhere I think
[16:28:18] oh, yea, that sounds pretty likely.. monitoring probes
[16:28:24] aha
[16:37:42] Traffic, Operations, Community-Liaisons (Jul-Sep 2017), Patch-For-Review, User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3591981 (Johan) To be honest, most of the community is b...