[12:43:17] paravoid: picking up the thread from yesterday: if the singular connection check to each cache node (which checks the SAN list for unified) uses --no-sni, doesn't that pretty much cover everything more simply?
[12:43:39] sorry, left for dinner last night
[12:43:43] yes it should
[12:43:51] already +2'd the new PS :)
[12:54:47] thanks!
[13:12:16] ok, the status listing all the SANs is probably overkill :)
[13:12:28] I'll move that to verbose
[13:14:17] it's already there in verbose heh
[13:16:04] yeah heh
[13:16:25] once you're done I'll also merge the ECDSA change for sslxNN for posterity
[13:17:48] half of it I stole under a different Change-Id
[13:17:59] (the check_ssl half)
[13:18:02] yeah, I rebased already
[13:18:04] ok
[13:18:17] and git did the three-way merge automagically :)
[13:18:34] to circle back to something you said yesterday
[13:18:50] I very much doubt there is any check_ssl out there that checks all these things we're checking
[13:18:57] like ECDSA/RSA, for instance
[13:19:00] or the OCSP stuff now
[13:19:42] we could upstream it
[13:20:15] of course we're also implicitly depending on jessie's Net::SSLeay + IO::Socket::SSL versions; we could/should be explicit about that in the module imports
[13:51:10] what's with the 5xx alerts?
[13:51:15] WARNING: 100.00% of data above the warning threshold [250.0]
[13:51:24] for both esams and ulsfo; for esams for > 7h
[13:54:47] oh, I didn't notice them as they're warnings (not errors)
[13:54:47] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=esams&var-cache_type=All&var-status_type=5&from=now-7d&to=now
[13:54:50] nothing out of the ordinary, I'd say?
[13:56:29] yeah, something's been amiss with those graphite threshold checks this week
[13:56:38] I saw one for esams that persisted for hours, but it was nothing
[13:57:05] it could be that the background 500 rate on swift thumb requests is getting high, but it's not the kind of spiky pattern we expect to trip that "smart" alert
[16:56:37] godog: greetings! Can we close T147424, or is there more work to be done?
[16:56:37] T147424: Port varnish metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147424
[17:02:58] hey ema, still pending full deployment to all sites and cache roles
[17:04:45] godog: ok, let me know if/how I can help. Also, how did you want to proceed with the vhtcpd exporter (implementation-wise)?
[17:06:34] ema: sure! if you'd like to help complete the deployment, it is essentially a matter of extending to esams/eqiad as in https://gerrit.wikimedia.org/r/#/c/316742/
[17:06:58] ema: ditto for text and upload, and adding those to prometheus like in https://gerrit.wikimedia.org/r/#/c/315098/6/modules/role/manifests/prometheus/ops.pp
[17:10:26] godog: nice, I'll give it a go tomorrow then
[17:13:36] ema: sweet, thanks! feel free to add me to the reviews as FYI, I'll have some timezone lag anyway
[17:14:44] re: vhtcpd exporter, it should be reasonably easy to use the python prometheus client and parse/export the json
[17:15:04] I have something like that for hhvm, for which I need to push out a code review
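
A minimal sketch of the vhtcpd exporter approach mentioned at 17:14:44, assuming the Python prometheus client is used to parse and re-export a JSON stats file. The stats file path, JSON key names, metric names, and port below are illustrative assumptions, not vhtcpd's actual interface:

    #!/usr/bin/env python
    # Sketch only: reads a (hypothetical) JSON stats file written by vhtcpd
    # and re-exports its values over HTTP for Prometheus to scrape.
    import json
    import time

    from prometheus_client import Gauge, start_http_server

    # Hypothetical location of vhtcpd's JSON stats dump
    STATS_FILE = '/tmp/vhtcpd.stats'

    # Metric and JSON key names are assumptions for illustration
    queue_size = Gauge('vhtcpd_queue_size', 'Current purge queue size')
    inpkts_recvd = Gauge('vhtcpd_inpkts_received_total',
                         'HTCP packets received (absolute value from vhtcpd)')

    def collect():
        """Read the stats file and update the exported metrics."""
        with open(STATS_FILE) as f:
            stats = json.load(f)
        queue_size.set(stats['queue_size'])
        inpkts_recvd.set(stats['inpkts_recvd'])

    if __name__ == '__main__':
        start_http_server(9247)  # arbitrary port chosen for the example
        while True:
            collect()
            time.sleep(15)

Counters parsed from an external file are exported as Gauges here because prometheus_client's Counter type can only be incremented, not set to an absolute value read from elsewhere.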