[10:37:55] real hitrate figures for text@ulsfo (fully converted to ats-be) looked grim
[10:38:57] it turns out we were considering X-Cache with 'miss' anywhere as 'miss', even though we had a frontend pass, which was particularly bad in the case of ATS which does not distinguish between miss and pass
[10:39:21] fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/547716/, graphs look much better now
[10:39:38] see eg: https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&var-cluster=cache_text&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=1572603245441&to=1572604765454
[10:40:42] nice!
[10:42:03] hi paravoid! :)
[10:42:06] hi :)
[10:42:31] it's still not up to the levels of ~2 weeks ago, is that expected?
[10:46:11] so, you really have to compare with the same day/time
[10:46:25] the difference between day and night is huge for instance
[10:48:02] on 2019-10-02 @ 10:40:00 we had: frontend 56.6% local 64.6% overall 66.5%
[10:48:51] which is a bit less than right now: frontend ~63, local ~70, overall ~70
[10:52:44] there's quite some difference between days too though, it's true that on 2019-10-04T10:40 we had higher hitrates compared to right now (overall ~77%)
[10:57:32] at any rate (ha!) welcome back to our graphs, pass traffic
[10:57:47] https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&var-cluster=cache_text&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=1569187175964&to=1572604870546
[10:58:24] there's a visible drop in two phases, but I suppose it may be an artifact of that issue above :)
[10:59:15] very likely the reimages of text hosts to ATS
[11:00:39] and yeah you can tell it's an artifact of the pass issue above by looking at the shrinking of 'pass' and corresponding growing of 'miss' in the 'Disposition' graph
[11:02:44] ack
[11:02:53] interesting stuff, thanks :)
[11:04:12] yes! I spotted this by waiting for the hitrate comparison with last week to catch up after eqsin reimages to ats -> https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?refresh=15m&orgId=1&var-cluster=text&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[11:04:21] and it wasn't really catching up :)
[11:24:08] mmh we're also getting significantly more requests to ulsfo compared to last week
[11:24:29] 2.7K right now, ~400 last week
[14:07:21] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5011.eqsin.wmnet'] ` The log can be found in `/var/log/wm...
[14:53:34] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10faidon)
[15:00:24] 10Traffic, 10Operations: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5011.eqsin.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019110...
[15:17:55] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) >Hmm maybe it could make sense to store some TLS data like the ciphersuite, version or elliptic curve as integers assuming th...
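(A rough illustration of the kind of combined, X-Analytics-style field being discussed in T233661; the key names and integer codes below are invented for the example, not the format the task actually settled on.)

```python
# Hypothetical sketch only: key names and integer codes are illustrative.

TLS_VERSION_CODES = {"TLSv1.2": 2, "TLSv1.3": 3}

def build_tls_field(version, ciphersuite, curve, session_reused):
    """Serialize TLS connection attributes as k1=v1;k2=v2;... (X-Analytics style)."""
    pairs = {
        "vers": TLS_VERSION_CODES.get(version, 0),
        "ciph": ciphersuite,
        "curve": curve,
        "reuse": int(session_reused),
    }
    return ";".join(f"{k}={v}" for k, v in pairs.items())

def parse_tls_field(field):
    """Split the combined field back into a dict for downstream analysis."""
    return dict(item.split("=", 1) for item in field.split(";") if item)

# Example:
# build_tls_field("TLSv1.3", "TLS_AES_256_GCM_SHA384", "X25519", False)
# -> 'vers=3;ciph=TLS_AES_256_GCM_SHA384;curve=X25519;reuse=0'
```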
[15:32:53] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) @Nuria - what you're asking for is something like a combined TLS field with separators? e.g. we construct a 4-part semicolon-...
[15:43:39] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) @BBlack Varnish would send items to varnishkafka similar to how it is done for x-analytics: https://github.com/wikimedia/puppe...
[15:56:57] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) We probably don't need to send the reused value (it's not that useful for analysis at this level, IMHO), and we don't need t...
[15:57:48] 10Traffic, 10Operations: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5011.eqsin.wmnet'] ` and were **ALL** successful.
[16:08:26] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) >Also I'm assuming from how X-Analytics is set up that the format is k1=v1;k2=v2;.... (equal sign rather than colon). yes, co...
[16:11:49] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Update webrequest_128 dataset in turnilo to include TLS fields once available - https://phabricator.wikimedia.org/T237117 (10Nuria)
[16:14:47] 10netops, 10DC-Ops, 10Operations: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) A lot of back and forth with Juniper, current status is: test_consistency fails 70 test_missing_device_from_installed_base 4 test_missing_inventory_from_installed_base 183 Email...
[16:31:57] bblack: hello!
[16:32:12] so, there seems to be something strange going on with cp5011, which I've reimaged to ats today
[16:32:26] after reboot, some services do not start up as expected
[16:33:25] in particular: the nagios agent, vhtcpd, varnish-frontend (and its dependencies like all the varnishkafka stuff) do not start properly
[16:33:38] varnish-frontend fails with:
[16:33:40] > Nov 01 16:22:54 cp5011 varnish-frontend[1652]: Backend host '"cp5012.eqsin.wmnet"' could not be resolved
[16:34:03] I vaguely remember an issue related to broken name resolution at startup?
[16:34:45] manually running puppet on the host after boot fixes everything, as in, all services are started correctly
[16:38:55] I've pooled the host now as things look good, I'll investigate the reboot issues next week!
[16:38:55] yeah, I remember a pre-existing issue about that, but it was for the authdnses I think
[16:40:08] if we want to force a systemd unit to only start when name resolution is up, it needs an "After=nss-lookup.target"
[16:40:17] that's what we added to ferm.service
[16:40:24] interesting
[16:41:14] I'll give it a go on Monday, thanks moritzm!
[16:41:34] you can have systemd-analyze generate a Graphviz .dot file which prints the execution graph on startup
[16:41:48] but should help narrow it down
[16:42:27] should help narrow it down
[16:42:38] "that should help narrow it down" I meant
[16:50:15] "After=nss-lookup.target" seems troubling. this is the kind of thing old-school runlevels handled well :P
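(For reference, a minimal sketch of what the After=nss-lookup.target suggestion above could look like as a unit drop-in; the unit name and path are assumptions, following the usual `systemctl edit varnish-frontend.service` workflow rather than anything confirmed in puppet.)

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/varnish-frontend.service.d/override.conf
[Unit]
# Order startup after the name-resolution sync point, as was done for ferm.service
After=nss-lookup.target
```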
[16:50:32] you'd think the default case for most runtime daemons would be to depend on basics like network+dns
[16:50:50] do we need to go add this to a bunch of things?
[16:52:38] most daemons depend on network.target, but the manpage already mentions that it's weakly defined...
[16:52:56] (for that matter, in theory traditionally if network was up, DNS resolution was implicitly up. what special things are now involved in nss-lookup?
[16:53:03] )
[16:53:23] because systemd has some local daemon in the path ala nscd?
[16:53:40] (which shouldn't block direct libs resolver lookups anyways if it's not up, you'd think)
[16:53:49] s/libs/libc/
[16:54:51] nss-lookup.target is documented as the sync point for all host/network service lookups, but I'd need to look up what part of our bootup enables that
[16:55:19] systemd-networkd will probably fill it in, but it must be also provided when booting without it
[16:55:38] It's probably worth looking a bit closer on cp5012 which component was failing
[16:55:56] for almost all cases network.target vs. nss-lookup.target should make no material difference
[16:56:26] but maybe one of the ATS units which was trying to start missed either of the two
[16:57:22] the list of targets is documented in systemd.special(7), but I think it was evolving over time, so older units may miss some features which were added later
[16:57:50] and upstream units were probably conservative in adding features not available in current Debian/RHEL/SuSE releases as well
[19:28:13] alert on cp2011 because trafficserver is running 3 times, or rather there are 3 processes matching "/usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog". a common issue with process checks that check for exactly 1: seeing 3 (during restarts or something similar) means alerts. But sometimes it's actually something to fix.
[19:28:38] yeah, and it's already back. it's the common issue during restarts
[19:28:56] might want to change that icinga syntax to 1:3 or so
[21:38:58] bblack: any idea off the top of your head what could have caused this to start? https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=18&fullscreen&edit&tab=alert&orgId=1&from=now-15d&to=now
[21:39:32] it's only about the recursors, and matches with the end of esams week
[21:40:07] asking before I start digging
[21:49:12] XioNoX: legit tcp traffic to the recursors is tiny and mostly monitoring / health checks. So, in percentage terms, it doesn't take much anomaly to spike that graph.
[21:49:27] IIRC we've looked at this stuff in the past, and it was pybal
[21:49:47] (because we have pybal monitor recdns by making a content-free tcp connection which then times out, or whatever)
[21:51:04] we did replace the recursors in esams, and the pybal lvses that monitor them
[21:51:21] not sure why all the rest changed at the same time
[21:53:31] mostly wondering why it started to spike a week ago
[21:53:45] yeah I donno
[22:40:58] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Jclark-ctr)
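(On the 19:28 process-check noise: assuming the alert is backed by the standard monitoring-plugins check_procs, a min:max range like the "1:3" mentioned above would tolerate the brief overlap during restarts. The plugin path and thresholds here are an illustrative sketch, not the actual check definition.)

```sh
# Hypothetical example: critical unless between 1 and 3 matching processes exist
/usr/lib/nagios/plugins/check_procs -c 1:3 \
  -a '/usr/bin/traffic_manager --run-root=/srv/trafficserver/tls'
```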