[10:37:55] real hitrate figures for text@ulsfo (fully converted to ats-be) looked grim
[10:38:57] it turns out we were considering X-Cache with 'miss' anywhere as 'miss', even though we had a frontend pass, which was particularly bad in the case of ATS which does not distinguish between miss and pass
[10:39:21] fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/547716/, graphs look much better now
[10:39:38] see eg: https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&var-cluster=cache_text&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=1572603245441&to=1572604765454
[10:40:42] nice!
[10:42:03] hi paravoid! :)
[10:42:06] hi :)
[10:42:31] it's still not up to the levels of ~2 weeks ago, is that expected?
[10:46:11] so, you really have to compare with the same day/time
[10:46:25] the difference between day and night is huge for instance
[10:48:02] on 2019-10-02 @ 10:40:00 we had: frontend 56.6% local 64.6% overall 66.5%
[10:48:51] which is a bit less than right now: frontend ~63, local ~70, overall ~70
[10:52:44] there's quite some difference between days too though, it's true that on 2019-10-04T10:40 we had higher hitrates compared to right now (overall ~77%)
[10:57:32] at any rate (ha!) welcome back to our graphs, pass traffic
[10:57:47] https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&var-cluster=cache_text&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=1569187175964&to=1572604870546
[10:58:24] there's a visible drop in two phases, but I suppose it may be an artifact of that issue above :)
[10:59:15] very likely the reimages of text hosts to ATS
[11:00:39] and yeah you can tell it's an artifact of the pass issue above by looking at the shrinking of 'pass' and corresponding growing of 'miss' in the 'Disposition' graph
[11:02:44] ack
[11:02:53] interesting stuff, thanks :)
[11:04:12] yes! I spotted this by waiting for the hitrate comparison with last week to catch up after eqsin reimages to ats -> https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?refresh=15m&orgId=1&var-cluster=text&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[11:04:21] and it wasn't really catching up :)
[11:24:08] mmh we're also getting significantly more requests to ulsfo compared to last week
[11:24:29] 2.7K right now, ~400 last week
[14:07:21] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5011.eqsin.wmnet'] ` The log can be found in `/var/log/wm...
[14:53:34] 10netops, 10Operations, 10ops-esams: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10faidon)
[15:00:24] 10Traffic, 10Operations: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5011.eqsin.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019110...
[15:17:55] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) >Hmm maybe it could make sense to store some TLS data like the ciphersuite, version or elliptic curve as integers assuming th...
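(A rough illustration of the kind of combined, X-Analytics-style field being discussed in T233661; the key names and integer codes below are invented for the example, not the format the task actually settled on.)

```python
# Hypothetical sketch only: key names and integer codes are illustrative.

TLS_VERSION_CODES = {"TLSv1.2": 2, "TLSv1.3": 3}

def build_tls_field(version, ciphersuite, curve, session_reused):
    """Serialize TLS connection attributes as k1=v1;k2=v2;... (X-Analytics style)."""
    pairs = {
        "vers": TLS_VERSION_CODES.get(version, 0),
        "ciph": ciphersuite,
        "curve": curve,
        "reuse": int(session_reused),
    }
    return ";".join(f"{k}={v}" for k, v in pairs.items())

def parse_tls_field(field):
    """Split the combined field back into a dict for downstream analysis."""
    return dict(item.split("=", 1) for item in field.split(";") if item)

# Example:
# build_tls_field("TLSv1.3", "TLS_AES_256_GCM_SHA384", "X25519", False)
# -> 'vers=3;ciph=TLS_AES_256_GCM_SHA384;curve=X25519;reuse=0'
```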
[15:32:53] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) @Nuria - what you're asking for is something like a combined TLS field with separators? e.g. we construct a 4-part semicolon-...
[15:43:39] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) @BBlack Varnish would send items to varnishkafka similar to how it is done for x-analytics: https://github.com/wikimedia/puppe...
[15:56:57] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) We probably don't need to send the reused value (it's not that useful for analysis at this level, IMHO), and we don't need t...
[15:57:48] 10Traffic, 10Operations: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5011.eqsin.wmnet'] ` and were **ALL** successful.
[16:08:26] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) >Also I'm assuming from how X-Analytics is set up that the format is k1=v1;k2=v2;.... (equal sign rather than colon). yes, co...
[16:11:49] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations, 10observability: Update webrequest_128 dataset in turnilo to include TLS fields once available - https://phabricator.wikimedia.org/T237117 (10Nuria)
[16:14:47] 10netops, 10DC-Ops, 10Operations: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) A lot of back and forth with Juniper, current status is: test_consistency fails 70 test_missing_device_from_installed_base 4 test_missing_inventory_from_installed_base 183 Email...
[16:31:57] bblack: hello!
[16:32:12] so, there seems to be something strange going on with cp5011, which I've reimaged to ats today
[16:32:26] after reboot, some services do not start up as expected
[16:33:25] in particular: the nagios agent, vhtcpd, varnish-frontend (and its dependencies like all the varnishkafka stuff) do not start properly
[16:33:38] varnish-frontend fails with:
[16:33:40] > Nov 01 16:22:54 cp5011 varnish-frontend[1652]: Backend host '"cp5012.eqsin.wmnet"' could not be resolved
[16:34:03] I vaguely remember an issue related to broken name resolution at startup?
[16:34:45] manually running puppet on the host after boot fixes everything, as in, all services are started correctly
[16:38:55] I've pooled the host now as things look good, I'll investigate the reboot issues next week!
[16:38:55] yeah, I remember a pre-existing issue about that, but it was for the authdnses I think
[16:40:08] if we want to force a systemd unit to only start when name resolution is up, it needs an "After=nss-lookup.target"
[16:40:17] that's what we added to ferm.service
[16:40:24] interesting
[16:41:14] I'll give it a go on Monday, thanks moritzm!
[16:41:34] you can have systemd-analyze generate a Graphviz .dot file which prints the execution graph on startup
[16:41:48] but should help narrow it down
[16:42:27] should help narrow it down
[16:42:38] "that should help narrow it down" I meant
[16:50:15] "After=nss-lookup.target" seems troubling. this is the kind of thing old-school runlevels handled well :P
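(For reference, a minimal sketch of what the After=nss-lookup.target suggestion above could look like as a unit drop-in; the unit name and path are assumptions, following the usual `systemctl edit varnish-frontend.service` workflow rather than anything confirmed in puppet.)

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/varnish-frontend.service.d/override.conf
[Unit]
# Order startup after the name-resolution sync point, as was done for ferm.service
After=nss-lookup.target
```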
[16:50:32] you'd think the default case for most runtime daemons would be to depend on basics like network+dns
[16:50:50] do we need to go add this to a bunch of things?
[16:52:38] most daemons depend on network.target, but the manpage already mentions that it's weakly defined...
[16:52:56] (for that matter, in theory traditionally if network was up, DNS resolution was implicitly up. what special things are now involved in nss-lookup?
[16:53:03] )
[16:53:23] because systemd has some local daemon in the path ala nscd?
[16:53:40] (which shouldn't block direct libs resolver lookups anyways if it's not up, you'd think)
[16:53:49] s/libs/libc/
[16:54:51] nss-lookup.target is documented as the sync point for all host/network service lookups, but I'd need to look up what part of our bootup enables that
[16:55:19] systemd-networkd will probably fill it in, but it must be also provided when booting without it
[16:55:38] It's probably worth looking a bit closer on cp5012 which component was failing
[16:55:56] for almost all cases network.target vs. nss-lookup.target should make no material difference
[16:56:26] but maybe one of the ATS units which was trying to start missed either of the two
[16:57:22] the list of targets is documented in systemd.special(7), but I think it was evolving over time, so older units may miss some features which were added later
[16:57:50] and upstream units were probably conservative in adding features not available in current Debian/RHEL/SuSE releases as well
[19:28:13] alert on cp2011 because trafficserver is running 3 times, or rather there are 3 processes matching "/usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog". a common issue with process checks that check for exactly 1: seeing 3 (during restarts or something similar) means alerts. But sometimes it's actually something to fix.
[19:28:38] yeah, and it's already back. it's the common issue during restarts
[19:28:56] might want to change that icinga syntax to 1:3 or so
[21:38:58] bblack: any idea off the top of your head what could have caused this to start? https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=18&fullscreen&edit&tab=alert&orgId=1&from=now-15d&to=now
[21:39:32] it's only about the recursors, and matches with the end of esams week
[21:40:07] asking before I start digging
[21:49:12] XioNoX: legit tcp traffic to the recursors is tiny and mostly monitoring / health checks. So, in percentage terms, it doesn't take much anomaly to spike that graph.
[21:49:27] IIRC we've looked at this stuff in the past, and it was pybal
[21:49:47] (because we have pybal monitor recdns by making a content-free tcp connection which then times out, or whatever)
[21:51:04] we did replace the recursors in esams, and the pybal lvses that monitor them
[21:51:21] not sure why all the rest changed at the same time
[21:53:31] mostly wondering why it started to spike a week ago
[21:53:45] yeah I donno
[22:40:58] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Jclark-ctr)
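(On the 19:28 process-check noise: assuming the alert is backed by the standard monitoring-plugins check_procs, a min:max range like the "1:3" mentioned above would tolerate the brief overlap during restarts. The plugin path and thresholds here are an illustrative sketch, not the actual check definition.)

```sh
# Hypothetical example: critical unless between 1 and 3 matching processes exist
/usr/lib/nagios/plugins/check_procs -c 1:3 \
  -a '/usr/bin/traffic_manager --run-root=/srv/trafficserver/tls'
```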