[00:31:25] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4008801 (ayounsi) [00:41:42] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4008830 (ayounsi) Added the table to the description with current options. Servers are racked in rack 7 of each rows. 1013=A7, 1014=B7, 1015=C7, 1016=D7. The easiest to do... [00:42:23] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4008831 (ayounsi) [02:40:07] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4008993 (Tgr) [02:42:23] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4008997 (Tgr) [02:50:05] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4009008 (Tgr) As @Anomie noted there, {T186965} is fixed (or will be once the patch is merged) for wikis using Remex but not for ones u... [03:41:45] Traffic, DNS, Operations: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4009069 (Dzahn) Is there a specific thing you want to achieve with this move? I can do this but would be nice to have on the ticket a tiny bit of rea... [03:51:39] Traffic, DNS, Operations: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4009072 (Prtksxna) [03:52:13] Traffic, DNS, Operations: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4004707 (Prtksxna) >>! In T188362#4009069, @Dzahn wrote: > Is there a specific thing you want to achieve with this move? I can do this but would be n... [03:56:46] Traffic, DNS, Operations: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4009079 (Dzahn) Thank you! I'll take this and it should not be a problem . Regarding the private ticket i don't have permissions to read it yet it se... [10:31:08] so puppet is failing on lvs1007 because of `ethtool -L eth0 combined 16`: [10:31:15] Cannot set device channel parameters: Invalid argument [10:32:40] ethtool -l eth0 says that the 'Pre-set maximum' for Combined is 15, so perhaps we should ensure that ${num_queues} is not higher than the maximum allowed in modules/interface/manifests/rps.pp?
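For context, a minimal sketch of the clamping ema suggests here, assuming the value is derived from `ethtool -l` output; the function names and argument handling are illustrative, not the actual rps.pp / interface-rps logic. (Note that later in the log bblack argues for leaving the failure in place, so an unexpected hardware limit fails loudly rather than being silently papered over.)

```python
#!/usr/bin/env python3
"""Clamp the requested RSS queue count to the NIC's 'Pre-set maximum'.

Illustrative sketch only; the real logic would live in
modules/interface/manifests/rps.pp and its helpers.
"""
import re
import subprocess
import sys


def combined_max(device):
    """Return the 'Pre-set maximums' Combined value reported by ethtool -l."""
    out = subprocess.check_output(['ethtool', '-l', device], text=True)
    # Keep only the "Pre-set maximums" block, before "Current hardware settings:".
    preset = out.split('Current hardware settings:')[0]
    match = re.search(r'^Combined:\s+(\d+)', preset, re.MULTILINE)
    return int(match.group(1)) if match else None


def set_queues(device, wanted):
    maximum = combined_max(device)
    if maximum is not None and wanted > maximum:
        print('%s: capping combined queues %d -> %d' % (device, wanted, maximum))
        wanted = maximum
    subprocess.check_call(['ethtool', '-L', device, 'combined', str(wanted)])


if __name__ == '__main__':
    # e.g. ./clamp_queues.py eth0 16  -- on lvs1007 this would end up setting 15
    set_queues(sys.argv[1], int(sys.argv[2]))
```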
[10:35:02] yeah eth[01] on lvs1007 seem the only case where the Pre-set maximum for combined is less than the number of physical cores (15 vs 16) [11:36:09] godog: so, vgutierrez is currently working on a icinga check to alert us whenever pybal does not have a bgp session in state established [11:36:34] the approach followed so far is to check for the relevant prometheus metric: https://gerrit.wikimedia.org/r/#/c/415260/3/modules/pybal/files/check_pybal_bgp_sessions.py [11:37:00] I'm wondering if we should instead: [11:37:20] a) come up with a generic icinga check to alert whenever a prometheus metric is/isn't within a certain range [11:37:26] fwiw my quick pass was only on the python side, I was missing the detail for the rest ;) [11:37:35] ema: we already have it [11:37:48] b) add a /bgp endpoint to pybal's instrumentation and expose the info there [11:38:42] volans: oh, check_prometheus_metric.sh [11:38:53] .py ;) [11:39:01] has been rewritten recently by filippo [11:39:28] nice one [11:42:12] vgutierrez: so yeah, perhaps give a shot to ./modules/prometheus/files/usr/local/lib/nagios/plugins/check_prometheus.py ? [11:42:35] I was checking it [11:45:04] so yeah I need to actually replace the check with https://gerrit.wikimedia.org/r/c/413142/ [11:45:15] which I'll likely do later today, FWIW [12:02:55] we'll need to get in another way if BGP is enabled or not for that pybal instance, but besides that, I guess that we could check directly if pybal_bgp_session_established is 1.0 or not [12:05:06] so I missed much of this discussion [12:05:11] but why wouldn't that be exposed as another metric? [12:05:38] that was the other approach proposed by ema [12:06:09] besides sending the data to prometheus, create a /bgp endpoint in the instrumentation interface and check based on that [12:07:25] yeah we've added a few other prometheus metrics that are not strictly metrics but expose configuration [12:07:29] (for example, depool_threshold) [12:07:34] because that's very convenient for monitoring [12:11:54] we're only exposing if BGP is enabled or not, everything else are metrics: (BGP sessions established) + (FSM state for every BGP peer) [12:12:03] yeah [12:12:25] yeah I think we should definitely expose both the bgp connection state and whether it's enabled or not to prometheus [12:12:32] my comment was about the icinga check itself [12:13:11] about the icinga check.. 
the only question is if we want to have a strong dependency on prometheus [12:13:14] I don't know the internals, but technically you could use -1 as disable in the number of bgp sessions [12:13:36] *disabled [12:14:27] well most of what we expose on pybal's own instrumentation interface could be exposed via prometheus instead, arguably [12:15:22] so you'd have something like pybal_bgp_session_established -1.0 when BGP is disabled and pybal_bgp_session_established{asn="64496",peer="10.192.16.140"} 1.0 when it's enabled (and the BGP session is established) [12:15:51] something like that, yes [12:15:51] that could do the trick as well [12:16:10] and we could leverage check_prometheus_metric.py for that [12:16:37] ema: sure but is harder to do a mixed check, the prometheous checks are done on the Icinga hosts querying directly prometheus, adding a call to specific hosts for the pybal instrumentation might need some refactoring [12:18:49] volans: we might be saying the same thing, I was thinking that we could get rid of pybal's instrumentation altogether and expose pool info via prometheus instead :) [12:19:03] ahhh sorry thought the opposite :D [12:19:27] BTW, how the prometheus check filters based on the host pushing the metric? [12:19:54] I guess that's a filter on the prometheus query itself [12:21:08] yeah usually is how it's done, using ${::hostname} as part of the query on the puppet side of it [12:22:00] I'm sure what I'm proposing is not kosher, but: how about running the check against pybal's prometheus endpoint locally instead of querying the prometheus server? [12:23:19] that's not possible I'm afraid [12:23:36] the check requires prometheus API [12:23:55] s/not possible/not implemented/ [12:23:57] I mean, check_prometheus.py uses the query API [12:24:35] sure, but that would be a completely different check [12:24:36] ema: for single metrics yeah that's possibile, connect and parse + check the values [12:24:45] for actually using the query language no [12:24:47] also you couldn't apply any function [12:25:01] to mangle the data [12:25:28] so, as an aside [12:25:37] i'm very happy about more prometheus metrics of course, now bgp [12:25:44] but i'm not very happy with bgp.py getting tied to pybal infra [12:26:06] ideally prometheus metrics should be sent/implemented in pybal code, not in the bgp classes [12:26:06] do you see it as an external module? [12:26:15] yes, it very much was [12:26:21] it was just included in pybal for convenience once [12:26:27] (and because I never properly finished it) [12:26:34] but it was never meant for pybal in the first place [12:26:40] a lot of bgp.py code pybal doesn't even use [12:27:12] godog: more in general, I guess this depends on what the philosophy of prometheus really is. Are the various exporters also to be considered as a local API to extract information (perhaps consumed by something who is not prometheus), or is that considered bad practice? 
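Stepping back, a rough sketch of the Prometheus-API-based check being discussed (option (a) above), filtering on the host as done with ${::hostname} on the puppet side, and using the "-1 means BGP disabled" convention volans suggests. The Prometheus URL, the instance label matching and the exit-code handling are assumptions; this is not check_prometheus.py or check_pybal_bgp_sessions.py.

```python
#!/usr/bin/env python3
"""Sketch of an Icinga-style check for pybal BGP sessions via the Prometheus query API."""
import sys
import requests

OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_bgp_sessions(prometheus_url, hostname):
    # Filter on the host exposing the metric (hypothetical label matching).
    query = 'pybal_bgp_session_established{instance=~"%s.*"}' % hostname
    resp = requests.get('%s/api/v1/query' % prometheus_url, params={'query': query})
    resp.raise_for_status()
    results = resp.json()['data']['result']
    if not results:
        return UNKNOWN, 'no pybal_bgp_session_established samples for %s' % hostname

    down = []
    for sample in results:
        value = float(sample['value'][1])
        if value < 0:        # -1.0: BGP disabled for this instance, nothing to alert on
            continue
        if value < 1:        # 0.0: session not in ESTABLISHED state
            down.append('%(peer)s (AS%(asn)s)' % sample['metric'])
    if down:
        return CRITICAL, 'BGP sessions not established: %s' % ', '.join(down)
    return OK, 'all BGP sessions established (or BGP disabled)'


if __name__ == '__main__':
    # Assumed server URL and a hostname argument, e.g. lvs1016
    status, message = check_bgp_sessions('http://prometheus.example:9090', sys.argv[1])
    print(message)
    sys.exit(status)
```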
[12:28:09] also, all this raises the question of how much pybal should be tied/dependant to our infrastructure [12:28:32] we made the new metrics a bit "generic" [12:28:41] it uses the prometheus client but could be implemented using something else [12:28:44] ema: personally I don't see it as a bad practice if you're polling specific values, in particular if called locally [12:30:05] mark: some callbacks could be implemented on the BGP side to monitor state changes in the FSM and in BGPPeering, so the especific pybal details could live inside bgpfailover.py instead in bgp/bgp.py [12:30:09] exactly [12:30:12] :) [12:30:30] same thing happened with logging btw [12:30:45] bgp.py used simple print statements, thinking you could always redirect that [12:30:47] a bit crude [12:30:50] now it uses pybal's logging interface [12:31:06] which is much better of course, but less independent [12:31:13] bgp.py is tightly tired with twisted, right? [12:31:17] yes [12:31:20] vgutierrez: ok if polling the prometheus exporter for purposes other than prometheus itself is fine and can be encouraged, we're in a happy place without adding anything to pybal's own instrumentation [12:31:25] i call it "twisted bgp" [12:31:27] the problem is twisted, not python logging module ;) [12:31:31] * volans hides [12:31:39] pybal* [12:31:42] mark: so it should use twisted logger instead of print or pybal logging [12:31:52] yeah it could [12:32:04] i realize none of this is high priority [12:32:06] whereas fixing bugs is [12:32:11] just making you guys aware ;) [12:32:17] volans: I agree with you, after a week now I love/hate twisted A LOT [12:32:28] what is up with you guys [12:32:32] i need to make this a standard hiring question [12:32:36] ema: not really a local api no, the idea is to go through prometheu server [12:32:36] twisted is beautiful [12:32:36] hhahhahaha [12:32:38] rotfl [12:33:08] it's showing its age a bit [12:34:19] so, just for quick context: i wrote bgp.py as a generic twisted bgp module, but i mostly wrote it for analyzing (incoming) bgp route table info [12:34:27] pybal does the opposite, it just sends routes [12:34:40] i happened to have code that could do that (naively ;-) so I glued them together [12:35:12] 10 lines of code and suddenly pybal had bgp support [12:41:40] then at some point we should put some efforts in making bgp.py independent again [12:41:55] now we need to close bugs and gain visibility on what's going on with the BGP side of pybal [12:42:46] yup, just keep that in mind [12:43:01] and I'm the new guy here, so I don't have an strong opinion in going in one way or another.. so... :9090/metrics or Prometheus query API? :) [12:43:36] i don't really see any advantage to going with a proprietary pybal interface tbh [12:43:51] maybe if the prometheus metrics api format isn't flexible enough [12:43:58] hmmm actually :9090/metrics is prometheus already [12:44:05] it is that endpoint yes [12:44:16] (https://github.com/prometheus/client_python/blob/master/prometheus_client/twisted/_exposition.py) [12:44:16] but the rest on that port is proprietary (I think?) [12:44:20] indeed [12:44:27] giuseppe wrote that [12:46:00] cache_text @ codfw upgraded [12:46:03] time to eat now! 
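A sketch of the kind of decoupling mark and vgutierrez are discussing above: bgp.py only exposes a generic state-change hook, while the pybal side (bgpfailover.py, say) owns the Prometheus gauges. Class, method and callback names are made up for illustration; this is not the actual pybal code.

```python
# Hypothetical sketch -- not the real pybal/bgpfailover.py.
from prometheus_client import Gauge

# Defined once at module level, so each timeseries is registered exactly once
# in the default CollectorRegistry (re-creating them per peer would raise
# "Timeseries already present in CollectorRegistry").
BGP_ENABLED = Gauge('pybal_bgp_enabled',
                    'Whether BGP is enabled for this pybal instance')
BGP_SESSION_ESTABLISHED = Gauge(
    'pybal_bgp_session_established',
    'Whether the BGP session with this peer is established',
    ['asn', 'peer'])


class BGPFailover:
    """pybal-side glue: subscribes to bgp.py state changes, updates metrics."""

    def __init__(self, peerings, enabled):
        BGP_ENABLED.set(1 if enabled else 0)
        for peering in peerings:
            # The callback API is the proposed part: bgp.py would invoke it on
            # every FSM transition, with no knowledge of Prometheus or pybal.
            peering.setStateChangeCallback(self.peerStateChanged)

    def peerStateChanged(self, asn, peer, state):
        established = 1 if state == 'ESTABLISHED' else 0
        BGP_SESSION_ESTABLISHED.labels(asn=str(asn), peer=peer).set(established)
```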
[12:46:26] i think i'll continue writing unit tests for coordinator.py today [12:46:34] so if we're going to use prometheus to trigger alerts, the easiest way was the one suggested by volans, set a negative value when BGP is disabled and leverage check_prometheus.py [12:46:36] i wrote a bunch on the plane to SF last month [12:47:05] vgutierrez: +1 [12:47:34] and at some point, move everything to prometheus and deprecate :9090 [12:47:43] or just keep the :9090/metrics endpoint [12:47:51] and get rid of the propietary part [12:48:11] i haven't looked at what format the proprietary part uses [12:48:15] it may not be suitable to prometheus [12:48:19] it may not just be metrics [12:48:20] but I don't know :) [12:51:34] * mark vaguely recalls the last unit test function he wrote wasn't complete because breakfast was being served on the plane [13:10:18] pybal/coordinator.py 106 3 97% 239-241 [13:10:20] getting there [13:51:55] ema: re lvs1007 combined limit (a) that case doesn't matter anymore since they're decomming and (b) if we do deploy hardware where the bnx2x "combined" limit is the limiting factor, we'll probably have to do something more-complicated in general in the future, so it might be good to leave the failure in place so we notice it on the next future host it happens on. [13:58:01] ValueError: Timeseries already present in CollectorRegistry: pybal_bgp_session_established [13:58:21] *sigh* mark looks like I'm moving metrics generation outside bgp.py sooner than expected [13:58:29] (also, now that I stare at that bit, I'm not sure ethtool's report of "Combined" is accurate as to hardware limitations either. The bnx2x boot-time param may cause a lower limit there that can later be evaded by reboot...) [13:59:18] vgutierrez: well you can just define those metrics once at the top of bgp.py or such [13:59:29] hmm right [14:03:32] so one idea for callbacks is to let pybal provide an instrumented FSM derived from bgp.FSM [14:03:46] but that might not cover all your needs (tcp ports etc...) [14:12:36] pybal_bgp_session_established{asn="None",peer="None"} -1.0 [14:13:22] hm [14:13:32] is it feasible to not even provide those labels? [14:13:36] not sure that's a good idea [14:13:53] once the metric has labels defined you need to submit them [14:14:09] I tried without, but it doesn't work [14:14:22] that's why the current approach in master is cleaner [14:14:28] one metric signaling BGP [14:14:34] pybal_bgp_enabled 1|0 [14:15:03] yeah [14:15:28] but then I cannot leverage check_prometheus.py check :) [14:15:43] unless prometheus allows conditional queries [14:16:52] it has and/or/unless [14:16:55] maybe that works? [14:17:08] vgutierrez: how come you couldn't use check_prometheus.py with pybal_bgp_enabled ? 
[14:17:26] sorry I missed the previous context heh so might be a dumb question [14:18:05] let's consider this set of metrics [14:18:06] pybal_bgp_enabled 1.0 [14:18:06] pybal_bgp_session_established{asn="64496",peer="10.192.16.140"} 1.0 [14:18:06] pybal_bgp_session_state{asn="64496",local_peer="10.192.16.139",remote_peer="10.192.16.140",side="active",state="ESTABLISHED"} 1.0 [14:18:09] pybal_bgp_session_state{asn="64496",local_peer="10.192.16.139",remote_peer="10.192.16.140",side="active",state="OPENCONFIRM"} 0.0 [14:18:12] pybal_bgp_session_state{asn="64496",local_peer="10.192.16.139",remote_peer="10.192.16.140",side="active",state="OPENSENT"} 0.0 [14:18:31] the check should verify that pybal_bgp_enabled == 1.0 and then that every pybal_bgp_session_established == 1.0 [14:18:44] so pybal_bgp_enabled - there's also enabling per service [14:18:53] and if pybal_bgp_enabled == 0.0 then don't do anything [14:19:08] and report as OK :) [14:19:45] yay [14:19:53] godog: maybe is doable and it's only a layer 8 limitation on my side [14:19:55] pybal unit test coverage just went 'yellow' on coveralls [14:20:56] vgutierrez: lol, yeah I think it is doable, what mark said with and/or/unless [14:21:19] mark: well, as long as one service has BGP enabled, pybal needs one BGP session on state ESTABLISHED for every ASN/peer [14:21:29] yes [14:22:02] godog: oh nice, where can I run some query tests? [14:22:09] but while you're at it, might be nice to also be able to see whether it's enabled per service [14:22:27] although that's more of a pybal_service namespace thing I guess [14:22:42] paybal_service_bgp_failover? [14:23:16] pybal_service_bgp_enabled I guess [14:24:39] vgutierrez: heh so one option is to run prometheus locally, also prometheus in production will pull metrics e.g. from pybal-test [14:25:15] there's also a prometheus server in "beta" (aka deployment-prep) at http://beta-prometheus.wmflabs.org/beta/graph but no pybal there afaik [14:26:39] vgutierrez: i guess the general scheme you could test with any other already existing metrics in pybal production [14:26:57] just pick two boolean/int metrics that don't share any labels, make the query work ;) [14:27:58] yup [14:29:56] btw, https://wikitech.wikimedia.org/wiki/Prometheus#Access_Prometheus_web_interface that's outdated [14:31:04] I get a 404 in /ops and a default apache index in / if I follow that [14:32:52] just use grafana ;) [14:33:02] (i am lazy) [14:33:04] hahahah [14:33:25] vgutierrez: I usually use ssh -N -L 8080:prometheus1003.eqiad.wmnet:80 prometheus1003.eqiad.wmnet [14:34:51] right, that works :) [14:35:38] ugh, the documentation above should work tho [14:35:44] I'll check [14:36:57] ah so localhost vs the hostname for the tunnel, what the [14:37:22] ema: apparently during a normal cron restart of cp1074 varnish-be back on Feb 18th, it never repooled itself (varnishd came up fine, though) [14:37:40] vgutierrez: "fixed" the instructions [14:37:58] ema: probably the final confctl command failed, I guess at least for the cron case, we probably don't get that via any channel except maybe email? [14:38:35] ema: (or not at all, I don't see a cron email either) [14:38:40] anyways, fixed now [14:38:43] godog: thx :D [14:40:36] (on that note though, probably the most-robust solution is we should maybe have an icinga alert from the host itself checking whether its own services are depooled for an extended period of time, somehow. 
Or something related to that sort of idea) [14:42:14] bblack: you made me thinking if comparing the whole etcd current status with data in contool-data might be a good idea too, with some exceptions maybe [14:43:00] ofc notifying/alarming only if any given value is different from the expected one for a long period of time maybe [14:43:11] volans: yeah, it might even be that in the case of cp1074 above the status on etcd wasn't updated in the first place though [14:43:56] (also, conftool-data defaults everything to pooled=no, and also we do sometimes have very-long-term pooled=no for some reason, but usually there's a ticket I guess) [14:44:36] true, forgot the pooled=no default [14:45:43] godog: BTW, could you check https://gerrit.wikimedia.org/r/#/c/413211/ discussion at some point? it's related the UDP Monitor you asked for in T178151 :) [14:45:43] T178151: Add UDP monitor for pybal - https://phabricator.wikimedia.org/T178151 [14:45:50] another angle to see this is that the crontab is the wrong tool for the job and a job scheduler might be better and would alert on failure :D [14:46:38] * bblack stabs volans [14:46:50] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&panelId=3&fullscreen&from=now-7d&to=now [14:47:08] ema: ^ v5 seems to have reduced the daily failed-fetch pattern for esams-text [14:47:40] bblack: yeah I was staring at that graph earlier on too \o/ [14:48:03] and when you have time, poke holes in: https://gerrit.wikimedia.org/r/#/c/415204/ [14:48:30] bblack: so, I've ACKed lvs1007's alert now. In the future we do want puppet to fail if something similar happens on another host then, right? [14:48:39] vgutierrez: for sure! I've added myself to the review [14:48:45] ema: correct [14:48:55] ok [14:48:56] vgutierrez: otherwise reviews tend to fall off my radar (JFYI) [14:49:17] (because otherwise, future optimizations might assume the value we picked could be set, and cause e.g. an nginx-vs-ethdriver mismatch that we otherwise wouldn't notice) [14:51:02] bblack: oh, did you find any metric in prometheus relevant to memory accesses served over the QPI? We should probably add that to prometheus-machine-stats for all hosts! [14:52:17] ema: nope. I think you can only get that from CPU performance counters that perf sees. it may not be a reasonable "normal" stat [14:52:57] ema: we do have the numa_(hit|miss|foreign) stuff, but it's counting accesses and hard to interpret globally [14:53:55] so numa_hit=cache hit, numa_miss=local memory access, numa_foreign=other node? [14:54:56] http://blog.tsunanet.net/2011/06/clarifications-on-linuxs-numa-stats.html [14:55:39] interesting [14:57:09] (but also, I think the word "process" is over-used in those descriptions. I suspect the kernel's own allocations are similarly tracked in the same stats, based on where the mem comes from vs where the allocating thread was running at that moment in time) [14:58:02] in general, it's hard to make sense of it usefully, though [15:01:00] as an example, we expect from that cp4021 is healthier in numa terms than cp4023. And several non-numa stats seem to reflect this (e.g. spikiness of memory-usage graphs, etc). 
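The per-node counters being compared below can be read straight from sysfs; a minimal sketch for pulling them and computing the hit/(hit+miss+foreign) style ratio discussed a bit further down. Purely illustrative, and the "winningness" ratio is just the variant ema suggests, not an established metric.

```python
#!/usr/bin/env python3
"""Read /sys/devices/system/node/node*/numastat and summarise per-node stats."""
import glob
import os


def read_numastat(path):
    stats = {}
    with open(os.path.join(path, 'numastat')) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats


for node in sorted(glob.glob('/sys/devices/system/node/node[0-9]*')):
    s = read_numastat(node)
    hit, miss, foreign = s['numa_hit'], s['numa_miss'], s['numa_foreign']
    total = hit + miss + foreign
    ratio = hit / total if total else float('nan')
    print('%s: hit=%d miss=%d foreign=%d hit_ratio=%.4f'
          % (os.path.basename(node), hit, miss, foreign, ratio))
```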
[15:01:18] but this is the comparative numastat after a few months of it: [15:01:21] https://phabricator.wikimedia.org/P6756 [15:02:09] so cp4021 (the good numa_networking:on node) has a node0 numa_miss that's ~4 orders of magnitude lower, and a numa_foreign that's ~2 orders higher [15:03:04] then on node1 numa_hit that's ~1 order lower, numa_miss ~2 higher, numa_foreign ~4 lower [15:03:23] and they've got comparable uptime [15:03:40] so you can see some of those numeric shifts seem like wins, and some don't [15:04:46] some of the wins may reflect the nginx+bnx2x numa_networking improvements, some of the losses may reflect other processes (e.g. varnishes) operating a little differently under these conditions in terms of choosing cpus+mems from auto-sched, at hopefully acceptable tradeoffs [15:04:59] but it's hard to just dump the stats and say it's a clear win because X, at this level [15:06:03] the biggest win seems smoother memory usage on cp4021 compared to cp4023 then [15:06:35] (much smoother) [15:09:00] right, which is probably driven by skb allocations and such for incoming traffic (which we set our excessive vm.min_free_kbytes to help accomodate) being confined to node0 where nginx is running, leaving a more-stable state on node1 that's not constantly evicting evictable disk cache pages and whatnot. [15:09:17] but you can only guess really without looking a whole lot deeper :) [15:12:24] in the overall on some level though, it's comparable numbers to say: +numa_hit is a win, and -sum(numa_foreign+numa_miss) is a loss [15:12:44] and if you sum things up that way, it's a considerable net win [15:14:14] IOW: numa_winningness(higher is better) = numa_hit - (numa_foreign + numa_miss) [15:15:53] or something like that [15:16:18] numa_hit/(numa_hit+numa_foreign+numa_miss) ? [15:20:26] starting text@eqiad upgrades [15:21:19] awesome [15:21:30] (yes, the hit/everything formula seems a good indication of winningness AFAIU :)) [15:21:38] * vgutierrez running again to the dentist... see you in a while (hopefully) [15:21:42] well, it's not quite right either :) [15:21:44] cya [15:21:48] good luck vgutierrez! [15:22:05] maybe numa_hit/(numa_foreign+numa_miss) makes a better ratio [15:22:16] there's some variance in the totals, too [15:27:49] either way, even though my "count orders of magnitude" summary seems like a net win, any say hit/miss+foreign sort of ratio gives 4023 the better outcome numbers lol [15:29:23] I suspect the bottom line is that it's hard to make much sense of a total-system stat like this, on a machine like the cp [15:29:23] heh [15:29:47] where there are 3 competing daemons doing different things, which have different impacts we care about, etc [15:29:59] I mean those spikes in memory usage on cp4023 must be reflected in some stats somewhere right? :) [15:30:11] yes, in the memory stats you see the spikes in :) [15:30:20] yeah besides those! :P [15:33:21] 10Traffic, 10DNS, 10Operations, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4010716 (10Dzahn) Access is still denied to me on T175445 which is surprising because i'm a member of WMF-NDA and can read securi... [15:36:06] mutante: I can't see it either FWIW [15:37:25] nevermind the irony of a super-private task and repo + separate "-private" domain for Transparency :P [15:40:23] bblack: yea.. 
"transparency-private" will always be a reason for question [15:40:46] and it's a locked down Gerrit repo with user:pass [15:40:54] which we don't do anywhere else, afaict [15:42:19] maybe it's their equivalent of puppet-private and has some auth creds or something, I donno [15:45:15] yea, i wasn't going to open that entire discussion on that specific ticket yet.. since so far it's just asking to move the same stuff to a different place [17:10:32] bblack: found out the reason why 415204 was failing on cache_misc, see CR :) [17:13:55] yeah I see it :) [17:20:28] ema: fixed-up (had run PCC too, but neglected to include a cache_misc [17:21:30] also I totally expect that patch has multiple levels of puppet-rollout-race-dependency issues with files appearing on fileservers before they're used, possibly reload commands first executing with wrong arguments before the file updates, etc. [17:21:42] but I think I can get past those with a fleet puppet-disable and some care [17:22:11] working on this also made me realize a couple of other corner-cases we haven't dealt with that are hopefully non-issues in practice: [17:23:31] 1) vcl reloads triggered by confd for pooled=X actions (in lots of nodes in response to a given node going depooled for cron'd restart), are not exlcusive with puppet runs that could also be reloading vcl, possibly causing a vcl reload race of some kind [17:25:04] 2) the case for load->use delay for backends becoming live also applies to freshly-started daemons. I think our restart scripts use a 10s delay after start which is enough to cover all cases. I wonder if there are other cases we might miss it when restarting without those scripts. [17:25:45] (maybe it makes sense to put the load->use sleep into the systemd unit itself, as some kind of post-start sleep before systemd considers the unit fully active? [17:25:48] ) [17:27:28] I think that in practice 2) isn't really an issue, given that we usually either use the restart scripts or perform stop/start operations while the service is depooled [17:28:16] perhaps the unit could curl varnish as a sanity check before going "up" though :) [17:31:39] 1) is more scary on paper, I don't think we've ever considered what can happen with concurrent reloads? [17:36:53] well they're never truly concurrent, I think the part of varnishd receiving the CLI commands is basically single-threaded. [17:37:07] so one will load slightly after the other, and one will use slightly after the other. [17:37:18] and maybe some discards fail to discard already-discarded things [17:38:31] right, I was more thinking of "concurrent" in the sense of possible races between multiple vcl.load / vcl.use stepping on each other [17:38:54] since they all generated their own UUID, in theory it should be ok [17:39:20] it may well end up ordered as load1,load2,use2,use1 of course, but at the end of it some valid VCL will be loaded [17:39:32] err, used :) [17:40:03] perhaps not the one you'd expect though :) [17:40:10] the other layer of question is whether this leaves things in a state where it's failed to pick up an important change [17:40:13] yeah [17:41:48] in a effort to make easy things complicated, I'd propose to hash the whole VCL and use that as the UUID [17:41:59] lol [17:43:28] the use statement does reference the UUID though, so that part's not really in question [17:43:55] but the real case is puppet could be applying a functional VCL patch while confd is applying a backend-repool patch. 
[17:44:10] and the resulting sequential version of events could be: [17:44:41] 1) confd edits directors.vcl [17:44:52] well nevermind that sequence [17:45:22] if both actors are sure to sync VCL file changes to disk before doing a load, the final of the two loads will always contain both changes. [17:45:35] it's just the load/use mixup that's a risk [17:45:55] 1) confd edits directors.vcl [17:46:07] 2) confd loads as uuid=X [17:46:18] 3) puppet edits other.vcl [17:46:30] 4) puppet loads (incl directors change) as uuid=Y [17:46:35] 5) puppet uses uuid=Y [17:46:43] 6) confd uses uuid=X [17:46:59] 7) runtime now continues without the functional change, until some future thing causes another reload. [17:47:22] but it relies on pretty much the above ordering of events to cause the problem [17:47:36] if 1+3 happened before 2, the race would be meaningless [17:48:07] so one of the acts has to (write new VCL files on disk + load + use) during the window between the other's (load+use) [17:48:14] s/acts/actors/ [17:48:33] my not very serious comment about hashing the VCL was meant to ensure we have "the expected" vcl loaded/used on all instances (icinga check or whatnot) [17:48:40] by checking whether the hashes match [17:48:57] checking matches between different hosts in the same site+cluster against each other? [17:49:33] yes, assuming VCL files are identical in the same site+cluster+layer which might not the case [17:49:35] in the general case it would be just move the race elsewhere. it would be a race to who wrote down their expected hash in some place for icinga to check against, first. [17:50:48] but, since a meaningful race problem can only be created by sequences like load1,(diskchange2),load2,use2,use1 [17:51:17] merely having the reload-vcl script itself lock for exclusivity with other instances of the same script around (at least) its load->use sequence would prevent it [17:52:19] nice, yeah [19:04:22] 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4011439 (10Anomie) [19:24:10] ema: so like I was saying earlier, probably the vcl-reload patch is best done with puppet-disables on the cp fleet, and waiting a few after puppet-merge to ensure the file is updated from all POVs on the master(s). And even then, it's possible it ends up taking 2x puppet runs to get past some initial dependency-race failure. [19:24:32] ema: I'll plan to do it in my time tomorrow, unless you get bored and want to try it first [19:49:33] bblack: ok! 
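For reference, a minimal sketch of the locking idea from the race discussion above: hold an exclusive lock across the load→use sequence so puppet- and confd-triggered reloads cannot interleave. The lock path, VCL naming and varnishadm invocation details are assumptions, not the actual reload-vcl script; since the on-disk VCL is synced before loading, whichever reload takes the lock last compiles a VCL containing both actors' changes.

```python
#!/usr/bin/env python3
"""Serialise VCL reloads: take an exclusive lock around vcl.load + vcl.use."""
import fcntl
import subprocess
import sys
import uuid

LOCKFILE = '/run/reload-vcl.lock'   # assumed path


def varnishadm(*args):
    subprocess.check_call(('varnishadm',) + args)


def reload_vcl(vcl_path):
    vcl_name = 'vcl-%s' % uuid.uuid4()
    with open(LOCKFILE, 'w') as lock:
        # Block until any other reload-vcl instance (puppet- or confd-driven)
        # has finished its own load->use sequence.
        fcntl.flock(lock, fcntl.LOCK_EX)
        varnishadm('vcl.load', vcl_name, vcl_path)
        varnishadm('vcl.use', vcl_name)
        # The lock is released when the file object is closed.


if __name__ == '__main__':
    reload_vcl(sys.argv[1])
```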
[20:06:45] text @ eqiad done, that was the very last cluster to be upgraded \o/ [20:07:04] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Upgrade cache_upload to Varnish 5 - https://phabricator.wikimedia.org/T180433#4011659 (ema) [20:07:07] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Upgrade cache_text to Varnish 5 - https://phabricator.wikimedia.org/T184448#4011657 (ema) Open>Resolved a:ema [20:07:23] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#4011660 (ema) Open>Resolved a:ema [20:45:34] Traffic, Operations: Post Varnish 5 migration cleanup - https://phabricator.wikimedia.org/T188545#4011757 (ema) [20:45:43] Traffic, Operations: Post Varnish 5 migration cleanup - https://phabricator.wikimedia.org/T188545#4011773 (ema) p:Triage>Normal [21:04:53] Traffic, Operations: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#4011880 (ema) >>! In T165651#3704112, @BBlack wrote: > I don't think anything has changed since on Google's end. Do we try harder or just accept it? I guess we should try harder. :) The list... [22:02:27] ema: my thinking about no longer having to be concerned about frontend vcl.discard crashes was perhaps faulty :) [22:03:06] ema: because so long as we still have old VCLs loaded that pre-date both libvmod-netmapper update(+restart) *and* pre-date the fixing of the fetching of zero JSON files.... [22:03:30] ema: then those particular VCLs will still crash/hang the varnish process when you try to discard them. [22:03:54] ema: but I can fix that with vmod updates and restarts across eqsin pretty quick! :)