[12:33:44] <jbond42>	 ema: vgutierrez: getting a puppet error on all cp5* nodes
[12:34:20] <jbond42>	 https://phabricator.wikimedia.org/P15963
[12:34:45] <ema>	 jbond42: looking
[12:34:50] <jbond42>	 thx
[12:35:25] <ema>	 jbond42: how about other DCs?
[12:35:52] <jbond42>	 ema: didn't show in puppetboard, just checking manually
[12:38:15] <jbond42>	 ema: just tested on cp3054 and worked fine, also updated the prefix change from /23 -> /20 i.e. it forced the reload logic
[12:38:24] <ema>	 mmh, maybe the issue is related to a7be77c4
[12:38:48] <ema>	 it seems that puppet fails because, eg, be_cp5013_eqsin_wmnet cannot be found
[12:39:27] <ema>	 (on upload)
[12:39:39] <ema>	 on text it's be_cp5015_eqsin_wmnet that cannot be found
[12:40:08] <jbond42>	 what are those strings?
[12:40:28] <ema>	 that's the way we call them in varnish
[12:40:35] <jbond42>	 ack
[12:40:49] <ema>	 so those are "the backends" from varnish-fe's point of view
[12:41:29] <jbond42>	 so i think specifically this https://gerrit.wikimedia.org/r/c/operations/puppet/+/683026/5/conftool-data/node/eqsin.yaml looks like the bit that may have been committed too early?
[12:41:57] <ema>	 I see that cp5015 for example is up and looks good, but it's not receiving traffic
[12:42:15] <ema>	 not sure if we were planning on putting those nodes in prod already or not 
[12:42:30] <bblack>	 they are in prod
[12:42:52] <bblack>	 or should be, anyways
[12:43:13] <ema>	 perhaps the confctl part is missing?
[12:43:20] <jbond42>	 here to help if needed but this is beyond my varnish/ats knowledge
[12:43:27] <bblack>	 it was there days ago when I turned them on
[12:44:38] <bblack>	 $ confctl select 'name=cp501[3456].eqsin.wmnet' get
[12:44:38] <bblack>	 {"cp5015.eqsin.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=eqsin,cluster=cache_text,service=varnish-fe"}
[12:44:47] <bblack>	 [... all the expected bits ... ]
[12:44:56] <ema>	 yeah
[12:45:13] <bblack>	 maybe I missed some new place we need to copy the names to?
[12:45:20] <bblack>	 or some regex is maybe more likely
[12:45:55] <ema>	 hieradata/common/cache.yaml I think
[12:46:13] <bblack>	 yup, just found it heh
[12:46:38] <bblack>	 we need fewer sources of truth! :)
[12:46:42] <ema>	 most definitely :)
[12:47:10] <ema>	 bblack: should I add them or are you already on it?
[12:47:38] <bblack>	 I got it
[12:57:44] <bblack>	 the interesting bit here, is I think our puppetization generally catches VCL compilation errors on the causing change, but didn't here?
[12:58:04] <bblack>	 (and so it was lying in wait for the next VCL-affecting change?)
[13:00:40] <bblack>	 I think this is because the VCL comes in through a couple of different mechanisms, but still something should've alerted
[13:01:03] <bblack>	 (basically the initial patch defined the confd-templated part, but not the base-VCL part with the backend defs that the confd-templated part references)
[13:01:16] <ema>	 right, I think that your puppet change did not touch the VCL
[13:01:37] <bblack>	 well, it touched the confd part indirectly, and I guess confd's VCL reloads should've failed shortly after
[13:02:23] <bblack>	 it's been like that for over a week
[13:08:36] <ema>	 yeah, so puppet did not fail because your change did not touch the VCL directly, but still the confd-based reload should have alerted
[13:09:07] <ema>	 requests coming in at the backend layer now https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?viewPanel=36&orgId=1&from=now-1h&to=now&var-site=eqsin%20prometheus%2Fops&var-instance=cp5015&var-layer=backend
[14:56:56] <ema>	 bblack: so confd ends up calling /usr/local/bin/confd-reload-vcl, which ends up writing a state file -- /var/run/reload-vcl-state
[14:57:25] <ema>	 right now that file is KO for all of eqsin (cp[5001-5016].eqsin.wmnet), while it's OK on all other nodes
[14:58:24] <ema>	 there's definitely something fishy, I'll file a task
[15:01:21] <bblack>	 yeah, I'm guessing the reason that statefile is still in a bad state, is that confd hasn't triggered its own reload (due to a pool status change) since the fixup
[15:02:32] <ema>	 I vaguely remember that we used to have an alert for this though, but icinga seems suspiciously happy 
[15:03:33] <bblack>	 yeah
[15:06:00] <cdanis>	 I do also confd alerts, and, I think something about the bad state file needing to be cleaned up by hand?
[15:06:04] <cdanis>	 s/also/also remember/
[15:06:39] <bblack>	 bblack@cp5016:~$ cat /var/run/reload-vcl-state
[15:06:39] <bblack>	 KO
[15:06:39] <bblack>	 bblack@cp5016:~$ /usr/local/lib/nagios/plugins/check_vcl_reload
[15:06:40] <bblack>	 reload-vcl successfully ran 52h, 22 minutes ago.
[15:07:57] <bblack>	 the script has bugs
[15:09:32] <bblack>	 if [ "x${STATE}"=="xOK" ]; then
[15:09:34] <bblack>	 should be:
[15:09:38] <bblack>	 if [ "x${STATE}" = "xOK" ]; then
[15:10:23] <bblack>	 that bug's been there since 2015 heh
[15:11:38] <ema>	 by now it's not an insect anymore, it's part of the family
[15:17:15] <bblack>	 it's also kind of interesting that shellcheck doesn't flag that
[15:17:33] <bblack>	 I mean, I guess it's all one string and it's legit for == to be part of string contents
[15:18:09] <bblack>	 but this seems like something it should at least be heuristically warning about (something about a bare == outside of quoting anywhere in a test-condition)
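The always-true behaviour discussed above can be reproduced with a minimal sketch. It assumes only a variable named STATE holding the contents of the state file; the `buggy` and `fixed` variable names are made up for illustration and are not taken from check_vcl_reload. Without spaces around the operator, `[` receives the single word `xKO==xOK` and, with one argument, only tests that the string is non-empty.

```shell
#!/bin/sh
# Illustrative only: STATE stands in for the contents of the state file.
STATE="KO"

# Buggy form: "x${STATE}"=="xOK" expands to ONE word, "xKO==xOK".
# With a single argument, [ succeeds whenever that word is non-empty,
# so this branch is taken regardless of the value of STATE.
if [ "x${STATE}"=="xOK" ]; then
    buggy="match"
else
    buggy="no match"
fi

# Fixed form: spaces make this a three-argument string comparison.
if [ "x${STATE}" = "xOK" ]; then
    fixed="match"
else
    fixed="no match"
fi

echo "buggy: ${buggy}"
echo "fixed: ${fixed}"
```

With STATE set to KO, the buggy test still reports a match while the fixed one does not, which is consistent with the check reporting success against a KO state file.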
[15:30:33] <wikibugs>	 Traffic: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (ema)
[15:33:06] <wikibugs>	 Traffic: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (ema)
[15:36:07] <bblack>	 ema: in the bigger picture, aside from fixing any little bugs like this: we should probably move forward with experimentation on the local-backend-only stuff (which is not a current Q priority, but maybe if we have free time at the end, or we tackle it next Q early)
[15:36:28] <bblack>	 because if that line of simplifications works out, we can just drop everything about etcd/confd for backend cache layer.
[16:50:51] <wikibugs>	 netops, Data-Persistence-Backup, SRE: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) This is not very urgent, but I am generating backups from eqiad to codfw at 173Mbps, which takes...