[12:33:44] ema: vgutierrez: getting a puppet error on all cp5* nodes
[12:34:20] https://phabricator.wikimedia.org/P15963
[12:34:45] jbond42: looking
[12:34:50] thx
[12:35:25] jbond42: how about other DCs?
[12:35:52] ema: didn't show in puppetboard just checking manually
[12:38:15] ema: just tested on cp3054 and worked fine, also updated the prefix change from /23 -> /20 i.e. it forced the reload logic
[12:38:24] mmh, maybe the issue is related to a7be77c4
[12:38:48] it seems that puppet fails because, eg, be_cp5013_eqsin_wmnet cannot be found
[12:39:27] (on upload)
[12:39:39] on text it's be_cp5015_eqsin_wmnet that cannot be found
[12:40:08] what are those strings?
[12:40:28] that's the way we call them in varnish
[12:40:35] ack
[12:40:49] so those are "the backends" from varnish-fe's point of view
[12:41:29] so i think specifically this https://gerrit.wikimedia.org/r/c/operations/puppet/+/683026/5/conftool-data/node/eqsin.yaml looks like the bit that may have been committed too early?
[12:41:57] I see that cp5015 for example is up and looks good, but it's not receiving traffic
[12:42:15] not sure if we were planning on putting those nodes in prod already or not
[12:42:30] they are in prod
[12:42:52] or should be, anyways
[12:43:13] perhaps the confctl part is missing?
[12:43:20] here to help if needed but this is beyond my varnish/ats knowledge
[12:43:27] it was there days ago when I turned them on
[12:44:38] $ confctl select 'name=cp501[3456].eqsin.wmnet' get
[12:44:38] {"cp5015.eqsin.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=eqsin,cluster=cache_text,service=varnish-fe"}
[12:44:47] [... all the expected bits ... ]
[12:44:56] yeah
[12:45:13] maybe I missed some new place we need to copy the names to?
[12:45:20] or some regex is maybe more likely
[12:45:55] hieradata/common/cache.yaml I think
[12:46:13] yup, just found it heh
[12:46:38] we need fewer sources of truth! :)
[12:46:42] most definitely :)
[12:47:10] bblack: should I add them or are you already on it?
[12:47:38] I got it
[12:57:44] the interesting bit here, is I think our puppetization generally catches VCL compilation errors on the causing change, but didn't here?
[12:58:04] (and so it was lying in wait for the next VCL-affecting change?)
[13:00:40] I think this is because the VCL comes in through a couple of different mechanisms, but still something should've alerted
[13:01:03] (basically the initial patch defined the confd-templated part, but not the base-VCL part with the backend defs that the confd-templated part references)
[13:01:16] right, I think that your puppet change did not touch the VCL
[13:01:37] well, it touched the confd part indirectly, and I guess confd's VCL reloads should've failed shortly after
[13:02:23] it's been like that for over a week
[13:08:36] yeah, so puppet did not fail because your change did not touch the VCL directly, but still the confd-based reload should have alerted
[13:09:07] requests coming in at the backend layer now https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?viewPanel=36&orgId=1&from=now-1h&to=now&var-site=eqsin%20prometheus%2Fops&var-instance=cp5015&var-layer=backend
[14:56:56] bblack: so confd ends up calling /usr/local/bin/confd-reload-vcl, which ends up writing a state file -- /var/run/reload-vcl-state
[14:57:25] right now that file is KO for all of eqsin (cp[5001-5016].eqsin.wmnet), while it's OK on all other nodes
[14:58:24] there's definitely something fishy, I'll file a task
[15:01:21] yeah, I'm guessing the reason that statefile is still in a bad state, is that confd hasn't triggered its own reload (due to a pool status change) since the fixup
[15:02:32] I vaguely remember that we used to have an alert for this though, but icinga seems suspiciously happy
[15:03:33] yeah
[15:06:00] I do also confd alerts, and, I think something about the bad state file needing to be cleaned up by hand?
[15:06:04] s/also/also remember/
[15:06:39] bblack@cp5016:~$ cat /var/run/reload-vcl-state
[15:06:39] KO
[15:06:39] bblack@cp5016:~$ /usr/local/lib/nagios/plugins/check_vcl_reload
[15:06:40] reload-vcl successfully ran 52h, 22 minutes ago.
[15:07:57] the script has bugs
[15:09:32] if [ "x${STATE}"=="xOK" ]; then
[15:09:34] should be:
[15:09:38] if [ "x${STATE}" = "xOK" ]; then
[15:10:23] that bug's been there since 2015 heh
[15:11:38] by now it's not an insect anymore, it's part of the family
[15:17:15] it's also kind of interesting that shellcheck doesn't flag that
[15:17:33] I mean, I guess it's all one string and it's legit for == to be part of string contents
[15:18:09] but this seems like something it should at least be heuristically warning about (something about a bare == outside of quoting anywhere in a test-condition)
[15:30:33] Traffic: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (ema)
[15:33:06] Traffic: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (ema)
[15:36:07] ema: in the bigger picture, aside from fixing any little bugs like this: we should probably move forward with experimentation on the local-backend-only stuff (which is not a current Q priority, but maybe if we have free time at the end, or we tackle it next Q early)
[15:36:28] because if that line of simplifications works out, we can just drop everything about etcd/confd for backend cache layer.
[16:50:51] netops, Data-Persistence-Backup, SRE: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) This is not very urgent, but I am generating backups from eqiad to codfw at 173Mbps, which takes...
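
Note on the `[ "x${STATE}"=="xOK" ]` bug from 15:09: a minimal sketch of why the check always reported success, assuming a plain POSIX `sh`. The `STATE` value is hard-coded here as a stand-in for the contents of /var/run/reload-vcl-state, and the `buggy` / `fixed` variables are hypothetical names used only for illustration; this is not the actual check_vcl_reload script.

```shell
#!/bin/sh
# Hypothetical stand-in for the value read from the reload state file.
STATE="KO"

# Buggy form: with no spaces around ==, the shell passes ONE word
# ("xKO==xOK") to test. A single non-empty argument is always true,
# so this branch is taken whatever STATE contains.
if [ "x${STATE}"=="xOK" ]; then
    buggy="ok"
else
    buggy="ko"
fi

# Fixed form: three separate arguments, so test performs a real
# string comparison ("=" is the portable POSIX operator).
if [ "x${STATE}" = "xOK" ]; then
    fixed="ok"
else
    fixed="ko"
fi

echo "buggy=${buggy} fixed=${fixed}"
```

With `STATE="KO"` this prints `buggy=ok fixed=ko`, which matches the symptom seen on cp5016: the state file said KO while the check still reported a successful reload.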