[07:30:26] bblack: +1!
[07:58:11] Traffic, SRE, Patch-For-Review: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (ema) p:Triage→Medium
[08:43:30] hmmm puppet's been disabled on cp5016 since Friday but is still pooled
[08:43:41] The last Puppet run was at Fri May 14 14:47:47 UTC 2021 (3954 minutes ago). Puppet is disabled. reason not specified
[08:44:12] so it was not even disabled with disable-puppet scrip
[08:44:14] t
[08:44:28] the script sets a reason always
[08:44:49] ema: from friday I see you've been debugging some vcl reload stuff in there with b.black
[08:46:13] ema: I've depooled the node cause it already had invalid OCSP stapling responses for wikiworkshop.org (caused by puppet being disabled)
[08:47:52] please let me know if it's ok to enable puppet on cp5016
[08:47:53] vgutierrez: ack, yeah we might have used cp5016 to figure out T282880
[08:47:54] T282880: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880
[08:48:04] yes please, feel free to re-enable
[08:48:08] ack
[08:58:18] vgutierrez: your depool/pool also fixed /var/run/reload-vcl-state on text@eqsin (it was still KO before)
[08:58:36] sorry? :)
[08:59:25] so due to T282880 all nodes in eqsin had KO in /var/run/reload-vcl-state instead of OK
[08:59:26] T282880: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880
[08:59:50] depooling / pooling caused confd to call /usr/local/bin/confd-reload-vcl (see /etc/confd/conf.d/_etc_varnish_directors.frontend.vcl.toml) and hence fix the state file
[09:18:44] Traffic, SRE, Patch-For-Review: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (Volans) a:crusnov→None
[09:22:48] volans: I've trained my muscle memory to automatically answer "76" to cumin already, nice try though!
[09:25:16] ema: lol, hopefully that number is not that stable, single DC, multiple DCs, all of them, etc... :-P
[09:35:23] reCAPTCHA it is, then!
[09:37:32] roman numerals
[11:34:00] Traffic, SRE: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (jbond) > and is driven by the cache::nodes['upload']['eqsin'] hiera setting. In relation to this would it be better to pull this information directly from puppetdb. This would mean that list would cont...
[11:41:59] Traffic, SRE: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (MMandere) >>! In T282787#7085273, @Volans wrote: > Other random things that needs to be updated sooner or later. I hope you don't mind if I drop them here, feel free to move them to a...
[12:09:01] Traffic, Analytics, SRE, Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (JAllemandou) @CDanis the patch for Druid is there - sorry for not having acted quicker.
[12:10:33] Traffic, Analytics, Analytics-Kanban, SRE, Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (JAllemandou) a:JAllemandou
[13:12:35] win 30
[15:29:36] netops, Data-Persistence-Backup, SRE: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (LSobanski) Bonus question, is there an option for some traffic shaping / QoS to remediate the above autom...
[15:41:27] Traffic, SRE, Patch-For-Review: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (jbond) > I created a quick [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/692286/10 | PoC ]] which would be called with e.g `wmflib::cache::nodes('upload', 'eqsin')` After...
[17:06:31] netops, Analytics, SRE: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (ayounsi) @razzi from our IRC chat, the way I'd approach it is: - for all the removed IPs, check if the host still exist, most of the cases it's just that the host is gone and the ACL never got updat...
[18:37:37] Traffic: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (ssingh)
[18:37:57] Traffic: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (ssingh)
[18:38:03] Traffic, SRE, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[22:41:48] Acme-chief, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (Bstorm) At this point, we have suggested something like a regular restart for the service via systemd timer. That should be an easy enough fix...
[23:32:00] Acme-chief, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (Dzahn) You could take the script that Icinga _would_ use but use it yourself without all the Icinga around it. So take `modules/nagios_common...