[00:38:22] Domains, Traffic, Design-Research, Operations: Register wikipersonas.org and redirect URL - https://phabricator.wikimedia.org/T241944 (Dzahn)
[06:32:22] Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (DannyS712)
[08:50:32] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) a:ayounsi→BBlack That sounds like a good idea to me, @BBlack for a final opinion, and I can take care of it this Q if good to go.
[09:06:43] netops, Operations: Stale LibreNMS ports - https://phabricator.wikimedia.org/T242318 (ayounsi) Open→Resolved `root@cumin1001:~# for i in `mysql.py -hdb1135 -e "select table_name from information_schema.columns where column_name like 'device_id'" -BN`; do echo $i; mysql.py -hdb1135 librenms -e "de...
[11:11:18] Traffic, Operations, Patch-For-Review: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 (ema) p:Triage→High
[12:00:39] Traffic, Operations, fixcopyright.wikimedia.org: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141 (Vgutierrez) Open→Resolved a:Vgutierrez ` vgutierrez@mw1321:~$ curl --resolve fixcopyr...
[12:00:55] Traffic, Cleanup, Operations, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Vgutierrez)
[12:07:43] volans: ^^ I'd say ask your pybal question and we will try to figure it out ;)
[12:08:08] I'm no pybal expert, but still :)
[12:08:09] vgutierrez: sorry, already figured it out out of band with m.ark
[12:08:13] oh cool
[12:08:15] forgot to mention it here
[12:08:18] even better
[12:08:55] TL;DR: when adding new servers in etcd via hiera/conftool we now have a default weight of 0, because we removed the services that were defining a default weight per-service
[12:09:40] but pybal doesn't really support weight 0, and to support it it needs the major refactoring discussed in the past around FSM and such
[12:09:59] so if the operator, when pooling the server for the first time, just sets the pooled=yes value and doesn't change the weight
[12:10:11] we end up with a pooled=yes,weight=0 state.
[12:10:33] In this case pybal doesn't pass the weight option to ipvsadm (there if server.weight: cmd += ...)
[12:10:39] and ipvsadm has a default weight of 1
[12:11:06] so we end up with etcd having pooled=yes,weight=0 and IPVS having a pooled server with weight=1
[12:11:31] now the only bit I didn't investigate is why our icinga alert for discrepancy doesn't alert in this case
[12:11:34] should it?
[12:12:48] that's interesting
[12:13:08] cause https://wikitech.wikimedia.org/wiki/Conftool#Add_a_server_node_to_a_service doesn't mention that the weight must be set
[12:13:45] actually the whole page doesn't have an example that sets the weight :)
[12:13:57] I think _joe_ has some sort of script to handle the whole thing, maybe it just needs to be documented/publicized?
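The weight mismatch described above can be illustrated with a minimal Python sketch. This is not pybal's actual code: the function names, the argument layout, and the addresses are hypothetical, and only the truthiness check (`if weight:`) and the ipvsadm default weight of 1 come from the discussion. The `-a`, `-t`, `-r` and `-w` options are the standard ipvsadm flags for adding a real server to a TCP service with a weight.

```python
def build_ipvsadm_add_args(service, server, weight):
    """Build an 'ipvsadm -a' argument string (hypothetical helper, not pybal's API)."""
    cmd = f"-a -t {service} -r {server}"
    # A truthiness check treats weight=0 like "no weight configured",
    # so the -w option is silently dropped for weight 0.
    if weight:
        cmd += f" -w {weight}"
    return cmd


def effective_ipvs_weight(cmd, ipvsadm_default=1):
    """Return the weight IPVS would use: the -w value, else ipvsadm's default."""
    parts = cmd.split()
    if "-w" in parts:
        return int(parts[parts.index("-w") + 1])
    return ipvsadm_default


# etcd state pooled=yes,weight=0 versus what IPVS would end up with.
etcd_weight = 0
args = build_ipvsadm_add_args("10.0.0.1:443", "10.0.0.2:443", etcd_weight)
ipvs_weight = effective_ipvs_weight(args)
print(args)
print(f"etcd weight={etcd_weight}, IPVS weight={ipvs_weight}")
```

With weight 10 the option is passed through and both sides agree; with weight 0 the option is omitted, which is the etcd/IPVS discrepancy reported in the channel.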
[12:26:00] <_joe_> vgutierrez: I created a script, that is puppetized, that should be part of the procedure to initialize a cache node
[12:26:08] <_joe_> exactly to set the weight correctly
[12:26:37] does it work for any cluster or is it cache-specific?
[12:27:06] <_joe_> it works for any cluster but I think it's not included/configured elsewhere
[12:27:16] k
[12:27:29] <_joe_> the name of the script is "initialize"
[12:28:02] <_joe_> ./modules/conftool/manifests/scripts/initialize.pp
[13:37:55] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (BBlack) +1 from me, this was one of the many things we made the ganeti clusters for :)
[14:18:14] Traffic, netops, Operations, Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (ayounsi) a:BBlack→ayounsi
[15:21:35] Traffic, Operations, Wikimedia-Logstash, observability, and 2 others: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (ema)
[15:28:48] netops, Operations, ops-eqiad: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[15:59:35] how do we do simple ban/purge on ats-be?
[15:59:48] I see some docs on wikitech, but defining new lua code for each case temporarily? :P
[15:59:50] bblack, ema: by backporting PR 4028 from upstream (https://github.com/apache/trafficserver/pull/4028) we can now tell connect and TTFB timeouts apart; this has been backported in 8.0.5-1wm12, which is currently being tested on cp4026, 4032, 5006 and 5012.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/564711 sets those timeouts accordingly in those 4 hosts
[16:00:24] https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Forcing_a_cache_miss_(similar_to_ban)
[16:04:46] bblack: hmm I'm not aware of any other way, but maybe ema is
[16:07:36] I'm playing the part of "unknowing non-traffic person who made some site change involving redirects and wants $minor_microsite_01's whole domain wiped from cache"
[16:07:53] which before we'd do with a ban on foo.wikimedia.org or whatever
[16:10:58] bblack: one-off purges are done as in varnish (purgeList.php), bans don't exist
[16:11:07] so yeah that lua is the closest we can get to a ban AFAIK
[16:15:09] right in this case the request is to purge all of a given microsite: ^transparency.wikimedia.org/.*
[16:15:15] as a one-shot purge
[16:15:44] I see that the Cache Inspector utility apparently supports Regex Delete/Invalidate
[16:15:45] so we define lua for that, then remove it after ... some days of missing all caching to make sure all the natural entries expired?
[16:15:54] https://docs.trafficserver.apache.org/en/8.1.x/admin-guide/storage/index.en.html#using-the-cache-inspector-utility
[16:16:57] "Only one administrator should delete and invalidate cache entries from the Cache Inspector at any point in time. Changes made by multiple administrators at the same time can lead to unpredictable results."
[16:17:06] and it's an HTTP UI tool on a single instance? :P
[16:17:28] yeah, I'm suspicious. :) It would need evaluation and testing, right now I'd say lua and remove it after expirations
[16:19:25] now I'm wondering if we should do the same for fixcopyright.wm.o
[16:20:34] re: T239141
[16:20:35] T239141: Redirect all traffic for fixcopyright.wikimedia.org to https://policy.wikimedia.org/policy-landing/copyright/ - https://phabricator.wikimedia.org/T239141
[16:31:03] vgutierrez: so with 564711 we decrease connect_timeout on the tls instances from 3m to 3s?
That seems reasonable :)
[16:31:16] right
[16:31:24] 3s on ats-tls, 10s on ats-backend
[16:32:11] maybe those 10s for ats-backend are quite conservative.. but taking into account that we are talking about TLS connections and cross-DC.. I went for a safe value
[16:32:59] I forgot, do we have metrics for these timeouts?
[16:34:08] so till now we couldn't tell between connect and TTFB timeouts, so that's a little bit tricky
[16:34:31] right, but do we have metrics about $whatever_its_called_now_timeout
[16:34:43] rate of 502s could be a nice one
[16:35:37] I don't see any other way.. considering that ATS doesn't provide those specific metrics
[16:36:20] alright, it would seem useful to have those counters to me
[16:37:25] 502: it's an indicator, but there's confusion if the origin for some reason also returns 502
[16:38:10] anyways 564711 looks good to me, +1. Do we need to restart or are the parameters reloadable?
[16:38:56] * vgutierrez checking
[16:39:43] a reload is enough
[16:39:49] very well
[18:32:49] Traffic, Operations: ATS strict round robin parent select policy doesn't work as expected - https://phabricator.wikimedia.org/T242778 (Vgutierrez)
[23:41:50] Traffic, Cleanup, Operations, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF) Stalled→Open
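The difference between a connect timeout and a TTFB (time to first byte) timeout discussed above can be sketched with plain Python sockets. This is only an illustration of the two failure modes, not how ATS implements or reports them; the function name and the returned labels are hypothetical.

```python
import socket


def classify_origin_probe(host, port, connect_timeout, ttfb_timeout, request=b""):
    """Connect to host:port and report which phase failed, if any (illustrative only)."""
    try:
        # Phase 1: TCP connection establishment, bounded by connect_timeout.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except socket.timeout:
        return "connect_timeout"
    except OSError:
        return "connect_error"
    with sock:
        # Phase 2: wait for the first response byte, bounded by ttfb_timeout.
        sock.settimeout(ttfb_timeout)
        try:
            if request:
                sock.sendall(request)
            first = sock.recv(1)
        except socket.timeout:
            return "ttfb_timeout"
        except OSError:
            return "read_error"
    return "ok" if first else "closed"
```

A call such as `classify_origin_probe("origin.example", 443, connect_timeout=3, ttfb_timeout=10)` (hostname made up, values borrowed from the 3s and 10s mentioned above) would return `"connect_timeout"` when the connection cannot be established in time and `"ttfb_timeout"` when the peer accepts the connection but sends nothing. Counting each label separately gives the kind of per-cause counter that a plain 502 rate cannot provide.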