[08:46:57] cp4021's persistent cache has finally been filled entirely (~1.5T) https://grafana.wikimedia.org/d/000000610/ats-instance-drilldown?panelId=5&fullscreen&orgId=1&from=1556528071712&to=1556592821041
[09:00:37] hello dear traffic team
[09:01:55] im reading https://wikitech.wikimedia.org/wiki/LVS#Deploy_your_changes because i need to deploy an LVS service today, and in point 5 it states to ask here about which lvs balancer is backup and active on the DC im going to deploy the service
[09:02:06] isnt there any other way to look it up that doesnt involve asking here?
[09:02:11] puppet?
[09:05:34] fsero: modules/lvs/manifests/configuration.pp
[09:05:41] oh, and hello!
[09:06:23] its active, passive ?
[09:06:41] yes: for each tuple, the first element is the active and the second is the backup
[09:07:07] great i'll update the docs, thanks ema
[09:07:08] double-checking by running dstat on the hosts and seeing which one has traffic won't hurt
[09:07:34] sure
[09:09:24] I usually check the BGP MED on pybal.conf on the affected LVSs
[09:09:34] that and be sure that pybal is running of course :)
[09:10:26] fsero: listen to vgutierrez, not to me
[09:10:32] (that's good advice in general)
[09:10:37] i disagree
[09:10:39] nope nope
[09:10:46] I'm the newbie here
[09:10:52] * vgutierrez hides
[09:10:55] it's good to hear him from time to time
[09:11:09] thats good advice ema
[09:11:43] bgp-med got it vgutierrez
[09:11:45] thanks!
[09:11:54] (ill include that too)
[09:11:56] lower the number, higher the priority :)
[09:12:27] our standard setup sets bgp-med 0 to the active LVS, and 100 to the passive one
[09:12:41] yep
[09:21:43] fsero: if you like parsing regular expressions, you can also `git grep profile::pybal::primary`
[09:22:15] that hiera setting is used to set bgp-med (0 if true, 100 otherwise)
[10:28:46] thanks ema
[10:29:02] i managed to do it with your help and the help of _joe_, and didnt earn a tshirt yet
[10:29:55] sorry about that
[10:30:07] you're gonna get other chances I'm sure
[10:32:46] I feel you fsero
[10:32:53] I want that t-shirt too
[10:36:18] ema, vgutierrez im going to amend those docs, do you want people to write here when they are restarting pybal or would just !log'ing the action suffice?
[10:36:40] i think !log is way better as we would keep the timestamp for incident response
[10:36:54] but dunno if what is written there is your preference
[10:38:12] yeah, !log is required for a pybal restart AFAIK
[10:42:52] mm no according to that wiki page
[10:42:54] :)
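(Going back to the earlier question of how to tell which LVS is active: the convention described above — the profile::pybal::primary hiera flag drives bgp-med, 0 if true and 100 otherwise, and the lower MED has the higher priority — boils down to the small sketch below. The host names and flag values are made up for illustration; this is just the logic spelled out, not a real tool.)

```python
# Illustration only: hypothetical hosts, with the hiera flag
# profile::pybal::primary mapped to bgp-med as described above.
def bgp_med(primary: bool) -> int:
    return 0 if primary else 100  # 0 if true, 100 otherwise

pybal_primary = {
    "lvs-a.example.wmnet": True,   # profile::pybal::primary: true
    "lvs-b.example.wmnet": False,  # profile::pybal::primary: false
}

meds = {host: bgp_med(flag) for host, flag in pybal_primary.items()}
active = min(meds, key=meds.get)   # lower MED wins, i.e. higher priority
backup = max(meds, key=meds.get)
print(f"active: {active} (med={meds[active]}), backup: {backup} (med={meds[backup]})")
```

(Double-checking against `git grep profile::pybal::primary` in the puppet repo and dstat on the hosts, as suggested above, remains the safer route.)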
[10:44:28] well... usually changes that can result in pages must be logged
[10:44:47] a pybal restart is definitely in that category
[10:45:29] fsero: I actually didn't document that step on purpose
[10:45:49] so someone from traffic will be aware that we are touching their toys
[10:46:28] IMHO we should be confident to deploy changes without anyone noticing and if something goes wrong people from traffic should always check SAL
[10:46:30] first
[10:46:34] or #operations
[10:47:39] <_joe_> +1
[10:48:00] <_joe_> to both of you :)
[10:48:29] <_joe_> I think it's good hygiene to give a heads up to traffic folks when you start merging such a change
[10:48:41] <_joe_> but we don't need to ask permission for the individual steps
[10:48:58] <_joe_> and !logging should be enough
[10:51:28] indeed
[10:56:24] jijiki: anyway the document was pretty good, so thanks for that :) we should just keep it improving i
[10:56:26] *it
[10:56:43] and luckily replace this procedure so it doesnt have so many manual knobs and changes
[10:56:47] * fsero crying inside
[10:57:25] hahaha
[11:48:47] gilles: hey! Did you manage to run the VTC tests?
[11:49:27] I did, but they all failed for me locally. I ran out of time yesterday to look into why. I'm on MacOS, though, so maybe not ideal to run varnish on
[11:49:43] I can try again from inside my Vagrant VM
[11:49:59] ah yes, you need our Debian packages :)
[11:50:20] see modules/varnish/files/tests/README
[11:51:35] will run the tests against your change after lunch, bbiab!
[11:54:25] they all fail on vagrant as well
[11:54:28] *** v1 0.1 debug| Error:\n
[11:54:28] *** v1 0.1 debug| Unknown parameter "vcl_path".\n
[11:55:41] I'll try again after installing the packages from our repo. on vagrant I build varnish from source
[11:57:55] still failing with that error
[11:58:30] context:
[11:58:35] *** v1 0.0 PID: 5090
[11:58:35] **** v1 0.0 macro def v1_pid=5090
[11:58:35] **** v1 0.0 macro def v1_name=/tmp/vtc.4903.68934a9a/v1
[11:58:35] *** v1 0.0 debug| Error:\n
[11:58:38] *** v1 0.0 debug| Unknown parameter "vcl_path".\n
[11:58:41] **** v1 0.0 STDOUT poll 0x10
[11:58:42] **** v1 0.0 CLIPOLL 1 0x0 0x8
[11:58:55] the macro tmp file is gone by the time the tests have completed, though, so I don't know if the syntax error is in that file
[12:03:35] 10Traffic, 10Operations, 10Performance-Team: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (10Gilles) p:05Normal→03Low
[13:16:11] gilles: give a try to this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/507316/
[13:35:03] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4022.ulsfo.wmnet'] ` The log can be...
[13:39:11] ema: I was looking at our old X-Cache parsing, and I guess local hits in eqiad are actually counted as remote based on the regex
[13:39:22] (which is fine, we just have to remember how to interpret the results)
[13:39:56] since e.g. "hit-local" is defined as a "hit,[^,]+$"
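(A quick illustration of what that regex does. The X-Cache values below are simplified, made-up examples — real entries also carry hit/N counters and the production parsing is more involved — only the pattern itself is taken from the discussion.)

```python
import re

# The pattern quoted above: "hit" recorded by the entry just before the
# last one, i.e. the backend cache immediately behind the frontend.
hit_local = re.compile(r'hit,[^,]+$')

# Simplified, hypothetical X-Cache values mapped to the expected outcome.
samples = {
    "cp1089 miss, cp1077 hit, cp4027 miss": True,   # local backend hit
    "cp1089 hit, cp1077 miss, cp4027 miss": False,  # hit further back, not "local"
    "cp1077 miss, cp4027 hit": False,               # frontend hit, not matched here
}

for value, expected in samples.items():
    matched = bool(hit_local.search(value))
    assert matched == expected, value
    print(f"{value!r}: hit-local={matched}")
```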
"hit-local" is defined as a "hit,[^,]+$" [13:40:24] in general the "-local" and "-remote" distinction seems dubious to me, those regexes [13:40:46] but either way, in eqiad they converge (it all gets counted as one or the other), and I guess for ats like ulsfo they will do the same [13:41:21] oh nevermind, I'm thinking about the line backwards [13:42:00] so yeah, eqiad's local hits should really register as local hits (not remote), and for the ATS case we should see remote hits (well remote anything) drop towards zero as conversion happens, leaving just front/local results like eqiad has. [13:43:10] I was trying to figure out what to expect on the now-poorly-named "varnish-caching" graphs: [13:43:13] https://grafana.wikimedia.org/d/000000500/varnish-caching?refresh=15m&panelId=8&fullscreen&orgId=1&var-cluster=cache_upload&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-24h&to=now [13:43:35] I think the local+remote lines there should converge as we go, ending up as it looks in eqiad [13:43:46] err local and overall, in that graph [13:45:30] ah yes, it's gonna be all either frontend or local w/ ATS [13:46:59] also interesting will be whether there's any significant shift in total cache hitrate there [13:47:42] but it will be hard to see it unless it's quite significant, since it will take another few days after they're all converted to stabilize, and by then so many other factors could shift around and blur comparison. [13:48:30] right [13:49:01] last-week-comparison can help a bit to see if/how much the reimages mess things up, but that's about it [13:49:11] https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?refresh=15m&orgId=1&var-cluster=upload&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [13:51:21] noteworthy: all upload cron restarts have changed when I've reimaged cp4021, and then again today. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/507267/ should ensure that all upcoming reimages leave crontabs as they are [13:52:30] ah good catch! [13:53:38] it will be nice to be rid of those, they cause so much noise and disruption [13:54:31] yup! [13:55:16] meanwhile I'm enjoying staring at pcc diffs removing ipsec. 
[13:56:51] :)
[14:09:40] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4022.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4022.ulsfo.wmnet'] `
[14:10:06] puppet was apparently hanging after:
[14:10:08] Notice: /Stage[main]/Main/Node[__node_regexp__cp4021-2.ulsfo.wmnet]/Interface::Add_ip6_mapped[main]/Exec[enp5s0f0_v6_token]/returns: executed successfully
[14:10:19] killed it, re-running manually
[14:10:55] yeah, not horribly surprising really :/
[14:11:30] I haven't reimaged a cache node in a while, but last I recall a lot of the meta-stuff like that had all kinds of interesting problems solvable most easily by re-running puppet over and over until success
[14:11:35] perhaps it would have gone forward eventually, I see now that the reimage of cp4021 took more than one hour https://phabricator.wikimedia.org/T219967#5130518
[14:11:45] heh
[14:12:45] randomly unrelated, I just noticed authdns1001's gdnsd uptime has crossed the ~180d barrier now
[14:13:15] (meaning for a half a year now stretching back to beta-versions, we've had only smooth takeovers on software upgrades and config changes, no actual hard stop->start cycles)
[14:14:19] https://phabricator.wikimedia.org/T222177 does this Varnish startup error look familiar?
[14:15:29] gilles: it needs linking to libmaxminddb
[14:15:44] oh I see you already noticed that heh
[14:16:07] that much is clear, but how is that done? and how come this isn't happening already? I assume that host is running mostly the same puppet code as production
[14:16:26] in production, our unit file for varnishd is: modules/varnish/templates/initscripts/varnish.systemd.erb
[14:16:39] I was a bit worried when I saw that the service hadn't been restarted since 2018...
[14:16:43] there's a chunk in there like this:
[14:16:47] <% if @vcl_config.fetch("enable_geoiplookup", false) -%>
[14:16:47] Environment="CC_COMMAND=exec gcc -std=gnu99 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall -pthread -fpic -shared -Wl,-x -L/usr/local/lib/ -o %%o %%s -lmaxminddb"
[14:16:51] <% else -%>
[14:16:54] Environment="CC_COMMAND=exec gcc -std=gnu99 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall -pthread -fpic -shared -Wl,-x -o %%o %%s"
[14:16:57] <% end -%>
[14:17:13] maybe the vcl_config check for enable_geoiplookup isn't working and is taking the second clause there?
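(To spell that suspicion out, here is the same conditional in Python form. The gcc flags are copied from the template chunk quoted above; the rest is only an illustration of how a missing hiera key silently falls back to the branch without -lmaxminddb, which would then leave the geoip VCL failing to link at startup.)

```python
# Python rendition of the ERB conditional above; %%o/%%s kept verbatim
# (systemd escaping of varnishd's %o/%s placeholders).
BASE = ("exec gcc -std=gnu99 -g -O2 -fstack-protector-strong -Wformat "
        "-Werror=format-security -Wall -pthread -fpic -shared -Wl,-x")

def cc_command(vcl_config: dict) -> str:
    # Mirrors @vcl_config.fetch("enable_geoiplookup", false): if the key is
    # absent, the maxminddb-less command is chosen without any warning.
    if vcl_config.get("enable_geoiplookup", False):
        return f"{BASE} -L/usr/local/lib/ -o %%o %%s -lmaxminddb"
    return f"{BASE} -o %%o %%s"

print(cc_command({"enable_geoiplookup": True}))  # prod text/canary roles
print(cc_command({}))                            # key missing, as suspected on the beta host
```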
[14:17:30] probably
[14:17:32] see what the post-templating systemd unit looks like on the end host
[14:17:44] yeah I'm going to look at that now
[14:18:50] Environment="CC_COMMAND=exec gcc -std=gnu99 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall -pthread -fpic -shared -Wl,-x -o %%o %%s"
[14:19:40] for us, that's set in:
[14:19:40] hieradata/role/common/cache/canary.yaml: enable_geoiplookup: true
[14:19:40] hieradata/role/common/cache/text.yaml: enable_geoiplookup: true
[14:19:51] I imagine it's in labs-specific (horizon or whatever) hieradata there
[14:20:18] those lines are underneath:
[14:20:19] profile::cache::varnish::frontend::fe_vcl_config
[14:21:30] I'll go check that in horizon
[14:28:18] uh oh, why do I feel like I went near that hiera recently
[14:29:15] hm we have profile::cache::varnish::backend::be_vcl_config
[14:29:42] ah, that's a recent addition
[14:29:44] as of 18 days ago
[14:29:51] the profile::cache::varnish::frontend::fe_vcl_config
[14:29:54] in prod
[14:30:20] ah yes! My fault!
[14:30:46] well this happens anytime we refactor hieradata basically
[14:30:58] "git grep" doesn't find horizon :P
[14:31:02] I'm saving the hiera change right now for the text puppet prefix
[14:31:23] should I also add these?
[14:31:29] admission_policy: 'none'
[14:31:29] # RTT is ~0, but 100ms is to accomodate small local hiccups, similar to
[14:31:29] # the +100 added in $::profile::cache::base::core_probe_timeout_ms
[14:31:31] varnish_probe_ms: 100
[14:31:33] keep: '1d'
[14:31:49] to the beta config
[14:32:22] and should the same things be added to cache-upload as well?
[14:32:55] already attempting to do it but horizon's puppet panel is slow as ever
[14:33:18] oh, there is this already:
[14:33:18] profile::cache::varnish::frontend::fe_vcl_config:
[14:33:19] enable_geoiplookup: true
[14:33:31] maybe someone beat me to it
[14:33:36] just above ^
[14:33:47] aha
[14:33:47] I just did
[14:33:53] gilles: yes, you should add those values too (admission_policy, varnish_probe_ms, keep)
[14:33:59] ok
[14:34:34] it does need varnish_probe_ms to run puppet it seems
[14:34:53] ema: and for cache-upload as well?
[14:35:46] gilles: yes
[14:35:53] not because they're actually particularly useful in labs, but because I don't think we set defaults
[14:37:24] cp4022 deployed w/ ATS, I'm manually removing the icinga downtimes to see if anything is broken (I don't think so)
[14:38:34] Krenair: is there a way to speed up the hosts picking up the hiera changes in horizon?
[14:38:47] i.e. to manually pull the changes
[14:38:58] as soon as it's saved in horizon it should be available to puppet
[14:39:04] just gotta run puppet
[14:39:26] slow part of the process is our custom horizon puppet UI code
[14:40:49] things appear to have been unfucked 👍
[14:40:54] nice
[14:41:05] thanks for sorting this out all
[14:44:04] gilles, hopefully at some point we'll consolidate the different sources of hiera used in labs, and I doubt the solution left will be the current horizon puppet stuff
[14:48:37] right now I'm imagining something new (yes I know, XKCD 927 etc., should probably get rid of an existing one or two before starting another) that involves git repositories (to keep history and so grep is easy) with ACLs for project admins
[14:50:20] none of the existing sources are satisfactory :/
[14:50:44] just put it all in dockerfiles then you don't need git or puppet :)
[14:58:19] ema: class { '::geoip::dev': } is defined for class profile::cache::varnish::frontend::text but not for upload, which resulted in the maxmind debian package being missing from the beta upload host (and needed by this new stuff)
[14:59:58] gilles: that's by design, there's no need for geoip on upload
[15:00:20] ema: I just asked and you said that I needed to add the same config to the upload host :)
[15:00:32] the profile::cache::varnish::frontend::fe_vcl_config stuff
[15:00:43] ah, it has different values for upload? duh
[15:01:02] gotta remove that option on horizon, then
[15:02:18] gilles: sorry for the confusion, enable_geoiplookup should be false on upload
[15:03:46] yeah, got it now
[15:20:47] 10netops, 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10cwdent) 05Open→03Resolved @Papaul thanks! Working :)
[16:03:39] so, cp4022 looks good! ready to pool
[16:05:20] :D
[16:08:14] 10Traffic, 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: SwiftMedia URL rewrite returns some 404s with wrong Content-Length - https://phabricator.wikimedia.org/T222071 (10fgiunchedi) I'm wondering if the underlying issue here (copying responses inside `rewrite.py`) could be the culprit...
[17:12:30] <_joe_> hi traffic people!
[17:12:54] <_joe_> how bad would it be if I needed to ban all urls for restbase for a wiki?
[17:13:17] <_joe_> I mean it's something that can be done, or would it create issues with varnish if we ban that many urls?
[17:13:38] <_joe_> although it's not /that/ much cached content to be honest
[18:17:29] depends on the wiki I guess
[18:18:10] but it's probably ok, maybe some increased RB load as a result, for a bit
[21:25:21] 10Traffic, 10Operations, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Andrew) *bump* I still need something like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ in order to get cloudvirt1024 online (and to pave the way towards...
[21:32:39] 10Domains, 10Traffic, 10Operations: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10Dzahn) p:05Triage→03Normal
[23:43:41] there's quite a bit of cronspam from cp5*
[23:43:42] puppetlabs.facter - Could not process routing table entry
[23:43:59] the pattern is: all are cp5*