[07:54:24] moritzm: cp reboots for kernel updates done, FYI
[08:14:47] ack, great!
[08:15:47] Traffic, Operations, Patch-For-Review: Enable numa_networking on all caches - https://phabricator.wikimedia.org/T193865 (ema) Open>Resolved a:ema
[08:45:29] Traffic, Operations, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (ema) >>! In T164609#4483549, @Joe wrote: > Sometimes we get 503 peaks from a `cache_misc` application like phabricator or gerrit; knowing the origin of the 5xxs in broad c...
[08:46:48] anybody around for a quick sanity check on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452321/ ?
[08:47:18] <_joe_> ema: I can take a look
[08:47:22] ty!
[08:48:21] <_joe_> easy enough
[08:48:29] :)
[08:48:32] <_joe_> ema: how good are you with apache configs?
[08:48:37] <_joe_> yes, this is a trick question
[08:48:44] it's Monday morning
[08:48:58] but I can try my best!
[08:49:13] <_joe_> so that I can unload some of the review duties from volans
[08:49:27] <_joe_> I have a boatload of scary apache patches in the making
[08:49:45] let's divide and conquer
[08:49:49] <_joe_> but once they're done, everyone should be able to change things in our config
[08:49:53] nice
[08:51:53] <_joe_> I'm converting all wikis to use https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/mediawiki/manifests/web/vhost.pp
[08:52:24] <_joe_> (the interesting part is the template ofc)
[08:52:56] I'll make a coffee and then dust my apache pocket reference
[09:07:54] [nitpick] as volans would say: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/mediawiki/manifests/web/vhost.pp#13
[09:08:07] s/Wether/Whether/
[09:08:21] good boy :-P
[09:08:41] (google image search for Wether suggested)
[09:08:42] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (mark)
[09:08:46] morning!
[09:09:17] <_joe_> hi mark
[09:09:21] o/
[09:14:12] morning mark
[11:06:29] ema: I've also just doublechecked that all cp servers picked up the SSBD microcode correctly (with the known exception of cp1008 which has a CPU no longer supported by Intel)
[11:29:21] nice
[12:54:14] Traffic, Operations, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (BBlack) >>! In T164609#4497768, @ema wrote: >>>! In T164609#4483549, @Joe wrote: >> Sometimes we get 503 peaks from a `cache_misc` application like phabricator or gerrit;...
[13:03:43] bblack: in case you like the smell of lua in the morning! https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451838/
[13:05:54] is it fresh lua?
[13:07:50] for some definition of fresh, yes
[13:10:04] * ema was working on a pun involving puppet refreshes but decided to stop
[13:10:50] hahahha
[13:12:01] vgutierrez: your input welcome too of course
[13:37:04] ema: any chance there's an easier way to get the (cached?) hostname value directly?
[13:37:47] We may eventually have to bow to pragmatism, but my idealistic side would like to hold off as long as possible on ".lua.erb" hell. It would be nice if the lua code can just be code, and any data comes in some other / more-direct way.
[13:38:54] e.g. lua might have access to uname(2), although maybe some API offered by ATS itself has some cached notion that doesn't take a syscall every time just in case the hostname changed.
[14:07:49] <_joe_> if I might
[14:08:01] <_joe_> you should have a specific file that might be .lua.erb
[14:08:06] <_joe_> where you fill variables
[14:08:14] <_joe_> a config file basically
[14:08:24] yeah that might make sense if we need it
[14:08:31] <_joe_> and the rest should just use those variables
[14:08:56] if possible I'd like to avoid getting into a pattern where "config change means having a daemon reloading templated code" sort of thing
[14:09:19] <_joe_> not sure I get what you mean
[14:09:25] e.g. if we really do need to puppet-template some data variables, maybe it's just a yaml file that the lua code has a thread watching mtime on, rather than templated code + ATS-level code-reload
[14:09:42] <_joe_> can lua do that in ats?
[14:09:47] I have no idea! :)
[14:10:08] but I assume we have most of the general language facilities, I just don't know what API hooks are present for implementing e.g. side-threads or watchers of any kind
[14:10:12] <_joe_> I don't see why you should write a yaml rather than a lua file from puppet, it saves you some local work
[14:10:31] <_joe_> I doubt the lua engine in ats allows multithreading
[14:10:35] because it's Just Data, not code implying actually recompiling/reloading all the lua code
[14:10:40] <_joe_> but I might be wrong :)
[14:11:05] (or someone getting clever later and putting snippets of actual lua code in the supposedly-data-only lua file)
[14:11:15] <_joe_> yeah, that mostly
[14:11:52] but I assume there's some costs/risks to recompiling/loading the lua in ATS, that wouldn't be present for a simple datafile watcher.
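A minimal sketch of the mtime-watching idea from 14:09:25, written in Python rather than ATS Lua, and using JSON instead of YAML so it needs only the standard library. The class name `ReloadingData`, its `refresh()` method and the file layout are all hypothetical; nothing here is an ATS API, and whether the ATS Lua engine can drive something like this is exactly the open question in the discussion.

```python
import json
import os


class ReloadingData:
    """Hold parsed data from a file; re-read it only when its mtime changes.

    Hypothetical helper for illustration: the data file is "Just Data",
    so a change never requires reloading any code.
    """

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None  # mtime of the last successfully parsed version
        self.data = {}
        self.refresh()

    def refresh(self):
        """Re-parse the file if it changed on disk; return True if reloaded."""
        try:
            mtime_ns = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return False  # keep serving the last good data
        if mtime_ns == self._mtime_ns:
            return False  # unchanged: one stat() call, no parsing
        try:
            with open(self.path, encoding="utf-8") as fh:
                parsed = json.load(fh)
        except (OSError, json.JSONDecodeError):
            return False  # half-written or broken file: keep last good data
        self.data = parsed
        self._mtime_ns = mtime_ns
        return True
```

Something periodic would call `refresh()` (a timer, or a check at the start of each request) and read values from `.data`. The cost on the unchanged path is a single `stat()`, and a broken file never replaces the last good data.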
[14:12:19] or maybe ATS has some other general facility for this using its "map" or something similar (where we can have ATS itself load a key:value file that's available to Lua code)
[14:12:30] <_joe_> yeah I just doubt it wouldn't be harder/more buggy to implement unless ATS has it prebaked
[14:12:37] <_joe_> that :P
[14:13:49] I just don't like the idea that our ATS Lua code (which is really code, just like any application/service) should be deployed via puppet templating insertions. That seems needlessly complex in other ways.
[14:13:53] code is code, data is data
[14:14:08] <_joe_> so you want hiera for ats
[14:14:09] and the whole language-layering thing, now ruby+lua
[14:14:13] <_joe_> :D
[14:14:23] and config is config, I should've added
[14:19:15] ah-hah
[14:19:31] https://docs.trafficserver.apache.org/en/latest/admin-guide/plugins/lua.en.html#ts-schedule
[14:19:59] I bet you can do a start-time hook that uses that ts.schedule() to run a periodic co-routine that could reload some data
[14:20:25] oh even better, right below that
[14:20:43] i was gonna say :P
[14:20:44] ts.config_(int|float|string)_get()
[14:20:54] "did you mean the configs right below that" ;p
[14:20:58] assuming they let us define our own config vars that the core doesn't care about
[14:20:59] +options
[14:21:59] could be worse, IIRC there used to be a .py.erb that when run would spit out an lua script to be executed by pdns-recursor
[14:22:13] there's also ts.mgmt.get_(int|float|string)()
[14:22:22] I'm not really sure how "management" params differ from config
[14:23:22] currently is worse: we have situations now where puppet variables flesh out an erb ruby template which generates a go template, which confd interprets as go code generating VCL, which is compiled to C and loaded into Varnish
[14:23:40] ew
[14:27:16] "compiled" is kind of generous too heh
[14:27:34] VCL is more like a macro language that's transliterated into C, then the C is compiled
[14:28:29] (but hey, at least we
don't have inline-C in the midst of the VCL that's generated by go->ruby->puppet)
[14:29:15] somehow to make it worse, you should have an IRC channel as transport in the mix
[14:29:40] oh good idea
[14:29:56] it's not like it's unprecedented ;)
[14:30:20] we've sent executable code over IRC before?
[14:30:26] !cache-mangler: inject to vcl_recv: req.http.host ~ "<%= @scope.function(....) %>"'
[14:30:49] it would be the ultimate expression of chatops!
[14:45:13] vgutierrez: https://grafana.wikimedia.org/dashboard/db/tls-ciphersuite-explorer?panelId=3&fullscreen&orgId=1&from=1534142336602&to=1534147240075
[14:45:42] so apparently the timeline from "realtime stat goes zero" to "grafana finally says zero" is .... 12 days + ~4h ? :)
[14:47:20] err 3h, but whatever. I was really expecting to be exactly 7d or 14d based on some moving average somewhere
[14:47:52] maybe we have multiple different moving averages at multiple layers here.
[14:54:15] maybe mildly interesting just to see what others are doing, re: generating local CAs and test certs, maybe nifty tooling for some artificial testing: https://github.com/FiloSottile/mkcert
[15:10:05] bblack: yup I've seen that
[15:10:48] regarding certificates...
our current x509 module handles all of that already :)
[15:11:46] and for that kind of thing, let's encrypt pebble uses MiniCA: https://github.com/jsha/minica
[15:40:23] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Papaul) on board NICs disable
[15:45:17] bblack: fun, varnish-be crashed on cp1089
[15:45:38] bblack: see cp1089:~ema/panic.log
[15:46:00] in brighter news: ts.mgmt.get_string('proxy.node.hostname') does the right thing
[15:59:24] a very small POST request heh
[16:00:13] ema: any idea if ts.mgmt.get_*() or ts.config_*_get() can be fed arbitrary config strings of our choosing (as in mgmt or config inputs can have arbitrary keys not defined by the ATS developers, that we might puppetize in?)
[16:01:15] (meeting)
[16:02:16] bblack: I wouldn't think you can, but there's a way to pass arguments to a lua script
[16:08:40] bblack: and guess what, an example of that is called sethost.lua :)
[16:08:55] https://docs.trafficserver.apache.org/en/7.1.x/admin-guide/plugins/ts_lua.en.html#synopsis
[16:22:13] Traffic, Operations, ops-codfw, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Papaul)
[16:34:09] <_joe_> I hope docs.trafficserver.apache.org is not served via ATS
[16:34:14] <_joe_> it's hanging for me
[16:34:15] <_joe_> :P
[16:34:55] funny, because also this link gave me a 503 back on Friday and it's still 503ing: https://varnish-cache.org/lists/pipermail/varnish-dev/2013-April/007544.html
[16:35:13] Traffic, Operations, Performance-Team: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (Imarlier) p:Normal>High a:Imarlier
[16:57:33] Traffic, Operations, Performance-Team: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (BBlack) The
incident report needs some deeper updates (will work on that today), but it's almost certainly related to https://wikit...
[17:16:37] Traffic, Operations, Performance-Team: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (Imarlier) @BBlack That was my assumption as well. I want to verify that the agents in AWS are, in fact, routing to codfw, but that...
[17:30:30] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi)
[17:33:30] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi)
[17:33:54] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (jcrespo) So db1066 is the s2 eqiad master active, so any downtime there means the s2 wikis go read only: https://noc.wikimedia.org/db.php#tabs-2
[17:33:58] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (fgiunchedi) re: ms-be1040 it can be moved back to the old switch any time
[17:34:06] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi)
[20:28:25] netops, Operations, ops-eqiad: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (ayounsi) Open>Resolved
[21:11:15] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi)
[21:15:28] netops, Operations: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi) @Cmjohnson Could you pre-cable the hosts that will terminate on asw2-a5(ex4500) ? Not unplug anything, but have the fibers ready.
[22:45:07] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp1089&service=Varnish+backend+child+restarted
[22:46:25] Varnish backend child restarted on cp1089 (6 gt 3).
if we see that what is appropriate action? stop all and start again?