[02:02:59] Traffic, Operations, TechCom-RFC, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (CCicalese_WMF)
[02:22:27] Traffic, Operations, Performance-Team: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (Imarlier) Open>Resolved Confirmed that WPT agents are resolving to the codfw edge. Given that this means that they're goin...
[04:44:05] Traffic, Operations, TechCom-RFC, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Joe) We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they...
[08:09:16] ema: bblack gave me a +1 here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450020/ are you ok with merging this?
[08:12:43] vgutierrez: looks good, yes
[08:13:27] ack, thx
[08:15:34] legoktm: thanks for the review!
[08:15:55] np :)
[08:20:52] legoktm: oh, but that's a jessie image
[08:21:10] lua-busted isn't available in jessie, sadness
[08:23:14] ema: we could either backport it in apt.wm.o, move the image over to stretch (has to be done at some point anyways), or worst case, create a second docker image based on stretch and run two jenkins jobs
[08:23:58] legoktm: alternatively, we can install luarocks (which is in jessie) and install busted with it
[08:24:17] that's not very different from what we do with pip after all
[08:24:19] options! :)
[09:42:30] mutante: hey, that should not happen (of course!)
:)
[09:42:46] mutante: if you see the value staying relatively low just let us know please
[09:43:06] mutante: if it goes up to the sky very quickly it might be a good idea to depool the host
[10:20:44] <_joe_> ema: if it's a docker container go that way
[10:21:10] <_joe_> the general consensus about docker images is "dump whatever shit into it, it's a container!"
[10:21:38] <_joe_> or you can backport lua-busted
[10:23:37] _joe_: so containers *are* great after all!
[13:23:18] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5005.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[14:10:44] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5005.eqsin.wmnet'] ``` and were **ALL** successful.
[14:18:25] bblack: cp108[79] are currently affected by issues similar (varnishlog-wise) to those we see after a few days of runtime, but I've just restarted them
[14:19:05] -- Fetch_Body 3 length stream
[14:19:05] -- ExpKill LRU_Fail
[14:19:05] -- FetchError Could not get storage
[14:22:43] cp1089's backend depooled as it kept on failing fetches
[14:23:23] see https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1534254052664&to=1534256587399&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[14:25:10] ok
[14:25:14] errors logged on cp1089.eqiad.wmnet:/root/503.log
[14:26:03] on both there's been at least one child crash soon after restart
[14:26:26] "those we see after a few days of runtime" you mean in the new-eqiad case specifically right?
[14:26:44] yes
[14:27:25] interestingly all the "Could not get storage" happen on transient memory
[14:28:59] right I was going to say, this is clearly a miss->pass case without warming
[14:29:04] so it's not about the nvme I guess
[14:29:30] it even says it's malloc transient
[14:29:41] how does it LRU_Fail on transient? :P
[14:30:08] I mean, I guess this is a consequence of us putting specific transient limits in place on be's
[14:30:20] but still, it should be able to evict to make room :P
[14:30:24] yeah
[14:30:45] looking at the 10,000 graphs of varnishy things on that box
[14:30:53] netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (Cmjohnson) @ayounsi I added sfp-t's to asw2-a5-eqiad for the new server in that rack. For the remainder of the 10G servers in rack's 2/4/6 do you want me to run cross connects to asw2-a5?...
[14:31:17] bblack: I was currently looking at those and found that we need graph 10,001 (varnish-be transient storage usage)
[14:31:21] heh
[14:31:26] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-3h&to=now&panelId=21&fullscreen&var-server=cp1089&var-datasource=eqiad%20prometheus%2Fops
[14:31:39] ^ is that graph right, with the markers for like 16 backend restarts?
[14:32:36] yeah it was crashing
[14:32:57] Aug 14 14:17:40 cp1089 varnishd[174266]: Child (176838) Panic at: Tue, 14 Aug 2018 14:17:40 GMT
[14:33:00] Missing errorhandling code in vrb_pull(), cache/cache_req_body.c line 74:
[14:33:03] Condition((STV_NewObject(req->wrk, req->body_oc, stv, 8)) != 0) not true.
[14:34:02] and all the crash-trace reqs are POST to api.php
[14:34:22] maybe there's some intersection of the fixup for vcl-switching + POST and miss->pass?
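For context on the "specific transient limits" mentioned at 14:30:08: stock varnishd keeps transient objects in an unbounded malloc store unless a storage backend named `Transient` is declared with `-s`. The Python sketch below shows how such a cap could be rendered into command-line arguments. The helper name, the `transient_gb` parameter, and the convention that 0 means "no cap" are assumptions for illustration; this is not the puppet code discussed in the log.

```python
from __future__ import annotations


def transient_storage_args(transient_gb: int) -> list[str]:
    """Return varnishd arguments that cap transient storage.

    Hypothetical helper. Assumption: 0 means "no cap", in which case no
    Transient store is declared and varnishd keeps its default unbounded
    malloc store for transient objects.
    """
    if transient_gb < 0:
        raise ValueError("transient_gb must be >= 0")
    if transient_gb == 0:
        return []
    # A storage backend named "Transient" is used by varnishd for
    # short-lived and uncacheable (pass) objects.
    return ["-s", f"Transient=malloc,{transient_gb}G"]


if __name__ == "__main__":
    print(transient_storage_args(2))
    print(transient_storage_args(0))
```

With a cap in place, objects held by in-flight passes cannot be evicted, which is one way a full transient store could end in the `LRU_Fail` and `Could not get storage` entries quoted earlier.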
[14:34:35] (or vclswitch+POST+out-of-transient)
[14:35:22] yeah, so
[14:35:52] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&panelId=69&fullscreen&from=1534254170493&to=1534256641459
[14:36:01] backend transient storage usage ^
[14:36:01] (a) eqiad will have the most be transient usage probably, since it's the backend-most where we do miss->pass a lot.
[14:36:29] (b) adding cache_misc into cache_text probably bumped transient usage in the be's in general, because of all the pass-mode backends there?
[14:36:53] (c) maybe now we're seeing the limit case due to both of the above and need to raise our be transient limits?
[14:37:32] I mean I'm guessing LRU_Fail is because all the transient storage (or most anyways) is being taken up by live passes that can't be evicted
[14:37:47] (including long-term entries perhaps, for etherpad/phab websockets?)
[14:38:07] definitely need to raise be_transient_gb
[14:38:36] (2G is the usage peak in the graphs)
[14:38:47] we never did set up one for upload I guess
[14:38:54] luckily!
[14:39:32] maybe we should just flip it back to zero and roll through some eqiad restarts (there's almost nothing in actual caches there anyways)
[14:39:37] and see what the natural numbers look like without caps
[14:39:40] +1, on it
[14:39:50] thanks!
[14:42:00] bblack: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452680
[14:45:35] grr, we don't have a hiera default
[14:45:42] will set to 0 instead
[14:47:07] oh I was about to say then how is upload working?
[14:47:13] but then I realized the whole setting is text-only :)
[14:48:00] ema: ok!
thanks
[14:53:13] Overuse of Transient storage with hit-for-miss objects: https://github.com/varnishcache/varnish-cache/issues/2654
[14:53:17] that seems interesting
[14:57:26] lol
[14:57:30] :)
[14:57:45] sounds like hfm objects are storing the whole body in transient :P
[14:57:53] (based on those test stats)
[14:59:14] I'll begin with the eqiad backend restarts (starting with nodes with crashed child)
[14:59:31] ok thanks
[14:59:51] I'm going into meeting mode soon, for a couple hours
[15:01:03] k
[15:20:29] _joe_: so the luarocks thing seems ok? https://gerrit.wikimedia.org/r/#/c/integration/config/+/452634/
[15:21:35] <_joe_> ema: yes, but you also need to bump the version in the changelog
[15:21:51] <_joe_> if this is the end of your work on that, I mean
[15:22:01] I certainly hope so
[15:22:11] <_joe_> it's in debian format :)
[15:26:01] _joe_: I see you've specified UNRELEASED in the distro field instead of wikimedia for 0.3.2, does anyone/anything care?
[15:26:28] <_joe_> ema: I don't think anyone does
[15:26:33] <_joe_> not for now at least
[15:26:55] <_joe_> in the future we could think to control sub-namespaces via control file tags maybe
[15:29:38] _joe_: CR updated with new changelog entry
[15:31:26] reprepro will complain, but it can be overridden
[15:32:03] <_joe_> moritzm: this is a docker image :P
[15:32:31] <_joe_> ema: we should really update that container to be based on stretch
[15:32:34] ah, ok :-)
[15:32:50] <_joe_> and btw, we should start rebuilding the base images regularly
[15:38:04] bblack: fun times!
https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=varnish-text&var-layer=backend&from=1534251299800&to=1534260945025
[15:38:31] bblack: it looks like getting rid of the cap helped :)
[15:57:53] netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (Cmjohnson) @ayounsi I pre-cabled everything. The lvs cross connects only need to move racks to the new switch. We probably need to do those 1 at a time, because downtime may be close to 1m...
[15:59:34] bblack: I'd say there's no need to restart the backends elsewhere to apply the change, we can wait for weekly restarts?
[16:07:00] ema: +1
[16:16:55] Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (Reedy) ``` 2018-08-14 16:01:08,650 [docker-pkg-build] INFO - Generated dockerfile for docker-registry.discovery.wmnet/releng/operations-puppet:0.3.3: FROM docker-registry.d...
[16:57:42] Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (ema) Thanks @Reedy! The `luarocks` part fails with: ``` Warning: Failed searching manifest: Failed extracting manifest file Installing https://raw.githubusercontent.com/ro...
[16:59:36] Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (Reedy) Yay, dependencies. Feel free to bump the package again and add unzip and I can try again
[17:01:48] Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (ema) >>! In T199720#4502223, @Reedy wrote: > Yay, dependencies. Yeah. Note that the version of `luarocks` in stretch does depend on `unzip`, it's the jessie version that d...
[17:05:29] Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (Reedy) Least `unzip` isn't a heavyweight dependency :)
[17:14:24] bblack: I've updated https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451838/ to avoid ugly templates :)
[17:15:47] also instead of constantly getting the value of 'proxy.node.hostname' it seemed better to figure it out once and for all at init time
[17:17:08] using /bin/hostname though given that the context of ts.mgmt.get_string is do_remap or later (ie it's not available yet at init)
[17:18:47] now there's a little mystery I found when it comes to reloading trafficserver for lua scripts modifications: at first glance it seems that the script is actually really reloaded only if remap.config has changed (any change, even just a newline)
[17:20:38] trafficserver logs this in that case: `NOTE: User has changed config file remap.config`
[17:22:33] so yeah, further investigations ahead!
[17:24:39] well touch remap.config does get the job done but I guess it's too early in our trafficserver adventure to start with hacks? :)
[17:30:05] heh
[17:30:23] so reload ignores lua-only changes?
[17:30:33] interesting!
[17:31:40] correct, I guess the reason being that trafficserver does not keep track of the modification of lua scripts that have been passed as @param to remap rules
[17:32:05] it just checks if remap has changed, and only reloads it if that's the case
[17:32:51] ok
[17:33:03] it's not too awflu a hack to touch the remap files that reference the luas I guess
[17:33:07] *awful
[17:38:16] confirmed on #traffic-server:
[17:38:20] <@amc> Yes. The file changed logic isn't very smart, so it doesn't know the lua script matters.
[17:38:34] touch it is then!
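The "touch remap.config" workaround agreed on above can be sketched in Python as follows. This is a minimal illustration, not the change that was merged: it assumes Lua scripts are referenced in remap.config as `@pparam=<absolute path>.lua`, and the function names are hypothetical.

```python
from __future__ import annotations

import os
import re
from pathlib import Path

# Assumption: Lua scripts appear in remap rules as "@pparam=<path>.lua".
LUA_PARAM = re.compile(r"@pparam=(\S+\.lua)\b")


def lua_scripts(remap_config: Path) -> list[Path]:
    """Return the Lua script paths referenced by a remap.config file."""
    scripts: list[Path] = []
    for line in remap_config.read_text().splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out rules
        scripts.extend(Path(match) for match in LUA_PARAM.findall(line))
    return scripts


def touch_if_lua_changed(remap_config: Path) -> bool:
    """Touch remap.config when any referenced Lua script is newer than it.

    Returns True if the file was touched, so the caller knows that a
    subsequent trafficserver config reload will pick up the Lua change.
    """
    remap_mtime = remap_config.stat().st_mtime_ns
    for script in lua_scripts(remap_config):
        if script.exists() and script.stat().st_mtime_ns > remap_mtime:
            os.utime(remap_config)  # set atime and mtime to "now"
            return True
    return False
```

Running `touch_if_lua_changed(Path("/etc/trafficserver/remap.config"))` before a reload would bump the file's mtime only when a referenced script changed; the path here is an example, not taken from the log.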
[17:39:17] it's an annoying problem to solve in general, I've seen it other places (the general-case problem of N input files which might include others recursively and which may change with each reload, and which or all should trigger reload if watching mtimes)
[17:54:20] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) Yesterday we had just under 20,000 requests for the copyright prot...
[17:56:30] netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi)
[17:57:17] netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi) Task description updated with Chris's info so we have everything in 1 place. Switch ports configured accordingly.
[18:31:24] bblack: CR amended with actual reloads upon lua changes
[18:35:40] ok
[18:35:51] the meetings, they never end
[18:36:12] still? :)
[18:36:51] I had a short break, but next one is coming!
[18:37:31] good luck!
[18:37:42] after that and poking at the Alexa changes, I'll probably want some tech work to go hide in later. Maybe I'll work on the cache_misc decom commits. Deleting code is the best :)
[18:38:47] https://twitter.com/compscifact/status/761589015632052224
[20:02:58] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (matmarex) For reference, according to this thread, Polish Wikipedia was affe...
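The general-case problem described at 17:39:17 (N input files that may include each other recursively, with the set itself changing between reloads) can be handled by recomputing the include graph on every check and comparing a snapshot of paths and mtimes. The Python sketch below assumes a made-up `include <path>` directive with relative paths resolved against the including file; the directive syntax and all function names are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def input_files(root: Path, seen: set[Path] | None = None) -> set[Path]:
    """Collect root plus every file it includes, recursively.

    Assumption: includes are lines of the form "include <path>". The
    `seen` set also protects against include cycles.
    """
    seen = set() if seen is None else seen
    root = root.resolve()
    if root in seen or not root.is_file():
        return seen
    seen.add(root)
    for line in root.read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "include":
            input_files(root.parent / parts[1].strip(), seen)
    return seen


def fingerprint(root: Path) -> frozenset[tuple[str, int]]:
    """Snapshot (path, mtime) pairs for the current include graph."""
    return frozenset((str(p), p.stat().st_mtime_ns) for p in input_files(root))


def needs_reload(root: Path, previous: frozenset[tuple[str, int]]) -> bool:
    """True when any input file was added, removed, or modified."""
    return fingerprint(root) != previous
```

Because the fingerprint is rebuilt from the root each time, a file that joins or leaves the include graph changes the snapshot just as an mtime bump does, which a fixed watch list would miss.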