[02:02:59] <wikibugs_>	 Traffic, Operations, TechCom-RFC, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (CCicalese_WMF)
[02:22:27] <wikibugs_>	 Traffic, Operations, Performance-Team: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (Imarlier) Open>Resolved Confirmed that WPT agents are resolving to the codfw edge.  Given that this means that they're goin...
[04:44:05] <wikibugs_>	 Traffic, Operations, TechCom-RFC, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Joe) We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they...
[08:09:16] <vgutierrez>	 ema: bblack gave me a +1 here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450020/ are you ok with merging this?
[08:12:43] <ema>	 vgutierrez: looks good, yes
[08:13:27] <vgutierrez>	 ack, thx
[08:15:34] <ema>	 legoktm: thanks for the review!
[08:15:55] <legoktm>	 np :)
[08:20:52] <ema>	 legoktm: oh, but that's a jessie image
[08:21:10] <ema>	 lua-busted isn't available in jessie, sadness
[08:23:14] <legoktm>	 ema: we could either backport it in apt.wm.o, move the image over to stretch (has to be done at some point anyways), or worst case, create a second docker image based on stretch and run two jenkins jobs
[08:23:58] <ema>	 legoktm: alternatively, we can install luarocks (which is in jessie) and install busted with it
[08:24:17] <ema>	 that's not very different from what we do with pip after all
[08:24:19] <legoktm>	 options! :)
[09:42:30] <ema>	 mutante: hey, that should not happen (of course!) :)
[09:42:46] <ema>	 mutante: if you see the value staying relatively low just let us know please
[09:43:06] <ema>	 mutante: if it goes up to the sky very quickly it might be a good idea to depool the host
[10:20:44] <_joe_>	 ema: if it's a docker container go that way
[10:21:10] <_joe_>	 the general consensus about docker images is "dump whatever shit into it, it's a container!"
[10:21:38] <_joe_>	 or you can backport lua-busted <g>
[10:23:37] <ema>	 _joe_: so containers *are* great after all!
[13:23:18] <wikibugs_>	 Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5005.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[14:10:44] <wikibugs_>	 Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5005.eqsin.wmnet'] ```  and were **ALL** successful.
[14:18:25] <ema>	 bblack: cp108[79] are currently affected by issues similar (varnishlog-wise) to those we see after a few days of runtime, but I've just restarted them
[14:19:05] <ema>	 --  Fetch_Body     3 length stream
[14:19:05] <ema>	 --  ExpKill        LRU_Fail
[14:19:05] <ema>	 --  FetchError     Could not get storage
[14:22:43] <ema>	 cp1089's backend depooled as it kept on failing fetches
[14:23:23] <ema>	 see https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1534254052664&to=1534256587399&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[14:25:10] <bblack>	 ok
[14:25:14] <ema>	 errors logged on cp1089.eqiad.wmnet:/root/503.log 
[14:26:03] <ema>	 on both there's been at least one child crash soon after restart
[14:26:26] <bblack>	 "those we see after a few days of runtime" you mean in the new-eqiad case specifically right?
[14:26:44] <ema>	 yes
[14:27:25] <ema>	 interestingly all the "Could not get storage" happen on transient memory
[14:28:59] <bblack>	 right I was going to say, this is clearly a miss->pass case without warming
[14:29:04] <bblack>	 so it's not about the nvme I guess
[14:29:30] <bblack>	 it even says it's malloc transient
[14:29:41] <bblack>	 how does it LRU_Fail on transient? :P
[14:30:08] <bblack>	 I mean, I guess this is a consequence of us putting specific transient limits in place on be's
[14:30:20] <bblack>	 but still, it should be able to evict to make room :P
[14:30:24] <ema>	 yeah 
[14:30:45] <bblack>	 looking at the 10,000 graphs of varnishy things on that box
[14:30:53] <wikibugs_>	 netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (Cmjohnson) @ayounsi I added sfp-t's to asw2-a5-eqiad for the new server in that rack. For the remainder of the 10G servers in racks 2/4/6 do you want me to run cross connects to asw2-a5?...
[14:31:17] <ema>	 bblack: I was currently looking at those and found that we need graph 10,001 (varnish-be transient storage usage)
[14:31:21] <bblack>	 heh
[14:31:26] <bblack>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-3h&to=now&panelId=21&fullscreen&var-server=cp1089&var-datasource=eqiad%20prometheus%2Fops
[14:31:39] <bblack>	 ^ is that graph right, with the markers for like 16 backend restarts?
[14:32:36] <bblack>	 yeah it was crashing
[14:32:57] <bblack>	 Aug 14 14:17:40 cp1089 varnishd[174266]: Child (176838) Panic at: Tue, 14 Aug 2018 14:17:40 GMT
[14:33:00] <bblack>	                                          Missing errorhandling code in vrb_pull(), cache/cache_req_body.c line 74:
[14:33:03] <bblack>	                                            Condition((STV_NewObject(req->wrk, req->body_oc, stv, 8)) != 0) not true.
[14:34:02] <bblack>	 and all the crash-trace reqs are POST to api.php
[14:34:22] <bblack>	 maybe there's some intersection of the fixup for vcl-switching + POST and miss->pass?
[14:34:35] <bblack>	 (or vclswitch+POST+out-of-transient)
[14:35:22] <bblack>	 yeah, so
[14:35:52] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&panelId=69&fullscreen&from=1534254170493&to=1534256641459
[14:36:01] <ema>	 backend transient storage usage ^
[14:36:01] <bblack>	 (a) eqiad will have the most be transient usage probably, since it's the backend-most where we do miss->pass a lot.
[14:36:29] <bblack>	 (b) adding cache_misc into cache_text probably bumped transient usage in the be's in general, because of all the pass-mode backends there?
[14:36:53] <bblack>	 (c) maybe now we're seeing the limit case due to both of the above and need to raise our be transient limits?
[14:37:32] <bblack>	 I mean I'm guessing LRU_Fail is because all the transient storage (or most anyways) is being taken up by live passes that can't be evicted
[14:37:47] <bblack>	 (including long-term entries perhaps, for etherpad/phab websockets?)
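The guess above (a capped store fails to allocate once everything in it is in use and therefore cannot be evicted) can be illustrated with a toy model. This is only a sketch of the reasoning, not Varnish's storage code: `BoundedStore`, `allocate`, `release` and the `pinned` flag are hypothetical names invented for the illustration.

```python
from collections import OrderedDict


class BoundedStore:
    """Toy size-capped store with LRU eviction. NOT Varnish's implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        # name -> (size, pinned); insertion order doubles as LRU order.
        self.objects = OrderedDict()

    def allocate(self, name, size, pinned=False):
        """Store an object, evicting the oldest unpinned entries as needed.

        Returns False when room cannot be made, which is the situation the
        chat compares to LRU_Fail / "Could not get storage".
        """
        while self.used + size > self.capacity:
            victim = next(
                (n for n, (_, is_pinned) in self.objects.items() if not is_pinned),
                None,
            )
            if victim is None:
                return False  # every remaining object is in use
            victim_size, _ = self.objects.pop(victim)
            self.used -= victim_size
        self.objects[name] = (size, pinned)
        self.used += size
        return True

    def release(self, name):
        """Drop an object, e.g. when an in-flight pass completes."""
        size, _ = self.objects.pop(name)
        self.used -= size
```

With a capacity of 2 and two pinned objects of size 1, a third allocation fails until one of the pinned objects is released. Raising or removing the cap, as proposed later in the conversation, avoids the failure at the cost of unbounded memory use.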
[14:38:07] <ema>	 definitely need to raise be_transient_gb
[14:38:36] <ema>	 (2G is the usage peak in the graphs)
[14:38:47] <bblack>	 we never did set up one for upload I guess
[14:38:54] <ema>	 luckily!
[14:39:32] <bblack>	 maybe we should just flip it back to zero and roll through some eqiad restarts (there's almost nothing in actual caches there anyways)
[14:39:37] <bblack>	 and see what the natural numbers look like without caps
[14:39:40] <ema>	 +1, on it
[14:39:50] <bblack>	 thanks!
[14:42:00] <ema>	 bblack: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452680
[14:45:35] <ema>	 grr, we don't have a hiera default
[14:45:42] <ema>	 will set to 0 instead
[14:47:07] <bblack>	 oh I was about to say then how is upload working?
[14:47:13] <bblack>	 but then I realized the whole setting is text-only :)
[14:48:00] <mutante>	 ema: ok! thanks
[14:53:13] <ema>	 Overuse of Transient storage with hit-for-miss objects: https://github.com/varnishcache/varnish-cache/issues/2654
[14:53:17] <ema>	 that seems interesting
[14:57:26] <bblack>	 lol
[14:57:30] <ema>	 :)
[14:57:45] <bblack>	 sounds like hfm objects are storing the whole body in transient :P
[14:57:53] <bblack>	 (based on those test stats)
[14:59:14] <ema>	 I'll begin with the eqiad backend restarts (starting with nodes with crashed child)
[14:59:31] <bblack>	 ok thanks
[14:59:51] <bblack>	 I'm going into meeting mode soon, for a couple hours
[15:01:03] <ema>	 k
[15:20:29] <ema>	 _joe_: so the luarocks thing seems ok? https://gerrit.wikimedia.org/r/#/c/integration/config/+/452634/
[15:21:35] <_joe_>	 ema: yes, but you also need to bump the version in the changelog
[15:21:51] <_joe_>	 if this is the end of your work on that, I mean
[15:22:01] <ema>	 I certainly hope so
[15:22:11] <_joe_>	 it's in debian format :)
[15:26:01] <ema>	 _joe_: I see you've specified UNRELEASED in the distro field instead of wikimedia for 0.3.2, does anyone/anything care?
[15:26:28] <_joe_>	 ema: I don't think anyone does
[15:26:33] <_joe_>	 not for now at least
[15:26:55] <_joe_>	 in the future we could think to control sub-namespaces via control file tags maybe
[15:29:38] <ema>	 _joe_: CR updated with new changelog entry
[15:31:26] <moritzm>	 reprepro will complain, but it can be overridden
[15:32:03] <_joe_>	 moritzm: this is a docker image :P
[15:32:31] <_joe_>	 ema: we should really update that container to be based on stretch
[15:32:34] <moritzm>	 ah, ok :-)
[15:32:50] <_joe_>	 and btw, we should start rebuilding the base images regularly
[15:38:04] <ema>	 bblack: fun times! https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=varnish-text&var-layer=backend&from=1534251299800&to=1534260945025
[15:38:31] <ema>	 bblack: it looks like getting rid of the cap helped :)
[15:57:53] <wikibugs_>	 netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (Cmjohnson) @ayounsi I pre-cabled everything. The lvs cross connects only need to move racks to the new switch. We probably need to do those 1 at a time, because downtime may be close to 1m...
[15:59:34] <ema>	 bblack: I'd say there's no need to restart the backends elsewhere to apply the change, we can wait for weekly restarts?
[16:07:00] <bblack>	 ema: +1
[16:16:55] <wikibugs_>	 Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (Reedy) ``` 2018-08-14 16:01:08,650 [docker-pkg-build] INFO - Generated dockerfile for docker-registry.discovery.wmnet/releng/operations-puppet:0.3.3: FROM docker-registry.d...
[16:57:42] <wikibugs_>	 Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (ema) Thanks @Reedy! The `luarocks` part fails with:  ``` Warning: Failed searching manifest: Failed extracting manifest file Installing https://raw.githubusercontent.com/ro...
[16:59:36] <wikibugs_>	 Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (Reedy) Yay, dependencies.  Feel free to bump the package again and add unzip and I can try again
[17:01:48] <wikibugs_>	 Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (ema) >>! In T199720#4502223, @Reedy wrote: > Yay, dependencies.  Yeah. Note that the version of `luarocks` in stretch does depend on `unzip`, it's the jessie version that d...
[17:05:29] <wikibugs_>	 Traffic, Operations, Patch-For-Review: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (Reedy) At least `unzip` isn't a heavyweight dependency :)
[17:14:24] <ema>	 bblack: I've updated https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451838/ to avoid ugly templates :)
[17:15:47] <ema>	 also instead of constantly getting the value of 'proxy.node.hostname' it seemed better to figure it out once and for all at init time
[17:17:08] <ema>	 using /bin/hostname though given that the context of ts.mgmt.get_string is do_remap or later (ie it's not available yet at init)
[17:18:47] <ema>	 now there's a little mystery I found when it comes to reloading trafficserver for lua scripts modifications: at first glance it seems that the script is actually really reloaded only if remap.config has changed (any change, even just a newline)  
[17:20:38] <ema>	 trafficserver logs this in that case: `NOTE: User has changed config file remap.config`
[17:22:33] <ema>	 so yeah, further investigations ahead!
[17:24:39] <ema>	 well touch remap.config does get the job done but I guess it's too early in our trafficserver adventure to start with hacks? :)
[17:30:05] <bblack>	 heh
[17:30:23] <bblack>	 so reload ignores lua-only changes?
[17:30:33] <bblack>	 interesting!
[17:31:40] <ema>	 correct, I guess the reason being that trafficserver does not keep track of the modification of lua scripts that have been passed as @param to remap rules
[17:32:05] <ema>	 it just checks if remap has changed, and only reloads it if that's the case
[17:32:51] <bblack>	 ok
[17:33:03] <bblack>	 it's not too awflu a hack to touch the remap files that reference the luas I guess
[17:33:07] <bblack>	 *awful
[17:38:16] <ema>	 confirmed on #traffic-server:
[17:38:20] <ema>	 <@amc> Yes. The file changed logic isn't very smart, so it doesn't know the lua script matters.
[17:38:34] <ema>	 touch it is then!
[17:39:17] <bblack>	 it's an annoying problem to solve in general, I've seen it other places (the general-case problem of N input files which might include others recursively and which may change with each reload, any or all of which should trigger a reload if watching mtimes)
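The `touch` workaround agreed on above can be sketched as a small mtime comparison: if any Lua script referenced by a remap file is newer than the remap file itself, bump the remap file's mtime so the next reload notices a change. This is a minimal sketch under assumptions: the function names and paths are hypothetical, the dependency list is flat (no recursive includes), and it is not the logic of the actual puppet change.

```python
import os


def newest_mtime(paths):
    """Return the latest modification time (in ns) among the given files."""
    return max(os.stat(p).st_mtime_ns for p in paths)


def touch_if_stale(config_path, dependency_paths):
    """Bump config_path's mtime when any dependency is newer than it.

    Returns True if the config file was touched, False otherwise.
    """
    config_mtime = os.stat(config_path).st_mtime_ns
    if newest_mtime(dependency_paths) > config_mtime:
        os.utime(config_path)  # sets atime/mtime to now, like `touch`
        return True
    return False
```

A caller would run something like `touch_if_stale("remap.config", ["normalize.lua"])` (hypothetical file names) before asking trafficserver to reload its configuration. Handling files that include other files would require walking the include graph first, which is the harder general case described in the last message.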
[17:54:20] <wikibugs_>	 Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) Yesterday we had just under 20,000 requests for the copyright prot...
[17:56:30] <wikibugs_>	 netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi)
[17:57:17] <wikibugs_>	 netops, Operations, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi) Task description updated with Chris's info so we have everything in 1 place. Switch ports configured accordingly.
[18:31:24] <ema>	 bblack: CR amended with actual reloads upon lua changes
[18:35:40] <bblack>	 ok
[18:35:51] <bblack>	 the meetings, they never end
[18:36:12] <ema>	 still? :)
[18:36:51] <bblack>	 I had a short break, but next one is coming!
[18:37:31] <ema>	 good luck!
[18:37:42] <bblack>	 after that and poking at the Alexa changes, I'll probably want some tech work to go hide in later.  Maybe I'll work on the cache_misc decom commits.  Deleting code is the best :)
[18:38:47] <ema>	 https://twitter.com/compscifact/status/761589015632052224
[20:02:58] <wikibugs_>	 Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (matmarex) For reference, according to this thread, Polish Wikipedia was affe...