[00:02:01] 10netops, 10Operations: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4050109 (10Platonides) In fact, those timeouts appear even on the early hops. The ticket also includes a successful tcp traceroute, and no explicit mention of timeouts. T...
[02:16:49] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052301 (10ayounsi)
[02:26:05] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052304 (10ayounsi) Updated the task description for small pacific islands. The methodology is the following. Made possible by the fact tha...
[02:47:25] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052309 (10ayounsi)
[05:37:28] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052382 (10ayounsi)
[05:54:12] 10Traffic, 10Operations, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4052383 (10ayounsi)
[08:25:32] 10netops, 10Operations: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4052490 (10Samtar) @ayounsi I'm happy to pass on comments and suggestions to the OTRS ticket. @Platonides Please feel free to update the task description to something mor...
[10:33:21] I've moved all our packages from jessie-wikimedia/experimental to jessie-wikimedia/main and uploaded varnish 5.1.3-1wm4 (upstream extrachance fixes) to main
[10:35:01] note that 5.1.3-1wm4 removes support for `-p gethdr_extrachance`, so we'll need to get rid of that from the systemd units before restarting varnish{,-frontend}
[10:35:54] otherwise they will stop but not start, which is not great :)
[10:36:09] pu upgraded
[14:19:12] <_joe_> how long do we cache content on cache-misc?
[14:22:24] <_joe_> context is: docker-registry.wikimedia.org/v2/python3-build-stretch/manifests/latest is cached since quite a long time, it seems
[14:22:34] <_joe_> and the cache won't expire
[14:29:04] hieradata/role/common/cache/misc.yaml
[14:29:04] 307: - default_ttl=3600
[14:29:04] 309: - default_ttl=3600
[14:29:26] 3600 is the default ttl for both fe & be
[14:30:03] of course that could be overridden by the origin server
[14:31:00] (disclaimer: this is a best effort answer from the newbie guy)
[14:40:41] so yeah, that default_ttl is set as a varnish runtime parameter (see systemctl cat varnish)
[14:41:17] however we also cap TTLs in VCL in case the ttl set by the origin server is higher than a certain value
[14:41:38] modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb:419: if (beresp.ttl > <%= @vcl_config.fetch("ttl_cap", "1d") %>) {
[14:41:44] modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb:420: set beresp.ttl = <%= @vcl_config.fetch("ttl_cap", "1d") %>;
[14:42:33] <_joe_> vgutierrez: ok thanks
[14:42:52] <_joe_> so I guess in this case the caching is 1 hour
[14:43:00] <_joe_> which is way too high for those data
[14:43:12] <_joe_> I need to add a header to the response from nginx on the docker registry
[14:44:43] _joe_: that's the right way to control how long responses are cached, yes
[14:44:46] ema: our cache doesn't leverage the etag header to know if the content is still fresh?
[14:44:59] <_joe_> vgutierrez: nope
[14:45:02] ack
[14:45:03] <_joe_> or that would be ok
[14:45:13] yup..
that's why I was asking
[14:45:19] I saw the Etag sha256 header there
[14:45:21] <_joe_> also, we mangle badly the admittedly weird etag that the docker registry emits
[14:45:31] <_joe_> anyways
[14:50:27] vgutierrez: so, varnish does support conditional requests
[14:50:33] see for example modules/varnish/files/tests/upload/03-backend-if-cached.vtc
[14:51:39] I guess we're not talking about conditional requests now though
[14:52:57] <_joe_> ema: I tried to add a -H 'If-None-Match: lalala'
[14:55:50] $ curl -v -H 'If-None-Match: W/"sha256:ea9365bb6056d56644789585446d23cbbed38d1380fb96dbfe3f5c5028846a67"' https://docker-registry.wikimedia.org/v2/python3-build-stretch/manifests/latest 2>&1 | grep '< HTTP/2'
[14:55:56] < HTTP/2 304
[15:55:17] BTW, regarding pybal verbosity
[15:56:09] what is flooding us is the "New enabled server" messages
[15:56:29] example: Feb 22 09:06:31 lvs1006 pybal[26025]: [search_9200] INFO: New enabled server elastic1040.eqiad.wmnet, weight 10
[15:56:43] vgutierrez@lvs1006:~$ grep "09:06:31" pybal.log.15 | grep "New" | wc -l
[15:56:43] 121
[15:56:43] vgutierrez@lvs1006:~$ grep "09:06:30" pybal.log.15 | grep "New" | wc -l
[15:56:44] 311
[16:00:06] * volans wonders why we log 'new' servers so often
[16:01:11] if we're talking about coordinator.py:233
[16:01:26] https://github.com/wikimedia/PyBal/blob/master/pybal/coordinator.py#L232-L234
[16:01:30] that little bastard :)
[16:01:31] right
[16:01:49] volans: well... when pybal restarts, every server is a new server :)
[16:02:13] you mentioned verbosity, not verbosity at restart ;)
[16:02:25] so I was not understanding... sorry, missing context :D
[16:02:52] yey..
pybal is quite quiet after a restart
[16:03:00] but the restart is noisy as hell
[16:03:12] so much in fact that we trigger systemd/journald rate limiting
[16:03:15] and funny stuff happens
[16:07:02] as ema suggested during the traffic weekly meeting I think that some summarization is in order to avoid that flooding
[16:07:10] vgutierrez: my 2 cents, what we could do is to detect that it is a restart (empty existing config) and log differently, maybe just one line per cluster with the number of enabled/disabled hosts as info and the detailed log as debug
[16:07:25] indeed
[16:07:34] I was thinking exactly that
[16:07:42] logging X servers enabled, Y disabled
[16:08:17] I'll code this after getting even more oriented than yesterday
[16:09:08] rotfl, sounds good to me, you can use Logger.log(level, message) to avoid repeating the same line and log it with debug or info based on the situation
[16:12:17] * volans lends a compass to vgutierrez to help the orientation
[16:12:31] :)
[16:18:27] what
[16:18:35] bending pybal to the wishes of systemd
[16:18:44] * vgutierrez hides behind ema
[16:18:47] * mark thunders
[16:18:53] no, to readable logs
[16:18:55] :-P
[16:19:00] * volans runs away
[16:20:20] but yeah.. +1 to readable logs
[16:20:36] they're very very readable
[16:20:39] you just need time to do so!
[16:20:57] lol
[16:21:03] rofl
[16:21:36] your production environment is down, please give me 10 minutes to go through the logs VS hey.. it's down and I've got a pretty concise log saying what's wrong
[16:21:53] option 2 for me plz
[16:22:39] what do you mean. it's never down!
[16:23:00] oh, then we don't need logs at all
[16:23:06] so let's disable them
[16:23:10] indeed
[16:23:23] we used to send all logs to /dev/null
[16:25:04] (+1 to more concise logs)
[16:26:45] (+1 to valentin's revenge on systemd log eating also)
[16:30:09] be careful, journald also adds rate limiting to /dev/null !!1!
[16:30:47] and who rate-limits the rate limiter?
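The restart-time summarization discussed at [16:07:10]–[16:07:42] could look roughly like the sketch below. This is a hypothetical illustration, not actual PyBal code: the function name `log_server_changes`, its signature, and the logger name are all invented; only the idea (per-server lines at DEBUG during a restart, one INFO summary per pool) comes from the conversation.

```python
import logging

# Hypothetical logger; PyBal's real logging setup differs.
logger = logging.getLogger("pybal.coordinator")

def log_server_changes(new_servers, is_restart):
    """Log one summary line on restart, per-server lines otherwise.

    new_servers: iterable of (hostname, enabled) tuples.
    is_restart:  True when the existing config was empty, i.e. every
                 server looks 'new' only because pybal just started.
    """
    # On restart, demote the per-server lines to DEBUG and emit a
    # single INFO summary, so journald's rate limiting isn't tripped.
    per_server_level = logging.DEBUG if is_restart else logging.INFO
    enabled = disabled = 0
    for hostname, is_enabled in new_servers:
        if is_enabled:
            enabled += 1
        else:
            disabled += 1
        # Logger.log(level, msg) lets us pick the level per situation,
        # as suggested at [16:09:08].
        logger.log(per_server_level, "New %s server %s",
                   "enabled" if is_enabled else "disabled", hostname)
    if is_restart:
        logger.info("Initial config: %d servers enabled, %d disabled",
                    enabled, disabled)
    return enabled, disabled
```

With this shape, a pool of 121 servers produces one INFO line at startup instead of 121, while `journalctl -p debug` still shows the detail.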
[16:35:13] what was really wrong in all that is that journald did not say anything about the rate limiting itself
[16:35:55] you want it to log that it is logging too much? ofc it rate-limited itself too
[16:36:06] to not harm your delicate disks
[16:36:09] * volans hides
[16:36:44] I want it to do something sane like saying (once): the last 342 messages have been dropped :)
[16:38:11] heh
[16:38:18] Mar 13 15:43:07 lvs1006 systemd-journal[252]: Suppressed 75 messages from /system.slice/pybal.service
[16:38:21] Mar 15 13:18:38 lvs1006 systemd-journal[252]: Suppressed 189 messages from /system.slice/pybal.service
[16:38:29] (from sudo journalctl -u systemd-journald.service)
[16:39:07] those lines should have ended up in pybal's logs though I think
[16:44:36] yup
[16:44:40] they ended up there
[16:45:10] that's how I discovered that the bug was not a bug
[16:48:46] vgutierrez: there=systemd-journal?
[16:48:57] there=pybal's log
[16:49:09] aka /var/log/pybal.log
[16:49:40] ooh
[16:50:01] aka `journalctl -u pybal.service`?
[16:50:33] it's not the same, right?
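The "Suppressed N messages" notices shown at [16:38:18]/[16:38:21] are easy to tally per unit; a small sketch (the function name and the idea of feeding it `journalctl -u systemd-journald.service` output are mine, not something that exists in the repo):

```python
import re

# Matches journald's suppression notices, e.g.:
#   Mar 13 15:43:07 lvs1006 systemd-journal[252]: Suppressed 75 messages from /system.slice/pybal.service
SUPPRESSED_RE = re.compile(r"Suppressed (\d+) messages from (\S+)")

def count_suppressed(lines):
    """Total up dropped-message counts per systemd slice."""
    totals = {}
    for line in lines:
        m = SUPPRESSED_RE.search(line)
        if m:
            count, unit = int(m.group(1)), m.group(2)
            totals[unit] = totals.get(unit, 0) + count
    return totals
```

Piping `journalctl -u systemd-journald.service` through something like this gives a quick per-service view of how much journald has silently thrown away.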
[16:51:05] on journalctl -u pybal the BGP: ESTABLISHED line wasn't there
[16:51:18] but on /var/log/pybal.log.15.gz it was
[16:51:36] vgutierrez@lvs1006:~$ grep ESTABLISHED pybal.log.15
[16:51:36] Feb 22 09:04:21 lvs1006 pybal[24481]: [bgp] INFO: State is now: ESTABLISHED
[16:51:39] Feb 22 09:06:30 lvs1006 pybal[26025]: [bgp] INFO: State is now: ESTABLISHED
[16:51:42] Feb 22 09:13:15 lvs1006 pybal[30239]: [bgp] INFO: State is now: ESTABLISHED
[16:51:45] Feb 22 13:17:03 lvs1006 pybal[11183]: [bgp] INFO: State is now: ESTABLISHED
[16:51:48] those :D
[16:51:50] sure, that I know
[16:51:56] my question is whether the "Suppressed N messages" log lines ended up in pybal.service's journal
[16:52:02] ohh
[16:52:11] I don't think so
[16:52:20] it's a journald log message, not a pybal service log message
[16:52:58] right, but it's a journald message saying that it has dropped pybal.service log entries
[16:53:11] which would be good to know when going through pybal logs
[16:53:13] vgutierrez@lvs1006:~$ grep Suppressed pybal.log.15 | wc -l
[16:53:14] 0
[16:53:15] :_(
[16:54:03] pybal.log is written by rsyslog though, not by journald (I think)
[16:54:23] that's a good thing
[16:54:32] cause journald rate limitting doesn't apply there
[16:54:40] yes
[16:55:03] s/limitting/limiting/g
[16:56:29] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4037747 (10ema) The most surprising thing about journald's rate limiting, in the context of the issue that triggered this task, is that by looking at pybal logs we...
[16:58:36] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4053749 (10MoritzMuehlenhoff) >>! In T189290#4053740, @ema wrote: > It would have been much more useful to get such messages into `journalctl -u pybal.service`'s ou...
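T189290 above is about tuning exactly these limits. On the journald side the relevant knobs are `RateLimitInterval` and `RateLimitBurst` in journald.conf (renamed `RateLimitIntervalSec` in later systemd versions); per-unit rate limiting only arrived much later (systemd 240, `LogRateLimitIntervalSec=`). The fragment below is purely illustrative and is not the actual change from the task; the values are made up:

```ini
# /etc/systemd/journald.conf.d/ratelimit.conf -- illustrative values only.
# Raise the suppression threshold so a pybal restart (a few hundred
# INFO lines in one second) doesn't trip it. Defaults on older systemd
# are roughly 1000 messages per 30s interval, and these settings are
# global, not per-service.
[Journal]
RateLimitInterval=30s
RateLimitBurst=5000
```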
[17:01:04] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4053757 (10Vgutierrez) yup... or at least a warning on `systemctl status pybal` similar to the one that you get when the journal has been rotated
[17:45:07] so I had a crazy idea re: "connections piling up"
[17:45:31] what if all of a sudden something triggers a bunch of "Connection: close" responses?
[17:46:24] because that's a very good way to make varnish close the connection to its origin server, triggering new connection establishments to serve incoming request
[17:46:28] *requests
[17:49:31] also, any specific reason for sending Connection: close in our varnishcheck requests?
[17:54:42] hmmm wait, a Connection: close header on a client request gets varnish to close its connection to the origin server?
[17:55:22] I guess that varnishcheck is just being polite
[17:55:34] HTTP/1.1 defines the "close" connection option for the sender to signal that the connection will be closed after completion of the response.
[17:55:41] from: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
[17:56:05] also.. HTTP/1.1 applications that do not support persistent connections MUST include the "close" connection option in every message.
[18:00:39] I guess you mean a Connection: close on the origin<-->varnish connection :)
[18:05:07] 1) responses with 'Connection: close' from an origin server cause varnish to close its connection (of course). Are we getting any/lots of those at times?
[18:05:26] 2) varnishcheck is not using persistent connections. Should it?
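The persistence rules quoted from RFC 2616 above boil down to a small decision: HTTP/1.1 connections are persistent unless "close" is signalled, while HTTP/1.0 closes unless "keep-alive" is explicitly requested. A sketch of that rule (the function is an illustration of the spec, not code from varnish or varnishcheck):

```python
def may_reuse(http_version, connection_header):
    """Decide whether a connection may be kept alive after a message.

    Per RFC 2616: HTTP/1.1 defaults to persistent connections unless
    the 'close' option is present; HTTP/1.0 defaults to closing unless
    'keep-alive' is explicitly signalled.
    """
    # Connection is a comma-separated list of case-insensitive tokens.
    tokens = {t.strip().lower()
              for t in (connection_header or "").split(",") if t.strip()}
    if "close" in tokens:
        return False
    if http_version == "HTTP/1.1":
        return True
    return "keep-alive" in tokens  # HTTP/1.0 needs explicit opt-in
```

By this rule, varnishcheck sending `Connection: close` on every request is indeed "being polite": it forces a fresh TCP connection per health check instead of reusing one.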
[18:05:26] :)
[18:09:03] (those are two separate questions, the 2nd one being asked out of curiosity, the 1st one would be good to know in the context of "the piling-up thing")
[18:10:36] gotta go o/
[18:10:43] regarding 1, backend_reuse should decrease dramatically if that happens
[18:11:06] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=27&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=now-2d&to=now
[18:11:10] that graph
[18:25:24] 10netops, 10Operations, 10fundraising-tech-ops, 10hardware-requests, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#4054146 (10RobH) 05Open>03Resolved
[18:25:27] 10netops, 10Operations, 10fundraising-tech-ops, 10ops-codfw, 10Patch-For-Review: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#4054147 (10RobH)
[18:25:40] 10netops, 10Operations, 10fundraising-tech-ops, 10hardware-requests, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3624959 (10RobH)
[18:28:15] 10netops, 10Operations, 10fundraising-tech-ops, 10hardware-requests, 10ops-codfw: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#4054160 (10RobH) So we cannot actually blank the port description on the scs once it's been set; I set them each to 'empty port #' with their...
[19:27:00] also..
it could be interesting to graph backend_reuse / backend_conn, to see the ratio between the number of connections and the number of reused connections
[20:29:35] so I've added backend_reuse to varnish-failed-fetches and replaced the connection sum with the rate of successful connections
[20:29:41] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521004528691&to=1521023239862&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-layer=backend
[20:30:37] at 7:38 the successful connection rate goes up to 60/s (normally ~2/s)
[20:31:36] and reuse goes down to 230/s (normally ~400/s)
[20:33:47] interesting data
[20:33:49] gotta go again :)
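The ratio proposed at [19:27:00] can be computed from the per-second rates of the two varnishstat counters mentioned in the conversation (MAIN.backend_reuse and MAIN.backend_conn); the normalization into a reuse fraction is my choice of presentation, not something from the dashboards:

```python
def reuse_ratio(backend_reuse_rate, backend_conn_rate):
    """Fraction of backend fetches served over a reused connection.

    backend_reuse_rate: rate of fetches that reused an idle connection
                        (MAIN.backend_reuse, per second)
    backend_conn_rate:  rate of new backend connections opened
                        (MAIN.backend_conn, per second)

    Each backend fetch either reuses an existing connection or opens a
    new one, so reuse / (reuse + new) approximates the reuse fraction.
    """
    total = backend_reuse_rate + backend_conn_rate
    if total == 0:
        return 0.0
    return backend_reuse_rate / total
```

Plugging in the figures from the incident above: the normal ~400/s reuse vs ~2/s new connections gives a ratio around 0.995, while the 230/s vs 60/s seen at 7:38 drops it to roughly 0.79, which makes the connection-churn event easy to spot on a graph.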