[04:03:08] Traffic, Operations, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4048652 (BBlack) See also the peering information tracked in T186835
[07:57:37] cp3040 again going crazy: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=now-3h&to=now
[09:28:16] so, varnishslowlog is now showing some really useful stuff varnish-wise!
[09:28:33] those waitinglist timings are very interesting
[09:40:43] I don't think we currently have a cache_type field to filter on (text vs upload)
[09:44:25] vs misc! query.wikidata.org also shows up a few times
[09:52:50] <_joe_> ema: query.wikidata.org is expected to be slow
[09:53:02] <_joe_> it's an interface for even very complex graph queries
[09:54:34] yeah
[09:54:59] that's why we should be able to say, for example, "only cache_text slow logs please"
[09:55:32] I'm currently filtering on request-Host:en.wikipedia.org, but that's not optimal of course
[09:55:58] I think I've managed to get enough kibana-fu to be able to save my current search
[09:56:25] https://logstash.wikimedia.org/goto/56d8dca029b9dc5dd00b09038a033369
[09:57:06] it's unclear to me why the URL would need to be crazy long by default, and then you can choose to convert it into a decent one
[09:57:51] "We recommend sharing shortened snapshot URLs for maximum compatibility. Internet Explorer has URL length restrictions, and some wiki and markup parsers don't do well with the full-length version of the snapshot URL, but the short URL should work great."
[10:21:22] this is helpful! https://www.digitalocean.com/community/tutorials/how-to-use-kibana-dashboards-and-visualizations
[10:21:46] I'm staring at a vertical bar chart with more than one column, which is a great achievement already
[10:39:48] https://logstash.wikimedia.org/goto/559fa61c92fed4482ca73fc930c49356
[10:40:13] visualizing enwiki varnish slow log, top-10 cache hosts
[10:43:01] (very proud of that)
[10:44:14] now what happens if I change the saved search
[10:44:17] ema 1, kibana 0
[10:45:43] btw, if you need an excuse to stop looking at kibana: https://gerrit.wikimedia.org/r/c/415260/ - yesterday I deployed pybal 1.15.2 everywhere, so this is ready to be merged :D
[10:46:22] don't stop me now that I'm winning
[10:46:45] haaha sure :D
[10:46:53] OMG this works
[10:46:58] I would have never expected that
[10:47:12] I've changed the search criteria of my saved search
[10:47:21] I've even renamed it, really looking for trouble
[10:47:39] then I went back to the visualization, and it got updated with the new search
[10:47:40] and the graph got updated
[10:47:57] it even updated the saved search name
[10:48:01] this is crazy
[10:52:31] https://logstash.wikimedia.org/goto/bd0e849fde764fa338295d3009aa1af3
[10:53:02] so that's all the slow logs of the last 12h, regardless of request-Host
[10:53:53] around 3AM you'll see that cp3040 starts being featured among the offenders
[10:54:17] then all quiet for a while
[10:54:26] and again at 6:40ish
[10:54:44] but still kinda under control
[10:55:02] then all of a sudden, at 07:10, the world goes on fire
[10:55:39] with the worst-3 being cp3030, cp3040 and cp3032
[10:58:23] I've restarted cp3040 and cp3041 based on how long the backends had been running when trouble arose
[10:58:35] which means that 3030 and 3032 self-recovered somehow
[10:59:44] oh, 3032 was restarted 5 minutes ago by the cron bi-weekly restart
[11:00:05] whereas 3030 has been running for 2 days now
[11:00:34] (bi-weekly sounds like every 15 days to me, and it's driving me crazy, dumb spaniard comment)
[11:01:34] we should be able to figure out what's the difference between a self-recovering varnish and one that needs the reboot extra-help
[11:01:39] besides uptime :)
[11:01:49] s/reboot/restart/
[11:02:06] yup, varnish restart, sorry
[11:03:17] there are two important novelties this morning I think: (1) backends running for just 2 days being prominently featured in varnishslowlog, (2) frontends piling up connections too
[11:03:26] and both are scary
[11:05:06] and I am starving, so see you in a bit :)
[12:12:46] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches - I've added support for choosing the layer (frontend vs. backend)
[12:13:38] the thing wasn't working earlier today because our $server template variable included the port number of the varnish backend prometheus exporter
[12:13:53] so layer="frontend" was returning no results
[12:14:28] solution: adding /(cp[0-9]{4})/ as the regex to be used by the $server template variable
[12:14:58] Traffic, DNS, Operations, Release-Engineering-Team, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4049434 (Pcoombe) The current plan is to move the thank you pages and some other fundraising specific pages...
[12:15:06] and replacing instance=~"$server" with instance=~"($server):.*" in the prometheus queries
[12:15:30] the latter would have taken me hours to figure out without godog's support :)
[12:16:20] with this in place we can figure out what vgutierrez pointed out, namely fe<->be connections also piling up
[12:17:14] neat
[12:17:36] godog: the second graph there is unbearably slow
[12:17:48] we need an aggregation rule
[12:18:05] this is the query: `sum(varnish_backend_conn{layer="$layer", job="varnish-$cache_type",instance=~"($server):.*"}) by(instance)`
[12:18:50] godog: do we need one aggregation rule for layer="frontend" and one for layer="backend" in order to speed things up?
[12:21:19] why is layer="frontend" *much* slower than layer="backend"?
[12:22:24] I assume there's some kind of caching involved somewhere :)
[12:22:44] hehe yeah metrics are loaded from disk when requested, so it might be that
[12:23:05] ema: wrt the aggregation rule we can aggregate per layer, so one rule
[12:25:57] godog: like this? instance:varnish_backend_conn:sum = sum(varnish_backend_conn) by(instance)
[12:27:00] mmh no we probably need all labels?
[12:27:22] so by(instance, layer, job) ?
[12:28:01] instance_layer_job:varnish_backend_conn:sum = sum(varnish_backend_conn) by(instance, layer, job)
[12:29:25] really confusing how we use the term backend for varnish-be (layer) while varnish uses the same term for "origin server"
[12:30:29] ema: yeah that rule LGTM
[12:36:19] ema: btw, can varnish-fe communicate with a varnish-be that doesn't live on the same host?
[12:36:37] vgutierrez: yes, but in the same DC
[12:36:41] ack
[12:36:51] godog: thank you! https://gerrit.wikimedia.org/r/419409
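[ed. note: for reference, the aggregation rule agreed on above would look roughly like the following in Prometheus 2.x rule-file syntax; the group name and file layout are illustrative assumptions, and the production servers may instead use the older 1.x "name = expr" form quoted in the chat. Only the expression itself comes from the discussion:]

    groups:
      - name: varnish  # illustrative group name
        rules:
          - record: instance_layer_job:varnish_backend_conn:sum
            expr: sum(varnish_backend_conn) by (instance, layer, job)

[ed. note: the dashboard panel could then query the pre-aggregated series, e.g. instance_layer_job:varnish_backend_conn:sum{layer="$layer", job="varnish-$cache_type", instance=~"($server):.*"}, instead of summing the raw per-backend series at query time.]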
[12:36:53] dunno why I assumed that fe<-->be was always on ::1
[12:37:27] oversimplification in one interview I guess
[12:38:08] vgutierrez: the varnish-be is chosen by consistent hashing on the request url
[12:38:43] makes sense
[12:39:50] and ::1 is among the slots, so it could also be the chosen one
[12:41:21] so the fe backend connections piling up could be related to the varnish-be backend connections piling up.. cascade/avalanche effect
[12:41:21] this is the default behavior, we've also got random selection of backends in certain cases
[12:41:55] // pass-traffic should not use consistent hashing, to avoid unecessary
[12:41:58] // traffic focus on one node and keep things performant, *if* we're
[12:42:01] // fairly sure that all layers/tiers make equivalent pass decisions...
[13:10:01] bblack: anything against merging the no-thumb.php-to-imgscalers patch? https://gerrit.wikimedia.org/r/#/c/413185/
[13:10:29] there's a fair number of negations in that question
[13:34:24] ema: catching up, various commentary inbound as I go
[13:35:42] re logstash filtering, yeah. One thing I've found is that "is" and "is not" match substrings, so an easy way to kill cache_upload is to do: message is not upload.wikimedia.org
[13:36:13] but aside from cluster, it would also be nice if the actual timestamp fields and maybe some of the headers were actually filterable (the timestamps, I guess, are only useful if they can do numeric ops like >)
[13:36:31] (but even "exists" for the time-foo would help with e.g. finding waitinglist ones)
[13:37:19] the right word for the new restart schedule is "semiweekly" :)
[13:37:36] (biweekly even for native english speakers is ambiguous, but semiweekly isn't)
[13:39:29] your call on thumb.php-to-imgscalers. it might be helpful, it might be something that causes new observations or perturbs ongoing observations. seems sane in general.
[13:41:01] when I trawled the new logstash data, yeah, it's all very interesting but also perplexing in places
[13:41:07] bblack: mainly North American according to the dictionary... but way better than biweekly: "appearing or taking place every two weeks or twice a week"
[13:41:19] * volans proposes to call the restarts pi-days, it's close enough and today is \pi day :D
[13:41:51] LOL
[13:41:54] I think enwiki/Special:Random is an easy target to focus on though. it's a simple one with a consistent URL, it has a well-defined behavior, it never has a long apache D= delay, and it seems to get fetch and/or waitinglist slow entries fairly regularly.
[13:42:47] I think if we find out what's causing that case, it might fix or at least illuminate other more-complicated ones
[13:44:10] ema: also I don't know if you saw all of yesterday's backscroll, but keep in mind that apache D= isn't as inclusive as I had naively hoped before. It technically only covers time to (just before, I guess) outputting response *headers*. proxying the body happens after that delay measurement (duh, since it's a header we get the value from, which is already sent before the body...)
[13:44:32] so there is a possibility that a low D= value could be coupled with a brokenly-long time sending all of the output back to varnish.
[13:45:06] -- end of mental dump --
[13:45:09] yeah I've seen that
[13:45:27] timing backend response is much harder than you'd think :)
[13:45:33] very!
[13:46:11] I think getting lvs1016 into place (hopefully next week) to supplant lvs1003 will clear up some of the ?? in the path between v<->apache though, at least remove some variables.
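[ed. note: the "apache D=" value discussed above presumably comes from a response header set via mod_headers, along the lines of the sketch below (the header name and exact placement in the apache config are assumptions). Apache documents the %D format specifier as the time from when the request was received until the headers are sent on the wire, which is exactly why body transfer time is excluded:]

    # hedged sketch; header name is an assumption, %D/%t semantics are from the mod_headers docs
    <IfModule mod_headers.c>
        # %D expands to "D=<microseconds from request received until response headers are sent>"
        # %t expands to "t=<request receive time, epoch microseconds>"
        Header set Backend-Timing "%D %t"
    </IfModule>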
[13:46:20] do we have any kind of HHVM/apache instrumentation?
[13:46:34] something like new relic?
[13:47:09] aside from S:R above, then yes there's also a lot of query.wikidata.org from cache_misc. ignoring those, there's various random-ish entries, and also a fair amount of RB entries that cut off at ~60s when RB times out I guess.
[13:48:19] I think it's possible that "resp" should be excluded on backend slowlogs as well. my initial interpretation of what I've seen is that some entries only have a slow-but-not-crazy backend resp time, and that it's because it's a fully-streaming transfer through a frontend at client speeds (because there's only one parallel client, and it's a slow one)
[13:48:46] I'm not 100% sure about that, but the ones with only a backend Resp slowtimer seem less interesting by far, and probably are the above.
[13:49:24] aside from chasing S:R in detail, most likely what we're looking for out of the rest of this haystack is a clear sign that some other slow responses might be piling up and then inducing the spike of random slow/fail.
[13:50:13] e.g. the failure mode is preceded by a burst of 60s RB timeouts sucking up connections/threads. of course if they take a while to time out and report, it may appear in varnishlog/slowlog backwards (first see the fallout, then the initial slow ones finally log as they clear up?)
[13:50:24] -- out for now!
[13:50:31] o/
[13:51:59] oh one last thing - a lot of the RB timeouts I observed were for the PDF renderer. might peek at that in general too.
[13:52:06] I think it's like /api/rest_v1/pdf/ or something
[13:55:09] btw, isn't the time frame important at all?
[13:55:20] why does it always happen at 6-9 UTC?
[13:56:37] heh
[13:57:53] as I said earlier on, there's evidence in the slowlog kibana visualization of slow requests piling up on one host even at other times (3AM today)
[13:58:18] but they do tend to self-resolve more quickly and to only affect 1-2 hosts instead of the whole esams army
[14:37:02] do we have documented somewhere on wikitech all the steps, and their order, to add a new public service to cache misc and DNS?
[14:44:47] hello people, as an FYI we are migrating all of eventlogging's varnishkafka instances to kafka jumbo
[15:08:35] volans: I don't think we do, no
[15:09:05] elukey: \o/
[15:09:46] ema: ack, then by any chance could any of you walk me through it?
[15:12:31] volans: 1) hieradata/role/common/cache/misc.yaml - add a new director pointing to an existing backend (if it exists) 2) in DNS, templates/wikimedia.org - add a new name and point it to "600 IN DYNA geoip!misc-addrs"
[15:12:47] volans: commit 00b2a7531a97aaf4df02626f646a6c306c54b41f in puppet for the lvs part (or similar ones for other services)
[15:12:57] that would be the easiest one for "misc service in wikimedia.org using existing backend"
[15:13:15] maybe that's not what you meant
[15:13:51] mutante: thanks, those are new backends, being a new service, but that would do
[15:14:25] I was wondering when the LE part can done without exposing the service before that
[15:14:30] *can be done
[15:16:30] that I don't know, but when I did services they were all websites, so I would set them up and have DNS and Apache config, just no actual docroot content yet
[15:22:10] ack
[15:25:09] netops, Operations: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4050134 (Marostegui)
[15:26:23] and ofc in this case I don't need LE, since *.wikimedia.org is already included in the global cert
[15:27:18] netops, Operations, ops-codfw: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452#4050146 (ayounsi) Open→Resolved Optic replaced by Papaul, will re-open if the errors show up again after a few hours.
[15:30:06] so this is interesting
[15:30:07] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521005558744&to=1521020429908&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=frontend
[15:30:13] (from the frontend perspective)
[15:30:33] 2nd and 4th graph will take forever to load but be patient
[15:30:59] the number of incoming frontend connections does not vary significantly
[15:31:23] but the number of fe<->be connections does (they "pile up" as we're saying)
[15:32:02] which means that this is varnish-fe opening up new connections to its backends
[15:32:50] why is this happening? We retry 503s only once
[15:35:36] also, connections should be re-used
[15:39:49] the more I look into this, the more I suspect that our extrachance patch might be involved
[15:54:17] https://gerrit.wikimedia.org/r/#/c/419460/
[15:54:49] it's arguably better to use upstream fixes instead of our own anyways :)
[15:56:12] this in particular smells funny
[15:56:13] https://github.com/varnishcache/varnish-cache/commit/e0fd9ba67e5234b167214658ba83070cb1b5a3ed#diff-791ad1d4b819c72390dececd07739432L368
[15:58:32] so forcing a fresh connection is only set if extrachance==0, whereas before it was always the case FWIU
[15:58:57] https://github.com/varnishcache/varnish-cache/commit/a8da6ead25479f20acfc23f85e90c1629bfba486#diff-8dfab2c266cb7e7270373622c65bcd86R200
[16:16:00] why aren't we plotting retries yet on varnish-failed-fetches? :)
[16:20:55] added
[16:20:59] https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-24h&to=now
[16:21:07] failed fetches and retries do match
[16:22:04] and each retry means a new call to vbe_dir_getfd (which opens up a backend connection)
[16:28:19] hmmm does varnish provide any sort of circuit breaking?
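[ed. note: as the answer below says, there is no dedicated circuit-breaker object in varnish; the closest built-in knobs are health probes (which mark a backend sick and stop routing fetches to it until it recovers) and max_connections (which makes fetches fail fast instead of queueing). A hedged VCL 4.0 sketch, with placeholder host/port and illustrative thresholds rather than production values:]

    # hedged sketch; host, port and numbers are placeholders, not the production config
    backend be_example {
        .host = "cpNNNN.eqiad.wmnet";   # placeholder varnish-be host
        .port = "3128";
        .max_connections = 1000;        # refuse further fetches instead of piling them up
        .probe = {
            .url = "/check";            # placeholder health-check URL
            .interval = 5s;
            .timeout = 2s;
            .window = 5;
            .threshold = 3;             # healthy only if >=3 of the last 5 probes succeeded
        }
    }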
[16:28:41] (I'm kinda trapped in the HR cultural orientation)
[16:29:18] I don't know what either of those sentences mean :)
[16:29:47] regarding circuit breaking: https://martinfowler.com/bliki/CircuitBreaker.html
[16:32:35] no, there's no specific "object" taking care of avoiding remote calls if shit hits the fan, as far as I'm aware
[16:33:54] this is a usual approach in microservices, you mark your backend and faulty and you don't hit it for a while
[16:34:02] *as faulty
[16:36:52] at any rate, timing-wise all this seems to be working as follows:
[16:37:04] 1) connections start "piling up"
[16:37:16] 2) fetches begin to fail
[16:37:29] 3) retries start happening
[16:39:35] see the example of cp3030's backend this fine EU morning
[16:39:36] https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520991928627&to=1520992788386&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-layer=backend
[16:42:30] hmmm that should impact the mw servers, right?
[16:42:49] this is esams<->eqiad cache backends
[16:43:46] then eqiad cache should scream a little bit at least with those peaks
[16:45:51] what I'm trying to say is that we should avoid varnish hitting its backends like crazy
[16:47:48] right, yes, it's not that we're doing it strictly for the lulz :)
[16:48:11] rofl
[16:48:20] we're trying to a) figure out why this is happening, b) stop this from happening
[16:51:56] so there's something interesting
[16:51:58] https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521003941848&to=1521021873571&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=7&fullscreen
[16:52:37] we had a big spike in incoming connection *rate* to eqiad backends today between 6:12 and 6:31
[16:53:30] after that, esams backends begin their morning routine of craziness
[16:53:34] https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521005656952&to=1521033472992&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[16:54:13] where did that incoming spike of connections in eqiad come from? Is the timing just a coincidence?
[16:56:15] not from codfw backends
[16:56:26] not from esams backends
[16:56:33] and not from eqiad frontends
[16:56:46] which to the best of my knowledge where the only possible options
[16:56:51] s/where/were/
[16:57:31] so essentially every time I find something new my understanding of the issue decreases
[17:00:50] Saturday's incident was around 13:30 UTC?
[17:01:16] huge spike on eqiad as well: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520677809395&to=1520712826993&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=7&fullscreen
[17:03:12] vgutierrez: 13:30 was one of the largest spikes yes
[17:03:17] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=1520685019110&to=1520693876519
[17:06:00] Wikimedia-Apache-configuration, Operations, User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4050615 (fgiunchedi) See also https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxystatus
[17:09:07] so, two (I think) new observations today, staring at the graphs:
[17:09:56] there's an interesting timing of events around fe<->be failed fetches/retries/reconnections. First we get connections piling up, then fetches fail, and only later retries happen
[17:10:19] -- that was observation (1)
[17:11:07] (2) we've had a sudden spike in incoming backend connections to eqiad, which ended right before esams troubles began
[17:11:39] (2b) I have no idea where those connections came from
[17:12:22] (1) https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520991928627&to=1520992788386&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3030&var-layer=backend
[17:12:45] (2) https://grafana-admin.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521003941848&to=1521021873571&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=7&fullscreen
[17:13:34] o/
[17:18:26] Wikimedia-Apache-configuration, Operations, User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4050672 (fgiunchedi) Including something like this might help as well `LogLevel warn proxy:info proxy_http:info proxy_balancer:info`
[18:30:09] Traffic, DNS, Operations, Release-Engineering-Team, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4051066 (Varnent) >>! In T188776#4046204, @jhsoby wrote: >>>! In T188776#4021634, @Varnent wrote: >>>>! In T...
[19:40:13] Traffic, Operations, Performance-Team, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4051353 (Krinkle)
[21:16:27] Traffic, Cloud-Services, Cloud-VPS, Operations: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#4051724 (Andrew) Open→Declined This is moot, californium is moving into the spare pool.
[21:38:57] Traffic, Cloud-Services, Cloud-VPS, Operations: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#2223542 (bd808) It's slightly off topic with californium being decommed, but things that are allocated for Cloud Services infrastructure are generally being moved into...
[22:53:20] netops, Operations, ops-eqiad: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4051974 (ayounsi)
[23:11:37] netops, Operations: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4051996 (ayounsi) First, the questions and troubleshooting commands listed on https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue are useful to us. Espec...
[23:12:20] netops, Operations: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4051997 (ayounsi) a: Samtar