[06:06:53] Traffic, Operations: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) p:Triage→Normal
[06:18:12] Traffic, Operations: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) I cannot reproduce the issue right now, but it does look like a strange interaction between the application servers and varnish. >>! In T226776#5290831, @CD...
[06:24:47] Traffic, Operations: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (TheDJ) >>! In T226776#5291441, @ema wrote: > Why did you conclude that? the second after I refreshed after: > <+wikibugs> (CR) Andrew Bogott: [C: +2] nova-full...
[06:24:56] Traffic, Operations, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ema) @Trizek-WMF: personally, I've t...
[06:33:57] Traffic, Operations: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) >>! In T226776#5291445, @TheDJ wrote: > the link started working again. Two things had changed at that time, the thing above and I had logged out and back i...
[07:09:13] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2005.codfw.wmnet'] ` The log can be found in `...
[07:23:20] Traffic, Operations, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Pruem) @Trizek-WMF: It would help if...
[07:48:59] Traffic, Operations: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2005.codfw.wmnet'] ` and were **ALL** successful.
[08:29:18] Traffic, Operations: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805 (ema)
[08:29:32] Traffic, Operations: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) >>! In T226776#5291447, @ema wrote: > Well, the error response was surely generated by the applayer and not by varnish (the latter only generates synthetic r...
[08:30:23] Traffic, Operations: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805 (ema) p:Triage→Normal
[08:31:06] vgutierrez: I've triaged T226805 as TLS mostly because I'm bored of always putting everything under Caching :)
[08:31:07] T226805: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805
[08:32:45] LOL
[09:21:44] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2008.codfw.wmnet'] ` The log can be found in `...
[09:29:16] legoktm: hi :) This one is for you https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519606/
[09:59:23] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2008.codfw.wmnet'] ` and were **ALL** successful.
[10:08:38] netops, Operations, User-fgiunchedi: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 (fgiunchedi)
[10:41:14] netops, Operations, User-fgiunchedi: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 (jbond) p:Triage→Normal
[10:46:31] I am seeing something weird for eventstreams
[10:46:38] https://stream.wikimedia.org/v2/stream/recentchange returns 502 now, from nginx
[10:47:30] I checked on scb* nodes and the backend seems working
[10:47:44] I am a little bit puzzled about nginx returning 502 here though
[10:47:49] ema: --^ (if you have time)
[10:49:03] or vgutierrez
[10:49:18] uh
[10:49:23] interesting
[10:49:39] errr is it working for me here
[10:49:43] *it's
[10:49:57] elukey: which DC? esams?
[10:50:35] yeah I should go through it
[10:50:50] even if I received the alarm from icinga1001
[10:51:55] ah lovely https://grafana.wikimedia.org/d/000000612/frontend-responses-nginx-vs-varnish?orgId=1
[10:52:29] seems esams and eqiad vgutierrez
[10:52:51] yeah, I'm reaching that via eqson
[10:52:53] *eqsin
[10:56:53] * vgutierrez swithing bastion hosts...
[10:56:57] *switching
[10:58:31] so.. error log on cp3032 is complaining about stream.wm.o
[10:59:32] https://www.irccloud.com/pastebin/PbOnMdLh/
[10:59:41] elukey: ^^
[11:00:01] take into account that upstream in that context means the varnish-fe instance sitting on cp3032
[11:00:02] I was trying to get the same
[11:00:44] is that from unified error log?
[11:00:48] I am on cp3033
[11:00:57] yes
[11:01:21] weird, nothing on cp3033
[11:03:15] so I can trigger it from here with a simple curl --resolve stream.wikimedia.org:443:91.198.174.192
[11:04:14] it looks like I'm hitting cp3042
[11:06:31] start time seems to be 10:18:30
[11:08:00] interesting.. nginx interprets that's a net issue and retries 8 times
[11:08:06] (once per varnish-fe port)
[11:08:51] so sudo varnishlog -n frontend -q 'ReqHeader ~ "Host: stream.wikimedia.org"' doesn't return anything
[11:09:11] ah no wait something is returning, takes a while
[11:25:28] mobrovac: are you around by any chance?
[11:27:35] I am now wondering if what I did this morning for codfw triggered some weird behavior
[11:27:57] so I logged at 8:43 in the SAL the restart of all the eventstreams daemons on scb2* (codfw)
[11:28:03] because they were not working
[11:28:14] the ramp up
[11:28:15] https://grafana.wikimedia.org/d/000000336/eventstreams?refresh=1m&orgId=1&from=now-12h&to=now&var-stream=All&var-topic=eqiad_mediawiki_revision-create&var-scb_host=All
[11:28:19] seems to more or less match
[11:28:28] (the start of the ramp up)
[12:01:45] so.. it's ramping up back again
[12:02:39] Traffic, Operations, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Vort) Problem was reproduced by me j...
[12:20:35] Traffic, Operations, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (TheDJ) Just happened to me again as...
[12:35:12] netops, Operations: RPKI Validation - https://phabricator.wikimedia.org/T220669 (ayounsi) Classification successfully deployed in ulsfo/codfw/eqdfw/eqord (half-ish of our POPs), will push to the other sites early next week. Then start dropping invalids on IXPs to see the effect it has in term of traffic...
[12:57:11] Traffic, Operations, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Vort) Here is a screenshot of laggy...
[13:53:01] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi) The queries for `varnishstatsd` metrics I've been able to find during the audit: ` (varnish.$dc.backends.be_*api_svc*.GET.sample_rate, 60...
[14:10:38] Traffic, Operations: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2011.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim...
[14:48:27] Traffic, Operations: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2011.codfw.wmnet'] ` and were **ALL** successful.
[14:49:01] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (tramm) >>! In T204056#5285817, @jcrespo wrote: > This is blocked on @CRoslof or someone else from legal. Is there any more information we can provide on the i...
[16:01:54] Traffic, MediaWiki-extensions-CentralAuth, Operations, Wikimedia-production-error: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Krinkle)
[16:02:20] Traffic, MediaWiki-extensions-CentralAuth, Operations, Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Krinkle)
[16:04:09] Traffic, MediaWiki-extensions-CentralAuth, Operations, Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (CDanis) NB that the default limit in Varnish actually...
[17:16:47] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn) Stalled→Open
[22:55:11] ema: +1'd