[09:29:47] 10Traffic, 06Operations: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3167754 (10Aklapper) Feel free to bring up any further discussion topics on the talk page of https://meta.wikimedia.org/wiki/Sustainability_Initiative which is the centralized place.
[11:10:44] 10Traffic, 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3167863 (10fgiunchedi) p:05Triage>03Normal
[11:11:27] 10Traffic, 06Commons, 06Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3167864 (10fgiunchedi) p:05Triage>03Normal
[12:45:01] 10Traffic, 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3168117 (10BBlack) 05Open>03Resolved a:03BBlack Merge above should fix this, at least for this case and any others on our cache ter...
[12:55:00] so, what about the 4.9 stuff? Did we end up deciding there were some things to fix with our 4.9 setup before we reboot more caches, or?
[12:55:15] ema: ^?
[12:56:07] bblack: nope! The problems we found seemed to be confined to one host (the one that took ~3 minutes to bring up eth0)
[12:56:14] so I'd say we can carry on
[12:57:24] ok cool
[12:57:28] I was now trying to figure out what's going on with phabricator
[12:58:21] bblack: there are clusters of errors in iridium:/var/log/apache2/phabricator_error.log, who can I ping about that?
[12:59:09] perhaps twentyafterfour
[12:59:58] 10Traffic, 10netops, 06Operations: knams equipment move - https://phabricator.wikimedia.org/T162601#3168183 (10ayounsi)
[13:11:02] ema: btw, we figured the evoswitch remote hands last week for other reasons
[13:11:12] so if you want to attempt fixing cp3003, we can
[13:11:36] that said, I doubt we have spare parts (like a DAC) anywhere they could access
[13:12:20] unless we have 10G servers that we have decom'ed that are still racked, not sure :)
[13:12:30] (talking about T162132)
[13:12:31] T162132: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132
[13:26:59] paravoid: thanks!
[13:30:10] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3168304 (10Aklapper) I fail to load https://upload.wikimedia.org/wikipe...
[13:35:14] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3168334 (10ema) >>! In T162035#3168304, @Aklapper wrote: > I fail to lo...
[13:38:25] ema: re the CE:gzip variant being co-reported, I agree that's probably separate (and probably has been ongoing for a while, but going unreported due to low incidence rate?)
[13:38:48] I seem to remember it was also one of the possible symptoms of the CL:0 bug we were fighting forever
[13:39:26] we still have some CL:0 hacks in place on cache_upload, I don't know if we ever finished looking into that to decide on a real fix or removing the hack
[13:39:42] e.g. in upload-common:
[13:39:44] sub upload_common_backend_response {
[13:39:44] // Debugging T144257. Don't cache 200 responses with CL:0.
[13:39:44] if (beresp.http.Content-Length == "0" && beresp.status == 200) {
[13:39:45] T144257: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257
[13:39:48] set beresp.ttl = 0s;
[13:39:50] set beresp.uncacheable = true;
[13:44:18] bblack: mmh, in this case CL is > 0 though
[13:48:32] yeah
[13:49:46] T148830 ?
[13:49:47] T148830: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830
[13:50:54] that looks like it yeah
[13:52:48] re-opening
[13:55:17] 10Traffic, 06Operations: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#3168355 (10ema) 05Resolved>03Open Reopening, another instance of this bug has been reported in T162035#3168304.
[14:20:43] 10Traffic, 10netops, 06Operations: knams equipment move - https://phabricator.wikimedia.org/T162601#3168403 (10ayounsi) After discussion with @BBlack As knams going down will not impact connectivity between esams and eqiad, and esams has enough transit capacity to take over knams transits, the following pla...
[14:32:37] bblack: I'm gonna carry on with some 4.9 upgrades now
[14:33:34] ok
[15:09:45] moritzm: FYI cp2015 seems to be showing the same issue we've seen last week, it took a while bringing up eth0
[15:09:56] identical hardware?
[15:10:09] 3min 1.271s networking.service
[15:11:55] moritzm: yes
[15:12:03] (the other host was cp2006)
[15:16:10] similar backtrace of a hung process as on cp2006, I'd say let's blacklist uncore_pci on those hosts?
[15:16:34] intel_uncore I meant
[15:16:49] moritzm: seems reasonable, I've opened T162612 to track down the problem
[15:16:50] T162612: codfw hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612
[16:45:36] all maps servers running 4.9
[16:45:48] \o/
[17:19:25] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168916 (10Papaul) main board replacement complete on lvs2002, System is back up. @elukey please check everything is okay while I am on site. Thanks.
[17:29:14] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168958 (10Papaul) a:05Papaul>03elukey
[17:32:45] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3168972 (10BBlack) a:05elukey>03BBlack Switching this to me
[17:59:47] 10netops, 06Operations, 10ops-eqiad: Faulty optics on asw-b-eqiad:xe-1/1/2 - https://phabricator.wikimedia.org/T162199#3169051 (10ayounsi) 05Open>03Resolved Interface has been stable. Everything looks good. Thanks!
[18:02:04] 10netops, 06Operations: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3169060 (10ayounsi) 05Open>03Resolved a:03ayounsi > XioNoX> I'm secretly hoping that T154507 was caused by T162199, it's on the path, and the LACP hashing algorithm would expla...
[18:11:53] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3169146 (10BBlack) a:05BBlack>03ayounsi @papaul Everything looks good with lvs2002 (checked icinga, interfaces on correct vlans, etc). @ayounsi Let's let it burn in with no traffic until tomorrow s...
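On a stock Debian install, the intel_uncore blacklist moritzm and ema agree on at 15:16 above would normally come down to a modprobe.d fragment plus an initramfs rebuild so the module also stays out of early boot. A minimal sketch assuming that mechanism — the file path and the initramfs step are illustrative, nothing in this log settles how it was actually deployed:

    # keep intel_uncore from autoloading on the codfw hosts that hang
    # while bringing up eth0 (T162612); file name is illustrative
    echo 'blacklist intel_uncore' > /etc/modprobe.d/blacklist-intel_uncore.conf
    # rebuild the initramfs so the blacklist is also honoured at early boot
    update-initramfs -u

In practice a change like this would presumably be rolled out via puppet rather than run by hand on each host.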
[19:19:37] bblack: I have deployed https://gerrit.wikimedia.org/r/#/c/346543/, but I still see mostly no traffic to the codfw cluster... did I miss something?
[19:25:01] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching): Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3169439 (10Smalyshev) > I'm also leading to using the dash. data would be equivalent to main-data I like the idea of no dash...
[19:25:29] <_joe_> gehel: did you run puppet everywhere on caches?
[19:25:38] <_joe_> and specifically in codfw?
[19:26:15] I let puppet run its course, but change was merged > 2h ago...
[19:26:31] <_joe_> oh ok then
[19:27:51] we don't have that much traffic and a lot of it is bots, so we might just have all clients close geographically, but it does seem like a suspicious explanation...
[19:27:58] <_joe_> gehel: so that's for query.wikidata.org right?
[19:28:00] <_joe_> yeah
[19:28:03] right
[19:28:36] <_joe_> so one thing you could try is to set in your /etc/hosts a record for that that sends you to the ulsfo or codfw IP
[19:28:46] <_joe_> and tail requests in the logs in codfw
[19:28:51] <_joe_> you should see your requests
[19:28:59] good idea, will try...
[19:29:40] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3169490 (10Papaul) @BBlack Thanks.
[19:29:55] <_joe_> most bots will go to eqiad as they're coming from toollabs
[20:28:35] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169636 (10Gehel) I'm not sure the change is effective. While I do see a few requests (outside of pybal / icinga) in the nginx logs on the wdqs codfw servers, I don't see a...
[20:28:42] gehel: yeah most likely things like that
[20:30:47] for a quick direct example, misc-web-lb.ulsfo is 198.35.26.120 and misc-web-lb.esams is 91.198.174.217
[20:31:13] bblack@alaxel:~$ curl -v https://query.wikidata.org/ --resolve query.wikidata.org:443:198.35.26.120 2>&1 |grep -i X-Cache:
[20:31:17] < x-cache: cp2006 hit/4, cp4001 miss, cp4003 miss
[20:31:18] bblack@alaxel:~$ curl -v https://query.wikidata.org/ --resolve query.wikidata.org:443:91.198.174.217 2>&1 |grep -i X-Cache:
[20:31:21] < x-cache: cp1058 hit/2, cp3010 pass, cp3008 hit/1
[20:31:40] (this doesn't actually document your backend servers, but it documents the unique non-overlapping cache paths)
[20:34:52] * gehel needs a few minutes to parse that...
[20:35:18] * gehel is just deploying the logstash upgrade as well... not good at multitasking
[20:35:27] the x-cache lines document the caches the response passed through, deepest cache on the left, front-most on the right
[20:35:57] so querying wdqs via esams edge uses esams (frontend), esams, eqiad. hitting the ulsfo edge uses ulsfo (frontend), ulsfo, codfw.
[20:36:36] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169658 (10Smalyshev) @Gehel: you can check x-served-by headers in the responses - half of those should have codfw hosts there now.
[20:38:06] bblack: so if I understand you well, it looks good, we just happen to not have much traffic going to codfw?
[20:38:13] gehel: right
[20:38:26] oh I didn't know about x-served-by
[20:39:00] uh, I don't see that header in wdqs output, at least not on the main page
[20:40:15] X-Served-By (case sensitivity?)
[20:40:59] Ah, probably served only by blazegraph, so not on the front page
[20:41:43] any idea why I don't yet see those hits in graphite?
[20:51:38] what's a test URL that has the header?
[20:52:27] ah I copied one from your ticket, I'll work from that for a better example
[20:53:38] gehel:
[20:53:39] bblack@alaxel:~$ curl -v 'https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=%23Streets%20without%20a%20city%0ASELECT%20%3Fstreet%20%3FstreetLabel%0AWHERE%0A%7B%0A%20%20%20%20%3Fstreet%20wdt%3AP31%2Fwdt%3AP279*%20wd%3AQ79007%20.%0A%20%20%20%20%3Fstreet%20wdt%3AP17%20wd%3AQ142%20.%0A%20%20%20%20MINUS%20%7B%20%3Fstreet%20wdt%3AP131%20%5B%5D%20%7D%20.%0A%09SERVICE%20wikibase%3Alabel%20%7B
[20:53:45] %20bd%3AserviceParam%20wikibase%3Alanguage%20%22fr%22%20%7D%0A%7D%0AORDER%20BY%20%3FstreetLabel' -H 'Accept: application/sparql-results+json' -H 'User-Agent: curl (testing/gehel)' --resolve query.wikidata.org:443:198.35.26.120 2>&1|grep -i X-Served-By:
[20:53:49] < x-served-by: wdqs2002
[20:53:52] bblack@alaxel:~$ curl -v 'https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=%23Streets%20without%20a%20city%0ASELECT%20%3Fstreet%20%3FstreetLabel%0AWHERE%0A%7B%0A%20%20%20%20%3Fstreet%20wdt%3AP31%2Fwdt%3AP279*%20wd%3AQ79007%20.%0A%20%20%20%20%3Fstreet%20wdt%3AP17%20wd%3AQ142%20.%0A%20%20%20%20MINUS%20%7B%20%3Fstreet%20wdt%3AP131%20%5B%5D%20%7D%20.%0A%09SERVICE%20wikibase%3Alabel%20%7B
[20:53:58] %20bd%3AserviceParam%20wikibase%3Alanguage%20%22fr%22%20%7D%0A%7D%0AORDER%20BY%20%3FstreetLabel' -H 'Accept: application/sparql-results+json' -H 'User-Agent: curl (testing/gehel)' --resolve query.wikidata.org:443:91.198.174.217 2>&1|grep -i X-Served-By:
[20:54:02] < x-served-by: wdqs1001
[20:54:17] (as with the x-cache line, this is showing that reqs entering ulsfo hit apps in codfw, and reqs entering esams hit eqiad)
[20:54:25] yep, I see the same. So traffic is really served from codfw, but much lower than I expected
[20:55:01] eqiad is the default geoip route for un-locateable traffic, and also for all of labs
[20:55:43] the (dns) source of the traffic has to be in the western or central area of the US or asia, basically, to reach codfw by default
[20:55:51] Oh, and I'm so stupid...
[20:56:15] My grafana dashboard was filtering on eqiad, so no codfw hits were showing...
[20:56:21] :)
[20:57:14] Sorry for wasting your time :( and thanks for the help!
[20:57:45] it's never a waste of time to verify! this is only the 3rd service to turn on active/active, and the other two aren't "real" services, just internal ops simplistic stuff :)
[20:59:11] well, it does look like it is working!
[21:01:05] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3169714 (10Gehel) grafana dashboard was wrongly filtering on eqiad only (that's why I did not see any traffic there). More tests and checking x-cache and x-served-by headers...
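For reference, the check bblack walks through above boils down to hitting the service through each edge LB address with --resolve and comparing the X-Cache / X-Served-By response headers. A rough sketch reusing the two addresses quoted in the log (current as of this conversation and possibly changed since; the loop itself is illustrative, not something that was run here):

    #!/bin/bash
    # Compare the cache path for query.wikidata.org via the ulsfo and esams
    # edge LB addresses quoted above. Note that X-Served-By only appears on
    # URLs answered by blazegraph itself, not on the static front page.
    url='https://query.wikidata.org/'
    for ip in 198.35.26.120 91.198.174.217; do
        echo "== via $ip =="
        curl -sv "$url" --resolve "query.wikidata.org:443:$ip" -o /dev/null 2>&1 \
            | grep -iE '^< (x-cache|x-served-by):'
    done

Requests entering ulsfo should show cp4xxx frontends backed by cp2xxx (codfw), and requests entering esams should show cp3xxx backed by cp1xxx (eqiad), matching the paths in bblack's output above.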