[08:00:23] 10netops, 06Operations, 10ops-eqiad: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3213051 (10elukey) >>! In T163002#3211701, @Ottomata wrote: > Hm, we got a problem! These Kafka nodes are in the Analytics VLAN networks, AND have IPv6 configured.... [09:24:11] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3183025 (10akosiaris) analytics1-b-eqiad IPv6 does exist and it's `2620:0:861:105/64`. I 've update the DNS templates in the patch above to refl... [09:31:16] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213176 (10Joe) For the record, @akosiaris and me switched etcd client traffic to codfw to allow relocating conf1003 with ample time. [09:48:45] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213213 (10elukey) >>! In T148506#3213176, @Joe wrote: > For the record, @akosiaris and me switched etcd client traffic to codfw to allow relocat... [12:11:33] 10netops, 06Operations: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3213510 (10ayounsi) 1. No errors on telia's side, see: ``` show interfaces TenGigE0/12/0/25 detail Mon Apr 24 11:01:19.740 CET TenGigE0/12/0/25 is up, line protocol is up Interface state transitions: 1... [12:16:43] all cache hosts upgraded to linux 4.9 except for cp1066 (broken drac, not sure how safe it is to reboot it) [12:16:57] (T163889) [12:16:57] T163889: cp1066 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T163889 [12:20:53] ema: \o/ [12:21:05] ema: what do you mean about "how safe it is to reboot it"? [12:21:42] bblack: if it doesn't reboot properly, we've got no way to fix that :) [12:22:05] oh it's still up you mean [12:22:21] yeah, I haven't upgraded it because T163889 [12:22:21] T163889: cp1066 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T163889 [12:22:25] ok [12:22:46] no reasons to believe it *should* fail rebooting, all other cache hosts in eqiad went totally fine [12:22:52] I'd say don't set the auto-repool file [12:23:07] so if it doesn't come up we don't have to worry about when it does from pulling power cords or whatever >8h later [12:23:15] ok! [12:33:03] long reboot, but it did come back up fine \o/ [12:41:44] mmh all upload backends in esams and ulsfo are returning a few 503s [12:41:48] codfw is fine though [12:43:06] I spoke too early, cp2014. 
Depooling [12:45:18] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp2014&var-datasource=codfw%20prometheus%2Fops [12:45:29] no significant mailbox lag [12:46:33] 12:42:58 is when fetch failed spiked [12:47:17] and at the same time backend objects expiration rate spiked too [12:50:23] cached memory started going down at ~12:20 [12:52:14] but before all that, backend connections started piling up: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&var-server=cp2014&var-datasource=codfw%20prometheus%2Fops&from=1493203288562&to=1493211061960 [12:52:25] that is: varnish-be <-> swift [12:55:20] interestingly, a similar issue is affecting ulsfo: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4015&var-datasource=ulsfo%20prometheus%2Fops [12:59:50] and now cp2017, restarting varnish-be there [12:59:56] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213672 (10Marostegui) The databases affected by the move are now off and can be moved anytime: ``` es1019 db1094 db1093 db1092 db1091 ``` [13:02:23] trying to repool cp2014 and log some errors [13:15:26] no trouble on cp2014 yet [13:17:48] sounds like swift overload on object expiration? [13:17:50] maybe? [13:18:01] I don't know that depool/restart is going to help with that [13:20:33] mmh, they were all 'cp2017 int' [13:21:18] I think they would have been cp2107 miss/pass if the error came from swift? [13:22:15] oh I see, you don't mean 503s from swift but rather swift causing the fetch to fail [13:22:24] maybe? :) [13:23:51] at any rate, it seems that cp2017 got fixed by varnish-be restart, cp2014 simply by a ~20m depool [13:24:22] I'm keeping an eye on the other upload-codfw hosts, will varnishlog next time it happens [13:27:34] note that there is a low but steady rate of 503s in esams too [13:27:58] seems like esams hosts are failing to fetch from codfw (without codfw failing the fetches from swift) [13:28:18] perhaps fallout from the expiry thread prio patch? [13:28:57] I don't remember seeing this before [13:39:55] check and see if the expiry thread is locking up a cpu core or something? [13:40:02] there should be evidence if it's causing a problem [13:42:33] not locking up at the moment [13:42:47] htop -p `ps -flLcu vcache | awk '$8=="FF" { print $6 }'` [13:43:40] but yeah we should check again when the errors happen [13:50:38] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3213910 (10Andrew) We're pretty sure that the only Labs thing affected by this is instance creation. I've disabled instance creation for now, wi... [14:04:36] I think we missed one formal thing in the switchover, the security_audit backend in hieradata/role/common/cache/text.yaml. I guess it should be switched too right? 
(although not affecting users AFAICT) [14:38:59] there's no need to touch security_audit [14:39:13] it only exists in labs, but it's included in the production data because of whatever horrible things about how it was implemented [14:39:25] <_joe_> lol [14:42:45] lol, ok, at least it explains the asymmetric output from the puppet run during the switchover (the be_appservers_svc_eqiad_wmnet not being removed) [14:46:08] ah [14:46:11] yeah good point [14:54:03] the 503s in esams are due to -- FetchError http first read error: EOF [14:55:56] (and the exp threads in codfw are not locking up) [15:13:55] sorry I'm distracted from actual techie things [15:14:21] so we've got a few things going on here right now, right? [15:14:48] yes :) [15:14:53] Eqiad Row D maint, some kind of mysterious upload 503 issues in multiple DCs [15:15:18] is there any chance we're having actual network-layer issues between DCs? [15:15:46] (or that eqiad upload machines going offline is somehow impacting ulsfo/esams/codfw because they're still health-checking those backends even if they're not sending them traffic??) [15:16:03] (I'd doubt that, just digging for a correlation) [15:16:06] mmh no, the 503s started before row 8 maint [15:16:10] ok [15:16:26] smokeping is all green between DCs [15:17:26] XioNoX: thanks for checking [15:24:32] yeah so the 503s seem to correlate across all of codfw+ulsfo+esams [15:24:45] so it kinda has to be something in codfw (or beneath, but probably not) [15:25:30] what's interesting is that I didn't get any "Couldn't get storage" yet [15:25:57] well we can always get storage now, because the thing that frees up the objheads is higher priority than anything else :) [15:26:46] right, but somehow the pattern feels similar to the cant-get-storage issue in the sense that it occurred after a couple of days of varnish-backend uptime [15:27:12] looking at ~12-13 on cp2014 [15:27:31] there was a very odd drop in Cached mem there too [15:27:39] yeah [15:28:12] and a timewait spike + lack of estab in tcp stats [15:28:40] varnish-backend was depooled between 12:45 - 13:05 [15:28:47] client req rate dipped (before bouncing up in reaction) [15:28:54] backend connections ramped up hard [15:29:25] expiry mailbox lag was spiking up to values around 40-60 too [15:30:17] honestly, it really looks like it could've been problems reaching swift [15:30:33] if it wasn't for the fact that we have that ongoing dangerous patch experiment, I would've looked there first [15:32:36] it ended up making tons of backend connections in parallel [15:32:43] which is what it would do if those requests were stalling out [15:33:02] the question is: were those swift requests stalling out because of swift, or because of varnish internal problems [15:33:58] the cached mem dip happened before the backend connection spike though [15:34:03] 12:30-12:45 is kind of the critical window of "503s were appearing, but not depooled yet" [15:34:26] cached mem dip starts at about 12:20 [15:34:53] but the count of backend connections to swift started ramping off into the sunset back around 12:00 [15:35:05] (really even a bit earlier than that) [15:35:45] but during that initial period of ~12:00-12:30 while connections were ramping up, there's not much else giong on in terms of locks or expiries, etc [15:36:28] it's happening on cp2024 as we speak (backend conns ramping up) [15:36:36] ok good [15:36:41] let's look at logs there? 
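(A hedged sketch of the sort of varnishlog watch being set up at this point; ema's exact invocation isn't shown in the log, and the flag/instance choices below are assumptions for the two-instance cache setup.)
```
# Watch backend fetches that fail or come back as 503, with URL and timing.
# Run against the backend varnishd (add "-n frontend" for the frontend instance).
varnishlog -b -g request \
  -q 'BerespStatus == 503 or FetchError' \
  -i BereqURL -i BerespStatus -i FetchError -i Timestamp
```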
[15:36:54] I've got varnishlog open on all upload-codfw, no 503s yet [15:37:12] can you tell if swift fetches are getting slow/stalled? [15:37:47] let's see [15:38:36] -- Timestamp BerespBody: 1493221107.423026 26.965755 26.937674 [15:38:51] - Timestamp BerespBody: 1493221089.832732 12.286247 12.214209 [15:39:00] yeah some long ones sprodically [15:39:04] which could tie up backend conns [15:39:21] my long one was on /wikipedia/commons/thumb/b/bd/Noctua.pronuba.7592.jpg/300px-Noctua.pronuba.7592.jpg [15:39:42] godog: everything good with swift in codfw? We're observing slowdowns in varnish-be<->swift fetches [15:40:11] Beresp timestamp was quick (headers), the 12s was transfer time for the bytes of body [15:40:47] but again, it could still be varnish-internal. e.g. those 12s are spent locked up on some kind of storage allocation issue while it's fetching the body. [15:41:51] ema: checking, I don't see errors on https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1&from=now-3h&to=now-1m but it is peak time of the day with ~2k req/s so it might be slow alright [15:42:29] another 12s fetch I looked at was /wikipedia/en/thumb/9/93/MacRumors.svg/512px-MacRumors.svg.png , which is a little 30KB file [15:42:42] it doesn't make sense for it to take 12s to fetch that body for sure, the question is just why [15:42:55] right [15:43:07] and perhaps the 503s start when the fetch starts after between_bytes_timeout [15:43:21] !log VRRP priority removed, interfaces cr2/asw2 renamed - T148506 [15:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:29] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [15:43:50] or first_byte_timeout, or one of those :) [15:46:42] I'm looking at established connections on ms-fe2005 and there's two orders of magnitude more connections from cp2024 than the rest, expected? [15:46:55] like ~1400 vs ~15 from the rest [15:46:57] -q 'Timestamp ~ "BerespBody: [0-9.]+ [1-9][0-9]\."' [15:47:05] godog: known but not normal [15:47:07] ^ good varnishlog filter to catch just the ones with long fetch times [15:47:22] godog: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&from=now-3h&to=now&var-server=cp2024&var-datasource=codfw%20prometheus%2Fops [15:47:43] godog: yeah we're observing the excess connections too. Varnish's first-level stats explanation of that is that fetches over those connections are taking too long, so it's having to spawn more in parallel to handle user requests [15:48:06] godog: the question is whether swift is actually responding slow occassionally, or it's varnish's own internal mess slowing it down [15:48:28] ah ok, thanks I'll be looking for slow requests from swift's POV too [15:49:57] say longer than 5s ? 
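(An aside on godog's connection-count check a few lines up — ~1400 established connections from cp2024 versus ~15 from its peers. A hedged one-liner for that kind of per-client tally on a swift frontend; the proxy listening on port 80 is an assumption.)
```
# On ms-fe200x: count ESTABLISHED connections per remote IP hitting the proxy.
ss -tn state established '( sport = :80 )' \
  | awk 'NR>1 {sub(/:[0-9]+$/, "", $4); print $4}' | sort | uniq -c | sort -rn | head
```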
[15:50:01] yeah [15:50:07] we've got a fair number that as 12s+ [15:50:37] - BereqURL /wikipedia/commons/thumb/4/44/Trittau_in_OD.svg/160px-Trittau_in_OD.svg.png [15:50:40] - Timestamp BerespBody: 1493221812.381432 13.614155 13.585140 [15:50:43] - BereqURL /wikipedia/commons/thumb/1/13/Manni-breuckmann-in-hamm.jpg/220px-Manni-breuckmann-in-hamm.jpg [15:50:46] - Timestamp BerespBody: 1493221814.747388 13.801622 13.694901 [15:50:48] - BereqURL /wikipedia/commons/thumb/f/f8/Moldavian_SSR_1940.jpg/180px-Moldavian_SSR_1940.jpg [15:50:51] - Timestamp BerespBody: 1493221815.199637 27.571708 27.504118 [15:50:53] ^ catching those with my filter, they come in pretty quickly, 13-41s delay on transfer [15:50:56] - BereqURL /wikipedia/commons/thumb/9/94/Frederick_Henry_Mueller_pers0162.jpg/350px-Frederick_Henry_Mueller_pers0162.jpg [15:50:59] - Timestamp BerespBody: 1493221815.998683 41.205522 41.139799 [15:51:18] we've just got a 503 burst on cp2024, see ~ema/2017-04-26-503-2.log for the logs [15:51:18] does swift send a header indicating which cluster member is responding or any other kind of helpful detail? [15:51:30] well it probably ran out of backend conns [15:51:44] bblack: there should be a transaction id in the response yeah [15:52:27] wikipedia-commons-local-thumb.e1/e/e1/Air_Cdre_I_M_Bonham-Carter.jpg/52px-Air_Cdre_I_M_Bonham-Carter.jpg 7.4082s [15:52:30] - BereqURL /wikipedia/commons/thumb/5/55/Bohemiae_Moraviae_et_Silesiae_%28Merian%29_102.jpg/600px-Bohemiae_Moraviae_et_Silesiae_%28Merian%29_102.jpg [15:52:34] - BerespHeader X-Trans-Id: tx9c1765fdc39b48eeb8ba8-005900c218 [15:52:35] bblack: all other cache-codfw too though [15:52:36] - ObjHeader X-Trans-Id: tx9c1765fdc39b48eeb8ba8-005900c218 [15:52:38] ^ there's a 26s with a transid [15:52:41] - Timestamp BerespBody: 1493221939.138142 26.263658 26.189860 [15:53:02] ema: all other cache-codfw what? [15:53:05] ok I'll grep for that tx-id [15:53:24] bblack: all upload-codfw hosts had a 503 burst roughly at the same time [15:53:30] oh [15:53:45] did it build up the same on them all, or build up on one and then spike the rest at the end? [15:54:28] backend conns are ramping on e.g. 2022 as well, just not as bad [15:54:41] the distinction between cache host may be just which objects hash to them or whatever [15:55:16] bblack: Apr 26 15:51:52 ms-fe2006 proxy-server: 10.64.48.42 10.192.48.28 26/Apr/2017/15/51/52 GET /v1/AUTH_mw/wikipedia-commons-local-thumb.55/5/55/Bohemiae_Moraviae_et_Silesiae_%2528Merian%2529_102.jpg/600px-Bohemiae_Moraviae_et_Silesiae_%2528Merian%2529_102.jpg HTTP/1.0 200 - mw-ocg-bundler/1.3.1-git/unknown - - 59884 - tx9c1765fdc39b48eeb8ba8-005900c218 - 0.0727 - - 1493221912.875303984 1493221 [15:55:23] 912.947978973 [15:55:24] (also, other caches' 503s could be the result of cp2024 spamming so many conns that it locks out other caches from swift) [15:55:28] sorry long paste, anyways 0.07 according to swift-proxy [15:55:45] swift-proxy is indicating it took 0.07s to send the response body to varnish? 
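(Sketch of the transaction-id chase godog does here: take the X-Trans-Id that varnish logged and find swift's own record of the request. The log location is an assumption — proxy-server lines land in syslog on the frontends — and only the two ms-fe hosts named in this log are listed.)
```
TXID='tx9c1765fdc39b48eeb8ba8-005900c218'   # from the BerespHeader X-Trans-Id above
for fe in ms-fe2005.codfw.wmnet ms-fe2006.codfw.wmnet; do
  echo "== $fe"; ssh "$fe" "grep '$TXID' /var/log/syslog"
done
# Per godog below: the 0.0727-style field is swift's request time, and the
# last two fields are the start/end timestamps of the transaction.
```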
[15:56:13] I'm confirming but yeah I think that's total transfer time in the logs [15:56:23] the TS is nice too [15:56:25] bblack: yes the pattern is comparable on all other caches' so the conn limit from swift seems possible [15:56:34] 1493221912 vs 493221939 [15:56:41] 27s difference [15:57:05] ema: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=now-6h&to=now&var-server=cp2022&var-datasource=codfw%20prometheus%2Fops&panelId=16&fullscreen [15:57:09] ^ cp2022 backend conns [15:57:16] you can see mini-ramps that recover, not as bad as the bad cases [15:57:32] oh, duh, that's frontend not backend! [15:57:35] hmmm [15:58:39] bblack: yeah that's request time, the last two timestamps are start/end [15:58:43] namely 1493221912.875303984 1493221912.947978973 [15:59:24] godog: can you tell if we're hitting some swift limits? e.g. max connections, timeouts, ... [15:59:35] lots of timestamps in the 503 logs are either 7s or 95s, perhaps some swift or varnish timeout [16:00:26] ok [16:00:55] I have a meeting starting... [16:01:19] ok have fun! :) [16:01:39] ema: could be yeah, nothing obvious from the logs but I'll keep looking [16:03:22] godog: thanks. 7 or 95 seconds, any timeout set around those values? [16:13:25] ema: in some cases with requests >5s I see swift logging 499 which afaict is client has disconnected, some others are logged as 200s though [16:14:04] ok we've got timeout_idle set to 5s in varnish [16:14:17] Idle timeout for client connections. A connection is considered idle, until we have [16:14:19] yeah but timeout_idle is no-progress right? [16:14:20] received the full request headers. [16:14:54] oh no, timeout_idle is I guess "end of last request" -> "headers finished on new response" for reused keepalive conns? [16:15:37] to answer your question, I'm not seeing timeouts to 7 or 95 set in the code but it might be compounded with smaller timeouts [16:16:01] no, timeout_idle is for client side [16:16:29] oh, right [16:16:49] *request* headers [16:19:56] I'll take a look at the latencies in graphite too from ms-be / ms-fe [16:22:50] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching), 15User-Daniel: Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3214611 (10daniel) [16:24:44] hmmm [16:28:52] backend conns dropped on cp2024 (but are still high) [16:29:06] drop at 16:18 [16:29:26] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching), 15User-Daniel: RFC: Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3214641 (10daniel) [16:30:03] have to jump in a meeting [16:32:36] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching), 15User-Daniel: RFC: Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134380 (10daniel) [16:33:28] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching), 15User-Daniel: RFC: Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134380 (10daniel) [16:34:48] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching), 15User-Daniel: RFC: Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3214677 (10daniel) [16:35:19] ema: [16:35:20] 16:17 <+icinga-wm> RECOVERY - Host ms-be1021 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [16:35:23] ^ related? 
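(One way to do the check bblack describes just below — confirming the active VCL on cp2024 really references the codfw ms-fe backends and not eqiad. The varnishadm subcommands are real; the grep pattern and instance naming are assumptions.)
```
# On the cache host, against the backend instance (default name):
active=$(sudo varnishadm vcl.list | awk '/^active/ {print $NF}')
sudo varnishadm vcl.show -v "$active" | grep -n '\.host ='
# Add "-n frontend" to varnishadm for the frontend instance.
```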
[16:35:29] I know, it's eqiad [16:35:31] right [16:35:34] but still [16:36:56] I manually confirmed just in case, that cp2024 VCL is in fact referencing codfw ms-fe hostnames, not eqiad [16:40:15] it's hard to deal with this in terms of correlations [16:40:21] there's just so much going on with everything today [16:40:33] you can find a correlation for anything if you stretch a bit [16:45:37] it's dropping back off again [16:45:39] 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (10daniel) [16:45:42] I was just starting to look with tcptrack [16:46:22] right now, no new connections are showing stall-like behavior [16:46:30] wait, wrong heh [16:47:53] it's interesting to watch the pattern [16:48:25] right at this moment, no backend conns buildup on 2024 [16:49:13] tcptrack -dfpi eth0 src or dst 10.2.1.27 [16:49:16] similar for the frontends. on eg: cp2002, was probably building up trying to fetch stuff from cp2024 [16:49:17] (have to install it first) [16:49:51] the -d part makes it only track new conns since you started the tool (otherwise it tries to notice in-progress ones too which gets confusing) [16:50:03] there was definitely a pattern there near the end (when I first got that running) [16:50:14] where a lot of conns would idle out for long periods with 0 B/s xfer [17:29:54] bblack: to pinpoint the servers with issues: https://grafana.wikimedia.org/dashboard/db/varnish-swift-connections?orgId=1 [17:30:30] actually there's no real need to distinguish between thumbs and non-thumbs but whatever [17:31:30] cp2017 is the one I've restarted earlier on today [17:33:10] why does each have 3x entries? [17:34:13] awesome question, no clue [17:34:28] I bet VCL reloads [17:34:37] that whole hash-value thing on the backend defs [17:34:57] and you'd be right [17:35:19] I've added {{server}} to the legend [17:35:51] $ curl -s http://localhost:9131/metrics|grep backend_conn|grep ms_fe [17:35:51] varnish_backend_conn{backend="be_ms_fe_svc_codfw_wmnet",server="d1872e37-7081-4987-a102-1e6627325afd"} 2 [17:35:54] varnish_backend_conn{backend="be_ms_fe_thumbs_svc_codfw_wmnet",server="d1872e37-7081-4987-a102-1e6627325afd"} 10 [17:37:11] so yeah somehow we should group by instance and sum (godog, ideas?) [17:37:30] current query is varnish_backend_conn{layer="backend", backend="be_ms_fe_svc_codfw_wmnet"} [17:41:41] and I've added Failed Fetches underneath to see the actual 503 impact [17:43:02] ema: I'm at SoS, should be done shortly [17:43:09] godog: no rush [17:49:39] BTW I can't ssh into any of the upload-eqiad hosts, perhaps because of today's maintenance? [17:50:21] I assume [17:51:12] I can login through drac though [17:53:05] ema: yeah sum by (instance) (...) will do it in this case [17:53:33] what bblack said re: the server, it is the churn experienced on vcl reload https://phabricator.wikimedia.org/T150479 [17:54:20] 10Traffic, 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Prometheus varnish metric churn due to VCL reloads - https://phabricator.wikimedia.org/T150479#3215013 (10fgiunchedi) [17:55:56] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10ema) I currently can't SSH into any of the following hosts: cp1071, cp1072, cp1073 and cp1074. 
Presumably, this is due to today's main... [17:57:31] godog: thanks! [17:59:13] ema: interesting, other than SSH are they working as expected? [17:59:37] XioNoX: yes [18:01:34] well, they're not serving users [18:01:36] ema: can you point me to another host in the same vlan I can ssh to? [18:02:14] they're working as expected in the sense that no icinga checks fail, I can hit the varnish instances with curl and get valid responses and so forth [18:02:36] oh? that's odd [18:03:36] could it be iptables or something similar? [18:03:40] try adding "-4" to ssh [18:03:52] I think it's just ipv6 that's broke and ssh is using it by default [18:04:12] at least that's what I seem to observe about port 22 from codfw [18:04:15] (to cp1071) [18:04:25] nope [18:05:06] cr1-eqiad> ssh 10.64.48.105 source 10.64.48.2 doesn't work [18:05:16] is ther ean icinga check for ssh? [18:05:28] well ok sshg proxying issues [18:05:34] but it's still an ipv6 problem [18:05:40] bblack@bast1001:~$ telnet cp1071.eqiad.wmnet 22 [18:05:40] Trying 2620:0:861:107:10:64:48:105... [18:05:40] ^C [18:05:45] bblack@bast1001:~$ telnet -4 cp1071.eqiad.wmnet 22 [18:05:46] Trying 10.64.48.105... [18:05:46] Connected to cp1071.eqiad.wmnet. [18:05:46] Escape character is '^]'. [18:05:46] SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u3 [18:06:01] yeah, indeed, from the bastion I can reach it with -4 [18:07:02] curl also doesn't work if explicitly resolved to ipv6 [18:07:27] * connect to 2620:0:861:107:10:64:48:105 port 443 failed: Connection timed out [18:07:45] * Connected to upload.wikimedia.org (10.64.48.105) port 443 (#0) [18:08:34] the ip seems to be set correctly [18:08:36] 2: eth0: mtu 1500 state UP qlen 1000 [18:08:39] inet6 2620:0:861:107:10:64:48:106/64 scope global [18:08:42] valid_lft forever preferred_lft forever [18:08:43] (that's cp1072) [18:08:45] inet6 fe80::46a8:42ff:fe0a:6f66/64 scope link [18:08:49] valid_lft forever preferred_lft forever [18:09:08] elukey, your issue seems related to ^ [18:09:34] ah what a joyful day! [18:09:56] ema: what's the output of ip -6 route ? [18:10:06] local ::1 dev lo proto kernel metric 256 [18:10:06] 2620:0:861:107::/64 dev eth0 proto kernel metric 256 [18:10:06] unreachable 2620:0:861:ed1a::2:b dev lo proto kernel metric 256 error -101 [18:10:08] fe80::/64 dev eth0 proto kernel metric 256 [18:10:12] it's possible we're hitting a version of https://phabricator.wikimedia.org/T133387 [18:10:19] no default route on v6 [18:10:23] yeah it doesn't have a default [18:10:34] default via fe80::1 dev eth0 proto ra metric 1024 expires 595sec hoplimit 64 [18:10:40] ^ is what you should see (but that's on a codfw node) [18:10:46] some host have it, some hosts don't [18:10:56] not sure what's the common point yet [18:11:01] so yeah likely related to IPv6 RA, which is the multicast / snooping issue [18:11:07] default via fe80::1 dev eth0 proto ra metric 1024 expires 584sec hoplimit 64 [18:11:10] this is cp1050 ^ [18:11:31] yeah but cp1050 isn't in row D [18:11:47] disable igmp snooping if it's enable for the new row D switch? [18:11:49] for example conf1003, in row D is fine [18:12:00] bblack: yeah it's enabled [18:13:23] uh and now what's going on with cp2014? [18:14:20] I'm getting a password prompt while trying to ssh to cp1071 from bast1001, how do I tell it to use my ssh key? 
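(A hedged sketch of the round XioNoX and ema are doing by hand: check which of the affected row-D hosts still lack an IPv6 default route, run from a bastion and forcing IPv4 for ssh since v6 is exactly what's broken.)
```
for h in cp1071 cp1072 cp1073 cp1074; do
  printf '%s: ' "$h"
  ssh -4 "$h.eqiad.wmnet" 'ip -6 route show default'
done
# Healthy hosts print "default via fe80::1 dev eth0 proto ra ...";
# broken ones print nothing.
```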
[18:15:31] checked an1002: [18:15:34] 2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000 [18:15:35] Apr 26 18:09:14 cp2014 varnishd[1828]: Child (1842) died signal=9 [18:15:37] link/ether f0:1f:af:e8:bc:3f brd ff:ff:ff:ff:ff:ff [18:15:39] XioNoX: you can't really, we don't forward in a way that will work like that [18:15:40] inet 10.64.53.21/24 brd 10.64.53.255 scope global eth0 [18:15:42] valid_lft forever preferred_lft forever [18:15:45] inet6 fe80::f21f:afff:fee8:bc3f/64 scope link [18:15:47] valid_lft forever preferred_lft forever [18:16:24] ema: what's up with 2014? [18:16:39] Apr 26 18:09:14 cp2014 varnishd[1828]: Child (1842) died signal=9 [18:16:53] I've started varnish-frontend by hand just now [18:17:01] the frontend died? [18:17:05] yep [18:17:14] with no useful info in the logs [18:17:34] Apr 26 18:09:14 cp2014 varnishd[1828]: Child (1842) died signal=9 [18:17:34] Apr 26 18:09:14 cp2014 varnishd[1828]: File allocation error _.vsm.1828: No space left on device [18:17:46] why don't I see that? [18:17:57] Apr 26 18:09:14 cp2014 kernel: [1237285.015793] varnishd invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0-1, order=0, oom_score_adj=0 [18:18:05] I'm looking at syslog [18:18:15] the oom-killer was the start of it [18:18:46] (we should maybe tune oomkill on these nodes to avoid varnish/nginx, but whatever, unlikely anything else causes the problem anyways) [18:19:36] so in syslog we've got this: [18:19:36] Apr 26 18:09:14 cp2014 varnishd[1828]: Child (1842) died signal=9 [18:19:36] Apr 26 18:09:14 cp2014 varnishd[1828]: File allocation error _.vsm.1828: No space left on device [18:19:52] journalctl -u varnish-frontend, instead, only shows the child death [18:20:02] that's not very cool [18:20:50] those cached-mem dips we observed before, did they come from increase memory-used pressure? [18:21:37] bblack, ema, (sorry to interrup), but how can I ssh to one of thoes non working v6 hosts? ssh-ing from bast host asks me for a password [18:21:59] XioNoX: you can use the management interface: ssh root@cp1072.mgmt.eqiad.wmnet [18:22:07] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp2014&var-datasource=codfw%20prometheus%2Fops&from=now-1h&to=now&refresh=1m&panelId=41&fullscreen [18:22:12] what's the password? [18:22:15] ema: ^ transient storage for frontend [18:22:20] XioNoX: root pw in pwstore [18:22:30] XioNoX: in pwstore, the files are: management and root_password [18:22:49] ema: so transient storage spike for FE => oom => oom-kill [18:23:04] ha [18:23:40] what a day to be alive [18:23:44] I know we talked about this very thing in the past several days [18:23:50] what were we looking at, at the time? [18:24:09] the idea of using hfp on upload backends, we were worried about transient storage usage [18:24:17] right [18:24:29] any chance some recent VCL change could be causing a new spike of hfp for frontend? [18:25:07] and from drac, how do I get to ssh? [18:25:17] XioNoX: console com2 [18:25:28] (will get you serial console where root pw will work) [18:25:54] bblack: nothing comes to mind wrt VCL changes [18:26:40] so via greg-g pointer to -sfaf [18:26:43] err -staff [18:26:55] 18:21 < varnent> FYI - sounds like Matanya is about to block more ISP IPs in Africa [18:26:58] 18:24 < greg-g> varnent: context being illegal file uploading again? [18:27:01] 18:24 < greg-g> s/illegal/copyright infringing/ [18:27:03] 18:24 < varnent> That is my understanding. 
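(Not something run in this log, but a hedged way to spot the hot-object pattern bblack describes next — one large file hashed to a single backend and saturating it:)
```
# Top URLs currently being fetched from storage by this varnish backend:
varnishtop -b -i BereqURL
# A single runaway object floats to the top almost immediately.
```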
[18:27:15] in the past we've had incidents around this that spiked upload traffic to particular backend nodes, etc [18:27:36] the idea being someone uploads a pirated movie to commons and links it on facebook and a bunch of mobile clients in some 3rd world country all request it [18:27:47] and we hash it to a common backend, and it saturates network link, etc [18:27:56] (large video file) [18:31:22] I'm in total complexity overload on the varnish situation now [18:31:33] we need fewer moving parts [18:31:39] yes [18:32:26] bblack, ema, igmp-snooping disabled [18:32:29] ok so the newer varnish package, still only installed on all 10x upload-codfw + 1x text-esams (3033) ? [18:32:43] XioNoX: ssh fixed, thanks! [18:32:50] bblack: correct [18:32:52] I confirmed that cp1072 has the new route [18:32:55] yeah that fixes IPv6 in general [18:33:13] now the question is how do we get back to a sane world with igmp-snooping enabled and not screw up v6? :) [18:33:26] because without igmp-snooping, we flood multicast everywhere [18:33:39] (we have heavy multicast traffic for purges -> cpNNNN hosts) [18:35:02] so I don't think cp2014's transient-storage spike is externally-induced [18:35:24] it must have somehow been related to a backend's misbehavior? [18:35:53] transient usage spikes at 17:53, let's see what else happened [18:36:36] (perhaps the way all related things work out: if a backend spikes up a bunch of 503 internally-generated responses, they become hfps in the frontend as they pass through?) [18:36:46] there's a spike in received/eth0 [18:37:20] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3215135 (10ayounsi) This was due to T133387. Hosts that have an igmp-snooping membership don't receive IPv6 RA and thus don't have a default rout... [18:37:24] but no increase in client request rate [18:38:36] but there's indeed a spike in backend connections [18:38:39] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp2014&panelId=16&fullscreen [18:38:59] as in frontend initiated connections to local varnish-bes [18:41:24] bblack: why would a 503 become a hfps? [18:42:37] however, 503 generated by varnish do end up in transient storage [18:43:28] so yeah, if a bunch of varnish-frontend connections end up on a varnish-be with issues that would trigger the generation of a bunch of 503s by the frontend, possibly leading to the type of oom we've seen [18:44:57] bblack: the code version with fixed igmp-snooping is going to be realeased late may, so we would have to upgrade the switch stack when it's out [18:48:40] ok [18:48:52] I guess we can live with the flooding for now, we did in codfw for a long time already [18:49:18] ema: so like I said there's a lot of moving parts, but it seems like the most prudent thing to do at this point is revert the package updates [18:49:25] it's the biggest question-mark in all of this [18:49:48] and the current behavior is more mysterious and annoying than the original mailbox lag IMHO :) [18:49:55] agreed :) [18:50:32] one thing we could do is make the behavior conditional with a startup parameter [18:50:52] but yeah, +1 on reverting [18:51:19] yeah, it would be nice I guess to have controls on it that don't involve package up/down grades [18:51:33] maybe take the time now and make the patch have params and upgrade to that where we already upgraded? 
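(For the inventory question above — which hosts actually carry the patched varnish build — a hedged check; the host list is just the machines named in this log, not the full set of 10x upload-codfw + cp3033.)
```
for h in cp2014.codfw.wmnet cp2017.codfw.wmnet cp2022.codfw.wmnet \
         cp2024.codfw.wmnet cp3033.esams.wmnet; do
  printf '%s: ' "$h"; ssh "$h" 'dpkg -l varnish | tail -1'
done
```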
[18:51:50] 1 param for turning on RT for expiry thread, and another for lock inheritance there? [18:52:05] sounds great [18:52:24] both disabled on frontends [18:52:34] but I donno, it's getting late for you already [18:52:47] and I don't have muscle-memory for rebuilding varnish packages anymore heh [18:53:14] shit it's 9 already :) [18:53:25] so maybe just revert for now, and I'll do it and start working on restarts [18:53:35] revert the package install I mean, using the cached copy in /var/lib/dpkg/ [18:53:38] or whatever [18:54:05] right, /var/cache/apt/archives/ [18:54:14] OK I'll do the param work first thing tomorrow [18:55:12] ok awesome [18:56:12] see you :) [18:56:24] cya! [19:28:25] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3215305 (10ayounsi) Another issues was: > ## inactive: interfaces interface-range vlan-analytics1-d-eqiad activating the range solved the issue a... [19:33:17] bblack: other option for IGMP snooping is to statically set the v6 default route using puppet for at least the hosts using igmp-snooping [19:34:36] yeah Faidon and I over the past couple of years had a number of lengthy discussions about how to better handle our IPv6 (and v4 for that matter) wrt to all of this [19:34:45] how we puppetize it, how we handle it in the installer, and at runtime [19:34:52] it's complicated a bit [19:35:14] getting rid of the dependency on RA was part of that, I think [19:35:46] yeah, servers shouldn't depend on RA [19:36:16] I think one of the reasons we like RA is for failover of the (VRRP?) default gw addr quickly though? [19:36:26] I don't remember the details, but I thought it was hard to do that for ipv6 otherwise [19:37:09] there's a related ticket somewhere that I'm now trying to find [19:38:30] ah found them [19:39:04] https://phabricator.wikimedia.org/T102099 [19:39:25] thanks, will read [19:50:02] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3215394 (10ayounsi) Because of miss-communication, the move of server uplinks from asw-d to asw2-d didn't happen today. We are rescheduling it a... [21:28:22] o/ [21:28:31] I need some help with an outage for ORES. [21:28:39] 14:29 < halfak> Is there some way we can just direct all traffic to eqiad right now? [21:28:45] Currently most traffic is directed at our codfw node. [21:28:48] ^ yeah that [21:29:21] codfw is barfing on our last deployment, but eqiad is not [21:29:40] So I need to fall back to eqiad ASAP and figure out why the deploy to codfw failed. [21:43:03] i switched it [21:43:09] https://gerrit.wikimedia.org/r/#/c/350487/ [21:43:15] and running puppet on misc::cache via cumin [21:45:27] 10Traffic, 06Operations, 10Page-Previews, 06Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3215870 (10JKatzWMF) @Gilles I know we're going to chat about this next week, but in the interest of time, I'd like to separate out w... 
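(On XioNoX's suggestion a little earlier — statically setting the v6 default route instead of depending on RA — the per-host operation amounts to the line below; in practice it would be puppetized into the interface configuration rather than run by hand. fe80::1 matches the RA'd gateway seen on the healthy hosts above.)
```
ip -6 route replace default via fe80::1 dev eth0
```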
[22:51:09] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3216190 (10DStrine) [22:56:12] mutante: that method of switching causes a 5xx spike (but luckily this is a low-traffic service) [22:56:39] the duration of the 5xx spike should only be during the cumin run, basically [22:59:30] (sorry it's not well documented outside of wikitech. But basically to switch from only-eqiad to only-codfw (or vice-versa) without a 5xx spike during the deployment process, you have to do it in two stages, switching to active:active first in the middle)
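(A hedged way to watch for the 5xx blip being discussed while such a switch converges, left running on a cache_misc frontend; "-n frontend" and the format string are assumptions.)
```
varnishncsa -n frontend -q 'RespStatus >= 500' \
  -F '%t %{Host}i "%r" %s %{Varnish:handling}x'
```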