[01:52:12] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (ayounsi) Swapped A3 and A4
[02:06:20] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (ayounsi) Swapped A1 with A2 and A4 with A5
[02:14:25] Traffic, Operations, ops-eqsin: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (ayounsi) a:ayounsi→RobH
[02:14:46] Traffic, Operations, ops-eqsin: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (ayounsi) a:ayounsi→RobH
[03:01:48] jynus: it's easier to ping/email me, I monitor irc channels but don't read it all
[03:02:06] bblack: let me know if you want to do the LVS work today
[04:25:06] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[04:33:24] netops, Operations, ops-eqiad, ops-eqsin, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi)
[07:06:08] XioNoX: do you need any help with those lvs?
[07:11:01] vgutierrez: in https://phabricator.wikimedia.org/T213121
[07:11:13] jump to "Add BGP sessions between cr2-eqsin and LVS"
[07:11:29] I have a CR ready (needs review) and router changes
[07:21:23] CR looks good
[07:22:04] I'm guessing that before merging that and restarting pybal on lvs5003 you'll change the static route to let traffic flow through lvs5003 anyway, right?
[07:28:38] hmm nevermind
[07:29:04] lvs5003 plays along as the backup lvs for 5001 and 5002
[07:32:30] vgutierrez: want to do it now or wait for next week?
[07:32:34] knowing that it's Friday
[07:32:55] yeah, taking into account that it's Friday we should wait till Monday if you don't mind
[07:35:13] no problem at all
[08:21:44] Acme-chief, Patch-For-Review: Rename the Certcentral project - https://phabricator.wikimedia.org/T207389 (Vgutierrez) acme-chief certificates have been deployed successfully: `vgutierrez@cumin1001:~$ sudo cumin 'R:File = /etc/acmecerts' 'md5sum /etc/centralcerts/* | sed s/centralcerts/acmecerts/ | md5sum...
[08:22:10] Acme-chief, Patch-For-Review: Rename the Certcentral project - https://phabricator.wikimedia.org/T207389 (Vgutierrez)
[08:33:18] Acme-chief, Patch-For-Review: Rename the Certcentral project - https://phabricator.wikimedia.org/T207389 (Vgutierrez)
[10:21:47] Acme-chief, Patch-For-Review: Rename the Certcentral project to Acme-chief - https://phabricator.wikimedia.org/T207389 (Aklapper)
[11:55:33] Acme-chief, Traffic, Operations, Patch-For-Review: Upgrade acme-chief to run in debian buster - https://phabricator.wikimedia.org/T215925 (Vgutierrez) Open→Resolved acme-chief is running successfully in debian buster :)
[12:33:30] vgutierrez, did https://phabricator.wikimedia.org/T207371 get sorted?
[12:33:57] Krenair: actually that's been fixed by moving to buster
[12:34:05] hah
[12:34:06] okay
[12:34:19] basically python3-josepy isn't part of stretch, it's only available in stretch-updates
[12:34:27] so I think that's the culprit
[12:34:37] will close
[12:34:39] thx
[12:34:52] I commented on some suspicious http traffic patterns on -operations, doesn't seem emergency-worthy
[12:35:07] but in case someone wants to dig further (I won't)
[12:35:09] Acme-chief: Investigate weird packaging warning - https://phabricator.wikimedia.org/T207371 (Krenair) Open→Resolved a:Vgutierrez Sounds like @Vgutierrez may have fixed this in {T215925}, though not an exact dupe? Closing anyway
[12:35:33] probably just a bot or something
[12:36:39] vgutierrez, is https://phabricator.wikimedia.org/T209980 done?
[12:36:55] it looks pretty done but I don't know the full scope of the original issue
[12:38:34] right, that's fixed
[12:39:55] Acme-chief, Patch-For-Review: certcentral crashes on network errors - https://phabricator.wikimedia.org/T209980 (Krenair) Open→Resolved a:Vgutierrez
[12:41:33] Acme-chief: Validate DNS-01 challenges against every DNS server - https://phabricator.wikimedia.org/T207461 (Krenair) Dupe of {T203396} ? Or not quite? Definitely related either way.
[12:41:37] * Krenair is tidying backlog
[12:42:36] Acme-chief, Traffic, Operations, Patch-For-Review: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737 (Krenair) Is this resolved now? Also, duplicate of {T204997} ?
[12:43:58] Acme-chief, Traffic, Operations, Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez)
[12:44:04] Acme-chief, Traffic, Operations, Patch-For-Review: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737 (Vgutierrez) Open→Resolved yes, this has been included as part of the latest release
[12:44:17] * vgutierrez lunch
[12:44:36] Acme-chief, Patch-For-Review: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478 (Krenair) @Vgutierrez: is this done?
[12:45:38] Acme-chief, Traffic, Operations: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (Krenair) Has this been solved by {T213737}?
[13:12:15] Acme-chief, Patch-For-Review: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478 (Vgutierrez) Open→Resolved yeah, it was included as part of the 0.3 release
[13:16:53] Acme-chief, Traffic, Operations: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (Vgutierrez) Open→Resolved a:Vgutierrez Indeed.
[13:17:45] well that cut it down by a fair number of tasks :)
[13:18:23] closed 6, moved 1 to goals/tracking
[13:19:37] of the remaining 13, 1 is already done by hand for wikimedia, 1 is non-prod, 1 is sort of research that we aren't likely to use in wikimedia prod
[13:19:46] so... 10 to go for wm prod
[13:21:01] but that includes things like "improve logging" and "improve how we check expiries"
[13:21:13] vgutierrez, oh, what about https://phabricator.wikimedia.org/T203396 vs. https://phabricator.wikimedia.org/T207461 ?
[13:22:29] one is talking about http-01 and the other one about dns-01, so I guess we could keep both of them
[13:22:44] or create a task and two subtasks, one for each challenge type
[14:40:49] o/
[14:41:14] do y'all know if we have a limit on the number of open connections somewhere for misc services?
[14:41:15] https://phabricator.wikimedia.org/T215013
[14:41:29] a few weeks ago eventstreams hit a MAX_CONCURRENT_STREAMS problem
[14:41:31] and i don't know why
[14:48:01] that task is rather vague...
[14:48:22] regarding the 502 itself I mean
[14:51:04] ottomata: I'm no expert, but quickly checking the VCLs I didn't find a 502 generated by varnish
[14:51:26] of course it could be triggered by the nginx in front of it
[14:51:56] hm
[14:52:05] cause assuming a recent version of curl it will speak http2 by default and that's terminated by our nginx layer
[14:52:08] yeah the task is vague, it happened while I was off somewhere (post all hands I think)
[14:52:45] vgutierrez: looking in puppet, what roles/nodes are relevant?
[14:52:48] curl 7.38.0
[14:53:10] so... http://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_max_concurrent_streams
[14:53:20] oh hoho
[14:54:13] reading more...
[14:54:14] https://youtu.be/wyKQe_i9yyo?t=47
[14:55:10] ottomata: so.. it looks like our nginx layer doesn't set that parameter
[14:55:26] right, but it defaults to 128
[14:55:35] ottomata: still.. that would be 128 stream connections per cp node
[14:55:52] but, am confused, i think that it would be 128 http2 streams within a single connection
[14:56:04] which would be weird for jynus to see when he curled
[14:56:15] hmm that's right
[14:56:17] yeah, I didn't create 128 of that, only one
[14:56:43] maybe it was a red herring
[14:57:12] but it was certainly corroborated by several people complaining at the time
[14:57:29] which I couldn't reproduce at first when I tried from the cluster
[14:57:34] but I could when I did it from my pc
[14:57:36] at the time of the outage, the # of connections maxed at around 750
[14:57:41] aye
[14:58:09] so.. from "I got a 502 error, potentially created by the following message: (MAX_CONCURRENT_STREAMS == 128)"
[14:58:17] I don't know what to understand there
[14:58:34] I got that on the headers
[14:58:37] my curl is reporting Connection state changed (MAX_CONCURRENT_STREAMS updated)! on every http2 connection established
[14:59:07] jynus: yeah, cause the nginx is letting curl know that it can use up to 128 concurrent streams
[14:59:21] ok, so it was some other error then
[14:59:44] ?
[15:01:04] I think so
[15:01:17] let me capture the http2/tls traffic and confirm it...
[15:01:35] * Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
[15:01:39] yes, I can see it
[15:02:01] so the error was real, just that header was unrelated, maybe
[15:02:07] but it was the only content produced
[15:03:06] ok, interesting, so probably not related to max streams then...
[15:03:28] i don't know what the connection limit on this service would be, i'd expect more than ~750
[15:03:35] probably related to the load/concurrency, however
[15:03:49] yeah it could be, but then why the ok checks from the backend?
[15:04:03] just intermittent? depends on when they were run?
[15:04:13] is there a routing difference from curling the same url from the outside and the inside?
[15:04:29] I could not reproduce it from the inside of the cluster
[15:04:35] oh that i don't know. jynus when you ran it inside you curled from stream.wikimedia.org?
[15:04:50] not from e.g. scb1001.eqiad.wmnet:xxxx etc.
[15:04:51] ?
[15:04:53] or from
[15:04:54] I believe so
[15:04:58] eventstreams.svc.eqiad.wmnet ?
[15:05:09] but honestly, I don't remember right now
[15:05:12] aye
[15:05:42] I know something
[15:05:47] all checks were green
[15:05:55] so those were working
[15:06:35] those that you get on icinga by searching eventstreams
[15:06:53] aye, the checks are for aliveness checks on the individual service nodes
[15:07:09] for lvs pooling, etc.
[15:07:17] very strange
[15:07:27] anyway i'm adding a check for stream.wikimedia.org aliveness
[15:07:32] hopefully it will catch this if it happens again
[15:08:05] either that
[15:08:11] or a grafana alert on concurrency
[15:08:28] "to check if concurrency is back to previous pathological levels"
[15:08:50] I don't think MAX_CONCURRENT_STREAMS is an error at all
[15:09:10] isn't that just an informational message as part of negotiating an http2 session?
[15:09:15] I think so
[15:09:25] and it's reporting the max number of concurrent streams.. 128 in our case
[15:10:18] so if you see a lot of client concurrency, and you see it stuck on that part, it may have caught your attention, too
[15:10:22] :-D
[15:10:35] I am not an http2 expert
[15:10:47] I just saw things down :-)
[15:11:05] it's just doing multiple concurrent requests in the same http2 session
[15:11:12] multiplexing
[15:11:40] no, it was stuck/giving application errors
[15:11:49] I think we are talking past each other
[15:12:02] what I am saying is that messages about MAX_CONCURRENT_STREAMS are orthogonal to that
[15:12:18] sure, and I was saying things were down at the time
[15:12:32] if you saw messages about the server sending PROTOCOL_ERROR or REFUSED_STREAM that would be different
[15:13:01] I saw no stream working :-)
[15:13:36] I think different people are talking about different parts of the stack failing :)
[15:13:44] oh, I am not saying what it is
[15:13:45] also the word 'stream' is being overloaded
[15:14:04] bblack: agreed
[15:14:04] yes, there are streams in your streams
[15:14:07] but i think we all agree, whatever the problem was is not related to http2 streams.
[15:14:09] haha
[15:14:15] h2 ends in the nginx in the cp servers... the 502 is triggered by something else IMHO
[15:14:16] I just meant "curl https://stream.wikimedia.org/v2/stream/recentchange" is not doing what it was supposed to do
[15:14:33] someone look at it, I was on clinic duty :-)
[15:15:12] right, so that curl would be creating a fresh http/2 connection, unaffected by any stream concurrency issue inside any older ones, at that level.
[15:15:15] I made the mistake of not saving the full header
[15:15:34] no, but I meant the application had strange concurrency
[15:15:34] but then there's other things that can go wrong after that
[15:15:57] (probably not significant from the traffic perspective)
[15:16:14] for the traffic edge, only nginx speaks http/2. out the back of nginx towards varnish, that gets broken down into little http/1.1 conns.
[15:16:19] https://grafana.wikimedia.org/d/000000336/eventstreams?panelId=1&fullscreen&orgId=1&from=1548329421995&to=1549526896457&var-stream=All&var-topic=All&var-scb_host=All
[15:16:22] ^this
[15:16:44] and eventstreams IIRC is one where varnish's behavior is basically to pipe through in the common case, so those in turn will create that many connections to the eventstreams app layer.
[15:17:14] hmmm from that grafana dashboard point of view.. 1 http2 connection could translate to up to 128 connections, right?
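A side note on the curl output discussed above: a minimal sketch of how to see for yourself that the MAX_CONCURRENT_STREAMS line is just curl reporting the stream limit the server advertises when an http/2 connection is set up, not an error. It targets the public stream.wikimedia.org endpoint mentioned in the conversation; the exact wording of the message varies by curl version, and --max-time is only there to cut off the never-ending event stream after a few seconds.

  # Open one fresh http/2 connection and grep curl's verbose output for the
  # advertised stream limit and the response status line.
  curl -sv --http2 --max-time 5 -o /dev/null \
      https://stream.wikimedia.org/v2/stream/recentchange 2>&1 \
    | grep -E 'MAX_CONCURRENT_STREAMS|< HTTP'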
[15:17:19] I was just trying to give context to the ticket, as I wrote most of it
[15:17:33] so if you have, say, 5 clients connected to https://stream.wikimedia.org/v2/... and they each open 128 concurrent streams inside their singular http/2 connection, the eventstreams backend service is going to see 640 parallel conns
[15:18:11] which probably in some case we're seeing here exceeds some limit, possibly the varnish one we have limiting connections from any given backend varnish to the eventstreams app layer?
[15:18:47] bblack: do we have clients that do that?
[15:18:53] note I saw the issue on my own curl
[15:19:03] I don't know, I assume so since people are talking about seeing the MAX_CONCURRENT_STREAMS thing
[15:19:27] I don't think it is a thing, I just noted it at the time in case it was related
[15:20:13] the thing that we may need help with is understanding why it failed outside of the network but not inside
[15:20:55] not as a bad thing, just to set up proper monitoring
[15:21:29] i think MAX_CONCURRENT_STREAMS is a red herring as jynus said, but ya. i highly doubt anyone is opening up multiple http2 streams to this service
[15:21:46] bblack: what's the limit in varnish?
[15:21:49] to a backend?
[15:22:00] ottomata: on that stuff:
[15:22:02] sorry, from a backend to the app layer
[15:22:35] hieradata/role/common/cache/text.yaml has the config, and the default there is 1K (cache::app_def_be_opts::max_connections)
[15:22:51] but then for eventstreams we have a specific tighter config saying:
[15:22:53] eventstreams:
[15:22:53]   be_opts:
[15:22:53]     port: 8092
[15:22:53]     max_connections: 25 # https://phabricator.wikimedia.org/T196553
[15:22:58] oh ho ho!
[15:23:12] there are only 6 backends, 6 scb nodes
[15:23:20] which refs a ticket you wrote about how we need to keep that number small heh
[15:23:27] he he
[15:23:33] well well well
[15:23:45] ottomata: your worst enemy, ottomata showed up!
[15:23:54] that past guy
[15:23:56] what a jerk
[15:23:57] the past me
[15:24:06] sometimes he treats me well, but other times...
[15:24:18] I think it is an ok measure, just slap some monitoring on that concurrency, move on
[15:25:03] just keep in mind there can be ~8 varnishds facing your eventstream service in each DC (codfw/eqiad)
[15:25:15] and technically the max_conns there is per-varnish
[15:26:21] but also, requests land at those 8 varnishes based on hashing the request URI, so if they're mostly all the same request URI (https://eventstreams.wikimedia.org/v2/stream/recentchange), then the whole limit for a DC will basically be that per-varnish limit, as all such requests funnel through one varnish backend based on the URI chashing.
[15:26:29] k, and the app layer that varnish sees is the lvs endpoint?
[15:26:32] right?
[15:26:44] ah
[15:27:05] yes, varnish reaches eventstreams via LVS
[15:27:07] interesting, so only one varnish backend will handle that url?
[15:27:26] well, maybe
[15:27:40] we have around 50 steadily connected clients to that url
[15:28:11] bblack: varnish can't cache this, can we disable url hashing/balancing for this service?
[15:28:36] I think we have some generic VCL in place that does that actually (if pass-mode, don't do chashing, randomize)
[15:28:55] but I'd have to dig quite a bit within the complexity of our VCL to find out if it really ends up applying in this case or not
[15:29:09] ah ok
[15:29:28] i assume that is probably true, since we have more than 25 connected recentchange clients
[15:30:15] wow reading myself in the past, what a great ticket. not even that long ago...
[21:14:35] Traffic, Operations, Parsoid, RESTBase, and 5 others: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 (mobrovac) We have logged some VE requests on the RB side, and it turns out we cannot rely on the session ID to be present in the request. VE does send it, but o...
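A back-of-the-envelope sketch of the connection fan-out discussed in the eventstreams thread above (15:17–15:30). The figures are the ones quoted in the conversation: 128 streams per http/2 connection (the nginx http2_max_concurrent_streams default), max_connections: 25 per varnishd for eventstreams, and ~8 backend varnishes per DC. The client count of 5 is the example used in the discussion, and whether chashing or the pass-mode randomization actually applies to this URL was left open above, so both budgets are shown.

  # Worst-case fan-out from edge clients to the eventstreams app layer,
  # using the numbers quoted in the discussion above (not measurements).
  clients=5                  # example client count from the 15:17:33 message
  streams_per_conn=128       # nginx http2_max_concurrent_streams default
  max_conns_per_varnish=25   # eventstreams be_opts max_connections (T196553)
  backend_varnishes=8        # ~8 varnishds facing the service per DC

  echo "worst-case parallel requests: $((clients * streams_per_conn))"                       # 640
  echo "per-DC budget if chashing funnels one URI to one varnish: ${max_conns_per_varnish}"  # 25
  echo "per-DC budget if requests spread across all varnishes: $((backend_varnishes * max_conns_per_varnish))"  # 200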