Fork me on GitHub

Wikimedia IRC logs browser - #wikimedia-serviceops

Filter:
Start date
End date

Displaying 140 items:

2020-06-15 07:09:03 <jayme> akosiaris: hnowlan: was it really just the memory limit that needed to be increased for changeprop?
2020-06-15 07:09:40 <wikibugs> 'serviceops, ''Operations, ''ops-codfw, ''Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (''ayounsi) >>! In T252185#6221836, @akosiaris wrote: > kubernetes2007 has been reimage successfully,...'
2020-06-15 09:13:57 <hnowlan> jayme: might have been CPU alongside memory, but memory was the fix that finally resolved things
2020-06-15 09:17:11 <jayme> hnowlan: hmm..this seems rather strange doesn't it? If I'm not totally wrong, chageprop now even uses less memory than before the change
2020-06-15 09:22:53 <jayme> Oh, well. Thats wrong actually...this is just the "some changeprops are different than others" pitfall :)
2020-06-15 09:23:12 <hnowlan> jayme: yeah, it's quite strange. I was wondering whether there was one spike/burst of objects that needed to get into memory that was clogging processing but the graphs don't show that
2020-06-15 09:24:18 <hnowlan> the biggest pod in terms of memory is still using ~20% less since the increase
2020-06-15 09:25:26 <hnowlan> gc spikes went *way* down after akosiaris bumped CPU so it looks like it was both resource increases
2020-06-15 09:25:32 <wikibugs> 'serviceops, ''Operations, ''Wikidata: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (''Addshore) Is T253673 relevant here?'
2020-06-15 09:27:15 <jayme> ah, okay. So it could have been the GC not beeing able to free up memory
2020-06-15 09:45:58 <akosiaris> just figured out that the down arrow in https://grafana.wikimedia.org/d/000300/change-propagation?panelId=30&fullscreen&orgId=1&refresh=30s is a hardcoded down arrow
2020-06-15 09:46:15 <akosiaris> I expected it to have some meaning but it's a hardcoded prefix
2020-06-15 09:48:31 <apergos> lol
2020-06-15 09:53:43 <addshore> o/ It looks like we might be having some problems with the termbox ssr service for wikidata
2020-06-15 09:53:45 <addshore> https://logstash.wikimedia.org/goto/808228af4133ed6e857a168de0c6cd4c
2020-06-15 09:53:49 <addshore> https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?orgId=1&refresh=1m&from=now-7d&to=now
2020-06-15 09:54:20 <addshore> mediawiki seems to be getting elevated errors trying to connect to the service, but the service itself might not even be getting the connections? (I see no increase in requests)
2020-06-15 09:54:32 <addshore> We made a ticket stub of https://phabricator.wikimedia.org/T255410
2020-06-15 09:55:23 <addshore> The error content is "upstream connect error or disconnect/reset before headers. reset reason: connection termination", I also see headers.server as envoy
2020-06-15 09:55:55 <addshore> any ideas would be greatly appreciated! :D
2020-06-15 09:56:25 <akosiaris> anyway, from the looks of it, I agreed that both resources needed a bump. From https://grafana.wikimedia.org/d/000300/change-propagation?panelId=56&fullscreen&orgId=1&refresh=30s&from=now-24h&to=now I get that the 1 CPU limit is routinely reached, even if in spikes during the day. And we couldn't fix that until we also bump the memory limits. But interestingly now the max() is pretty low
2020-06-15 09:56:26 <addshore> Right now it doesn't appear to have a big (if any) user facing impact, as we have a fallback for when SSR fails,
2020-06-15 09:56:56 <akosiaris> jayme: ^ termbox issues it seems
2020-06-15 09:58:15 <jayme> looking
2020-06-15 09:58:22 <akosiaris> seems like the TLS proxying in between is terminating connections while they are being ready to be reused
2020-06-15 09:59:22 <akosiaris> 503 UC indeed in mw's envoy logs
2020-06-15 10:03:04 <addshore> it looks like this was happening a bit prior to this increase in logs too, but the increase seems to be mainly this case
2020-06-15 10:06:42 <akosiaris> yeah this is probably https://phabricator.wikimedia.org/T254581
2020-06-15 10:06:53 <akosiaris> it seems to time with the deployment of that
2020-06-15 10:07:41 <jayme> yep...although I don't see anything specific in the logs of termbox envoy
2020-06-15 10:09:12 <jayme> could it bee that this is a bad upstream_timeout akosiaris?
2020-06-15 10:09:43 <akosiaris> jayme: yeah, my thoughts exactly, I am just chasing down what the problem was last time we encountered this.
2020-06-15 10:10:30 <jayme> It's set to 180s in envoy but I guess "this is good or bad" depends on what the upstream systems thinks is good or bad
2020-06-15 10:11:17 <addshore> tarrow: ^^ thoughts? :P
2020-06-15 10:11:22 <wikibugs> 'serviceops, ''Operations, ''Release-Engineering-Team-TODO, ''Continuous-Integration-Config, and 2 others: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (''jbond) p:''Triage''Medium'
2020-06-15 10:14:24 <akosiaris> jayme: I have a nagging feeling this is a race. Some connection gets reused on the mw envoy side, right on the brink of being closed by the envoy sidecar of termbox
2020-06-15 10:16:45 <akosiaris> jayme: ok, found the ticket https://phabricator.wikimedia.org/T249705
2020-06-15 10:17:52 <akosiaris> coding a patch
2020-06-15 10:19:17 <jayme> hmm...should we maybe habe a generic retry logic at least for UR and UC?
2020-06-15 10:20:32 <jayme> Seems wrong to me to have to figure that out for every service...but maybe that's hiding a different root cause, though (like the race you mentioned)
2020-06-15 10:23:00 <akosiaris> jayme: At least for all the services that do this envoy <-> envoy dance I 'd say.
2020-06-15 10:23:29 <akosiaris> which is not all of them, but enough to warrant somehow avoiding this duplication (if I am right and it's this)
2020-06-15 10:24:06 <akosiaris> jayme: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/605554
2020-06-15 10:24:49 <akosiaris> hmmm
2020-06-15 10:24:58 <akosiaris> I have a keepalive of 4.5 and a timeout of 2
2020-06-15 10:24:59 <akosiaris> that's wrong
2020-06-15 10:25:01 <akosiaris> fixing
2020-06-15 10:27:33 <jayme> akosiaris: why would you have ti set keepalive?
2020-06-15 10:27:38 <jayme> *to
2020-06-15 10:30:28 <akosiaris> jayme: mostly because of this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/581023/
2020-06-15 10:31:01 <akosiaris> that is, sent the keepalive before the sidecar envoy decides that it's keepalive timeout has been reached
2020-06-15 10:31:18 <akosiaris> I am assuming btw, that value is still 5s btw, that assumption might no longer be true
2020-06-15 10:31:47 <jayme> checking
2020-06-15 10:32:39 <jayme> I'ts still 5s by default
2020-06-15 10:32:45 <jayme> (and for termbox specifically)
2020-06-15 10:33:50 <hnowlan> akosiaris: changed those ambiguous icons on the changeprop dashboard, they're pretty misleading
2020-06-15 10:34:03 <akosiaris> hnowlan: thanks, <3
2020-06-15 10:35:11 <akosiaris> jayme: should I merge?
2020-06-15 10:35:40 <akosiaris> and we 'll see what happens in ~30m or so (the amount puppet takes at most to run across all the mws)
2020-06-15 10:35:50 <jayme> +1
2020-06-15 10:36:33 <akosiaris> ok
2020-06-15 10:37:04 <jayme> the timeout seems pretty high as well but as termbox has 180s ... *shrug*
2020-06-15 10:37:48 <akosiaris> indeed
2020-06-15 10:38:04 <akosiaris> I don't think it will play much a part in this one, but we 'll see I guess
2020-06-15 10:38:25 <akosiaris> addshore: change merged, it will take ~30m to be fully deployed. Let's reassess then
2020-06-15 10:46:22 <addshore> great!
2020-06-15 10:47:41 <addshore> o/ micgro42 !
2020-06-15 10:47:52 <micgro42> :)
2020-06-15 10:48:20 <addshore> micgro42: chat log https://wm-bot.wmflabs.org/browser/index.php?start=06%2F15%2F2020&end=06%2F15%2F2020&display=%23wikimedia-serviceops
2020-06-15 10:48:36 <micgro42> already reading it
2020-06-15 10:48:39 <addshore> :D
2020-06-15 10:48:48 <addshore> will add it to the ticket for ease of remembering
2020-06-15 10:48:55 <micgro42> the link in the channel topic actually works for once^^
2020-06-15 10:49:36 <addshore> I wonder if it should link to the pretty chat log browser though https://wm-bot.wmflabs.org/browser/index.php?display=%23wikimedia-serviceops ;)
2020-06-15 11:15:28 <micgro42> Got better, but a new (much rarer) error appeared that wasn't there previously: https://logstash.wikimedia.org/goto/101f99c4fd32fb13f3f506be771c99ab
2020-06-15 11:17:09 <micgro42> > Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server
2020-06-15 11:17:09 <micgro42> > Request failed with status 0. Usually this means network failure or timeout
2020-06-15 11:18:02 <micgro42> rate seems to be about 1/10th of the previous error
2020-06-15 11:22:32 <jayme> That is no longer logged at envoy level. So it must be something between mw and local envoy
2020-06-15 11:23:09 <akosiaris> yeah, I see we have solved the upstream "upstream request timeout" issue. That's good
2020-06-15 11:23:32 <akosiaris> what's that status 0 thing?
2020-06-15 11:23:40 <akosiaris> HTTP status? can't be, right?
2020-06-15 11:24:00 <addshore> https status 0 can indacte things like timeout quite often
2020-06-15 11:24:09 <addshore> as in, the client decides that the status is 0
2020-06-15 11:24:46 <akosiaris> it's probably just the return code of whatever function was called I guess?
2020-06-15 11:24:56 <akosiaris> cause there is no HTTP status 0 of course.
2020-06-15 11:42:52 <jayme> Maybe the retry in envoy takes to long for the client sometimes and it gives up?
2020-06-15 11:54:10 <akosiaris> timing up on the mw side?
2020-06-15 11:54:45 <jayme> yeah
2020-06-15 11:54:47 <akosiaris> there's nothing on envoy logs as you said... weird
2020-06-15 11:55:06 <akosiaris> could be. Not sure what the timeout is there
2020-06-15 11:56:42 <jayme> https://grafana.wikimedia.org/d/yOu7f3iMk/jayme-envoy-telemetry?orgId=1&from=now-4h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=termbox
2020-06-15 11:59:53 <akosiaris> spikes of 4s p99?
2020-06-15 11:59:54 <akosiaris> sigh
2020-06-15 12:02:10 <jayme> Maybe a retry of 2 is to much (I now see that your commit msg sais 1 btw. :)
2020-06-15 12:05:44 <jayme> termbox has a quite "high" baseline of 5xx itself. Those request (ending up 5xx in the termbox backend in any case) will not be fiered 3 times, no? (1 + 2x retry)
2020-06-15 12:05:55 <jayme> https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?orgId=1&from=1590438094987&to=1592222649826
2020-06-15 12:20:14 <addshore> is mw1276 anything special (canary?) how can I see?
2020-06-15 12:22:53 <jayme> It's a canary - cumin sais
2020-06-15 12:24:54 <volans> addshore: git grep in puppet usually helps, in this case:
2020-06-15 12:24:55 <volans> conftool-data/node/eqiad.yaml: mw1276.eqiad.wmnet: [apache2,nginx,canary]
2020-06-15 12:25:04 <addshore> ack! ty!
2020-06-15 12:25:16 <volans> but the output might vary, so dont' take it as a done method
2020-06-15 12:26:21 <jayme> Seems like termbox sees in increase in timeouts itself https://logstash-next.wikimedia.org/goto/7d6eef16277df852623a30e2a8714e63
2020-06-15 12:27:15 <jayme> do we know what this 3000ms timeout is?
2020-06-15 12:29:20 <akosiaris> termbox has such a timeout
2020-06-15 12:29:33 <akosiaris> MEDIAWIKI_REQUEST_TIMEOUT
2020-06-15 12:29:46 <akosiaris> so, mediawiki timing out responding to mediawiki after 3s?
2020-06-15 12:30:40 <jayme> scratches head
2020-06-15 12:30:54 <akosiaris> jayme: about the 1+2x, that would be my reading of it. 2 retries + 1 original request. But only on 5xx
2020-06-15 12:31:13 <akosiaris> but don't take my word for granted. I am still on envoy level 1
2020-06-15 12:32:52 <jayme> akosiaris: Yeah. Assuming termbox returns a 5xx in the 3000ms timeout case, this means 9000ms wait untill the client-ms would get a response from it's envoy
2020-06-15 12:35:16 <jayme> But I don't get the timeout part in termbox... (I'm termbox level 0.15 I guess :). It get's requested from mediawiki (wikidata) and then does request to (a) other mediawiki instance(s) itself (and those time out)?
2020-06-15 12:36:12 <addshore> ^^ user req to mw -> requests to termbox service -> request to another mw service
2020-06-15 12:36:30 <akosiaris> yes, that's the architecture currently. It's not meant to be the end game however
2020-06-15 12:36:37 <addshore> https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Architecture second diagram
2020-06-15 12:36:46 <akosiaris> at least that was my understanding back when then based on that ^
2020-06-15 12:48:41 <jayme> I see, thanks. So it's requests to http://api-ro.discovery.wmnet/w/index.php that time out (if termbos loggs the requests it emmits in the request field)
2020-06-15 12:50:06 <tarrow> yes!
2020-06-15 12:50:08 <jayme> There also is a DEFAULT_REQUEST_TIMEOUT of 3000 in https://github.com/wikimedia/wikibase-termbox/blob/d5c39550f52d8bd4b5f5c0add36c148003f3afe5/src/common/constants.ts#L6 maybe we should use that as source of truth to figure envoy timeouts
2020-06-15 12:50:28 <jayme> giggles ...termboss
2020-06-15 12:52:01 <tarrow> To explain what is going on: if the service gives up at 3s then the caller fallsback to a clientside implementation; the 3s was picked becuase we figured past that point it's better to show the user something rather than making them wait
2020-06-15 12:54:49 <jayme> just to be sure: service == termbox and caller == MW in that case, right?
2020-06-15 13:00:19 <jayme> (the # of requests timing out seem to constantly increase in the last weeks btw. https://logstash-next.wikimedia.org/goto/05f9a05c7763f95efccc09aed9a32e9e)
2020-06-15 13:04:08 <wikibugs> 'serviceops, ''Operations, ''Patch-For-Review, ''User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (''elukey) A little note about the last patch merged. There are two main memcached parameters that can influence the distribution of the slab classes' chunk size:...'
2020-06-15 13:04:38 <elukey> rzl: o/ if you have time let me know if I am crazy or not --^
2020-06-15 14:36:59 <rzl> elukey: seems reasonable to me -- I'd really be interested in knowing the actual distribution of key sizes too, but I don't know if that's easy
2020-06-15 14:50:00 <rzl> uh, value size, not key size, but you know what I mean :)
2020-06-15 14:51:58 <elukey> rzl: ah yes, so some is in https://grafana.wikimedia.org/d/000000317/memcache-slabs
2020-06-15 14:52:16 <elukey> but it tells us only how many keys per slab size etc..
2020-06-15 14:52:22 <elukey> i tried to add some breakdowns
2020-06-15 14:55:22 <rzl> oh that does help though
2020-06-15 15:18:59 <elukey> the current gutter breakdown is not super meaningful due to key capped to 10 min TTL etc..
2020-06-15 15:19:08 <elukey> but I think it is helpful for the main shards
2020-06-15 15:19:24 <elukey> we'll have to squeeze 180 slabs into 64
2020-06-15 19:40:33 <wikibugs> 'serviceops, ''Operations, ''Release Pipeline, ''Release-Engineering-Team-TODO, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (''jeena) I created a helm test and got the integration and functional tests running in minikube. Do...'
2020-06-15 19:53:49 <wikibugs> 'serviceops, ''Operations, ''Release Pipeline, ''Release-Engineering-Team-TODO, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (''thcipriani) >>! In T224041#6225405, @jeena wrote: > I created a helm test and got the integration...'
2020-06-15 19:54:31 <wikibugs> 'serviceops, ''Operations, ''Thumbor, ''Patch-For-Review, ''Performance-Team (Radar): Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (''Gilles)'
2020-06-15 19:58:42 <wikibugs> 'serviceops, ''Operations, ''Release Pipeline, ''Release-Engineering-Team-TODO, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (''jeena) >>! In T224041#6225462, @thcipriani wrote: > PipelineLib could clean up an image from the r...'
2020-06-15 21:25:53 <wikibugs> 'serviceops, ''Operations, ''Performance-Team, ''Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (''Krinkle) a:''aaron'

This page is generated from SQL logs, you can also download static txt files from here