[11:45:35] <_joe_>	 jbond42: all restbase hosts fail on the compiler with error
[11:45:37] <_joe_>	 Error: Failed to compile catalog for node restbase2018.codfw.wmnet: source sequence is illegal/malformed utf-8
[11:45:49] <_joe_>	 I kinda-remember having solved this problem in the past
[11:46:03] <_joe_>	 maybe you have a fresher memory than me
[11:47:35] <jbond42>	 _joe_: yes the problem is that the puppet master tool no longer support pson the fix is this https://phabricator.wikimedia.org/T236481
[11:48:19] <_joe_>	 you have a patch to do it on the compiler right?
[11:48:25] <jbond42>	 https://phabricator.wikimedia.org/T238053
[11:48:27] <_joe_>	 maybe we should start doing so
[11:48:31] <jbond42>	 yes i do
[11:48:52] <jbond42>	 ill do some testing today and push it on monday
[11:49:15] <_joe_>	 ack!
[11:49:30] <_joe_>	 hnowlan: I suggest we wait monday to better look at that patch
[11:49:34] <_joe_>	 :P
[11:49:42] <jbond42>	 however i dont think rich_data is supported in puppet 4.8 there would be a conflict with wmcs if we enable it and they are still on 4.8
[11:51:01] <jbond42>	 however we should be able to get the compiler working regardless of that
[11:57:35] <moritzm>	 we could still make rich_data opt-in for roles not suitable for Cloud VPS, the restbase/cassandra classes are very prod-centric
[11:59:20] <_joe_>	 yeah but they're needed in wmcs
[12:17:33] <jbond42>	 its worth noting that this problem only affects `puppet master --compile` i.e. pcc, it dose not affect normal puppet runs as the master falls back to pson.  https://tickets.puppetlabs.com/browse/PUP-8187.  also the patch i have to PCC is to allo users to test using rich_data and therefore supporting the new binary format.
[12:18:11] <jbond42>	 we could try to patch the puppet code with something like the following https://phabricator.wikimedia.org/P10344  (completly untested) which might get the compiler working with the current code
[12:28:45] <XioNoX>	 jbond42: seems like https://gerrit.wikimedia.org/r/c/operations/puppet/+/562852 broke puppet on several cloud hosts
[12:29:11] <jbond42>	 got a cr up now just waiting for ci https://gerrit.wikimedia.org/r/c/operations/puppet/+/570880
[12:36:32] <jbond42>	 thanks XioNoX should be fixed now
[12:44:29] <XioNoX>	 cool, just had my wake up look at icinga
[12:49:50] <jbond42>	 _joe_: i applied the patch above and was able to run you pcc from the cli https://puppet-compiler.wmflabs.org/compiler1003/1/
[15:02:03] <jynus>	 I am on the -SRE channel!
[15:02:11] <XioNoX>	 yo
[15:02:13] <cdanis>	 no, I am
[15:02:16] <_joe_>	 so, I think in terms of rate-limiting
[15:02:18] <jynus>	 we both are!
[15:02:25] <_joe_>	 oh come on, let's be serious please
[15:02:32] <jynus>	 :-)
[15:02:39] <jynus>	 trying to kick of activityª!
[15:03:35] <_joe_>	 In terms of rate-limiting, it's resonable if we start to limit any request that causes a reparse. Like any ip out of a whitelist should not be able to do more than N calls per minute with wrapoutputclass=wiki-article
[15:03:51] <_joe_>	 we can for now limit on that condition and some other data we know of known abusers
[15:04:00] <_joe_>	 but we need to start such a conversation now
[15:04:07] <cdanis>	 _joe_: do you imagine this being enforced on the traffic side or the appserver side?
[15:04:23] <_joe_>	 cdanis: ideally it should be two different levels of limiting
[15:04:34] <_joe_>	 but probably if I have to start, on the appservers side
[15:06:32] <godog>	 would the rate limit be at apache level or mediawiki or sth else on the appservers ?
[15:06:46] <_joe_>	 yes
[15:06:48] <_joe_>	 :P
[15:06:55] <_joe_>	 so mediawiki has its own mechanisms
[15:07:32] <effie>	 what is more efficient though
[15:07:33] <marostegui>	 We already saw the issue with ratelimiting some months ago with that scrapper from that university and trying to scrap all recentchanges from enwiki or whatever it was
[15:07:34] <_joe_>	 to be clear, what Amir1 did (blanking those templates)
[15:07:57] <_joe_>	 probably solved the sources of outages from this specific scraper
[15:08:27] <_joe_>	 so on the mediawiki sitde I opened https://phabricator.wikimedia.org/T243803
[15:08:39] <_joe_>	 a week ago already :)
[15:09:06] <_joe_>	 but I think we need a more general idea of how rate-limiting should be applied.
[15:09:23] <marostegui>	 _joe_:that's the one we faced during all hands?
[15:09:27] <marostegui>	 yes, that is it
[15:09:38] <Amir1>	 do we know with confidence that the scrapper stopped or the changes fixed the issue
[15:09:43] <_joe_>	 First of all, we should have some cost function (I think the best one is wall clock time, but we'll go back to that) that gives us the cost of each request
[15:09:49] <_joe_>	 Amir1: 100%
[15:09:55] <Amir1>	 \o/
[15:09:56] <_joe_>	 I tried the urls that timed out before
[15:10:04] <_joe_>	 now they take ~ 1 s to reparse
[15:10:10] <_joe_>	 down from 180 :P
[15:10:18] <Amir1>	 >180
[15:10:23] <Amir1>	    >180
[15:10:48] <Amir1>	 Maybe we should have some sort of limit for switches or pages being transcluded in core
[15:10:56] <apergos>	 +1 for that
[15:12:03] <apergos>	 the 60 second timelimit in any case is a problem when the same page or set of pages is accessed rapidly, so figuring out how to proactively limit would be lovely
[15:12:27] <apergos>	 otherwise it's whack-a-mole
[15:12:30] <_joe_>	 so my idea is something like
[15:12:51] <godog>	 unless there are objections I'm wrapping up the incident gdoc, we're confident it was an aggressive scraper + expensive templates (as opposed to babel ?)
[15:12:54] <_joe_>	 the http server (or the TLS terminator, whatever stays between php-fpm and the users)
[15:13:04] <_joe_>	 godog: 99.99%
[15:13:14] <godog>	 ok thanks _joe_
[15:14:15] <_joe_>	 should know from some specific header the cost of the request on the backend, and add it to the global bucket for the current user (more on this later)
[15:14:30] <_joe_>	 and start returning 429 when a user exausts their cost-credits per minute
[15:14:48] <_joe_>	 so, back to the cost function
[15:15:09] <_joe_>	 I think wall clock is the best representation of "how much cpu time we overall spent on this request" in most cases
[15:15:17] <cdanis>	 I think it's the single best metric yes
[15:15:18] <_joe_>	 including databases
[15:15:23] <cdanis>	 there is a lot of other things I would like to track
[15:15:24] <_joe_>	 and all the backend services
[15:15:29] <cdanis>	 but wallclock is fine to start with for limiting
[15:15:47] <_joe_>	 for knowing the user, I would say we start signing the session cookies from MediaWiki
[15:16:08] <_joe_>	 so that we can know from the signature if they're valid, and which class of users they come from
[15:16:20] <_joe_>	 (this was all discussed in some form with bblack btw)
[15:16:29] <apergos>	 what does 'class of users' mean here?
[15:16:31] <_joe_>	 that shoud correspond to a quota
[15:16:38] <_joe_>	 apergos: say one bot from toollabs
[15:16:44] <_joe_>	 or one of our internal services
[15:16:54] <_joe_>	 they're in a specific class with larger limits
[15:17:07] <apergos>	 gotcha
[15:17:14] <effie>	 cookies are not always used though
[15:17:29] <apergos>	 a reader might not have a session cookie
[15:17:39] <effie>	 an api client might not either
[15:17:43] <_joe_>	 apergos: the reader will have a quota for their ip
[15:17:50] <cdanis>	 so they get a stricter per-IP ratelimit
[15:17:52] <_joe_>	 api clients will start to log in
[15:17:53] <cdanis>	 if they aren't authed
[15:17:54] <_joe_>	 :)
[15:18:19] <_joe_>	 or you know, use tokens
[15:18:25] <_joe_>	 anyways, I need to write this all down
[15:18:27] <apergos>	 ok so; classes of users with per-class limits; session cookies for authed users; per-ip limits for anon users
[15:18:28] <effie>	 are we discussing a long long term solution?
[15:18:34] <Amir1>	 Can someone make a ticket for that? I need to cook "lunch"
[15:18:37] <effie>	 I was under the impression we were looking into a short one
[15:18:43] <_joe_>	 Amir1: you need to study!
[15:18:56] <_joe_>	 a short what effie?
[15:18:57] <Amir1>	 oh that one as well :D
[15:19:17] <effie>	 _joe_: a short term solution
[15:20:11] <apergos>	 we need also to put on the todo list: single "bad" page with heavy traffic from lots of users (just to not lose track of that issue, though it's separate from today's)
[15:22:45] <effie>	 should we add those ideas in a task
[15:23:26] <effie>	 and discuss them further with anyone interested/needed to implement ?
[15:24:03] <_joe_>	 effie: for what?
[15:24:05] <_joe_>	 we did find a short-term solution for zhwiki
[15:24:07] <_joe_>	 remove the bad templates, ask the community to use the best practices
[15:24:11] <_joe_>	 effie: once they're fully formed, they'll become an RfC I think
[15:24:13] <_joe_>	 in a couple weeks, hopefully
[15:24:38] <effie>	 _joe_: ok I see
[15:26:22] <_joe_>	 effie: short term I only see making it more user-friendly to ban ips
[15:26:28] <cdanis>	 yes
[15:26:31] <cdanis>	 KR for this Q
[15:29:03] <effie>	 isnt blocked_nets for that purpose?
[15:30:04] <_joe_>	 I said more user friendly
[15:49:42] <Amir1>	 a quick update, the community brought back the templates in zhwiki but using wikidata, It was this simple: https://zh.wikipedia.org/w/index.php?title=Template%3AFrench_Commune_Data%2FElevationMax&type=revision&diff=58035442&oldid=58035007
[15:51:34] <jynus>	 english link: https://zh.wikipedia.org/w/index.php?title=Template%3AFrench_Commune_Data%2FElevationMax&type=revision&diff=58035442&oldid=58035007&uselang=en
[16:16:03] <_joe_>	 Amir1: lemme try it
[16:16:28] <_joe_>	 parses in less than 2 seconds
[16:20:28] <Amir1>	 for me this is slightly slow but less than 10 sec. I guess there are some more leftovers but not big enough to cause trouble
[16:20:28] <Amir1>	 https://zh.wikipedia.org/wiki/%E6%B2%99%E7%89%B9%E5%B0%94
[16:23:00] <addshore>	 o/
[16:23:14] <tarrow>	 o/
[16:23:18] <apergos>	 addshore: twentyafterfour hoo resuming the deploy/not deploy discussion. can someone give a short summary: here is what is broken, here's how long, here's why it's important to be fixed now
[16:23:19] <apergos>	 ?
[16:23:25] <addshore>	 unrelated to what Amir1 is doing, I would like to backport https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/570911/ for the reasons listed in https://phabricator.wikimedia.org/T244529#5859927
[16:25:14] <mark>	 i think the big question is
[16:25:40] <mark>	 do we understand why we had downtime during the .18 rollout yesterday, and again with the change made today?
[16:25:47] <addshore>	 Yes
[16:25:54] <tarrow>	 apergos: Due to a historic typo in wmf-config a change to the wikibase defaults that rolled with the train is being used in production. This means that Wikibase clients are now trying to read from tables that are only partially populated
[16:25:54] <mark>	 could you summarize that?
[16:26:22] <addshore>	 mark: unexpectedly all group2 sites started reading from the new terms storage tables, as this default is riding with .18 and the mediawiki-config setting has a tyop so it was not overridden there
[16:26:43] <apergos>	 this (reading from partially populated tables) started when? yesterday with .18 rollout?
[16:26:54] <tarrow>	 this means that group0 and 1 sites are currently showing bad / missing data
[16:26:55] <addshore>	 apergos: yes, it has been riding the train
[16:27:39] <addshore>	 apergos: so group0 getting bad data from tuesday, group1 on wednesday etc
[16:27:53] <apergos>	 and this is true for wikidata and all client wikis, right? in group 0, 1
[16:28:00] <tarrow>	 this is why we probably ought to act today
[16:28:09] <addshore>	 apergos: yes all clients specificaly
[16:28:28] <mark>	 so why did that cause everything to blow up twice?
[16:28:39] <addshore>	 the wikibase repo side of things doesnt have a typo, so is getting and has been getting the settings in mediawiki-config since mid last year
[16:29:32] <addshore>	 mark: so right now group2 is not reading from the new terms tables at all for items, both the train rolling forward and fixing the config typo make group2 read from these tables for some number of items
[16:29:35] <apergos>	 what do you think about the idea of rolling back group 0, 1 to .16 and leaving it that way for the weekend, as a more 'known state'?
[16:29:53] <addshore>	 apergos: Also fine with that if you all think that is fine
[16:30:13] <mark>	 so it's not reading from there, why did that cause problems for everything?
[16:30:40] <addshore>	 well all of the reads end up going to s8, i guess everything backed up a bit because of that alone
[16:30:43] <twentyafterfour>	 +1 to rolling all back to wmf.16
[16:30:45] <mutante>	 +1 to apergos'
[16:31:11] <apergos>	 do we have any patches in .18 that were like the babel patch etc that arent in 16?
[16:31:47] <apergos>	 addshore:  Amir1 you would be the ones to know, right?
[16:31:52] <addshore>	 mark: the typo fix in the config meant that all clients started reading from the new terms tables for 10,000,000 items, which was not the case on group2 already as that was only on .16
[16:32:04] <addshore>	 apergos: the Babel timeout is on both .16 and .18
[16:32:07] * addshore thinks about other things
[16:32:29] <mark>	 so i'm reading the incident report, https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki
[16:32:43] <mark>	 it's still very meager on details on what the secondary effects were, "root cause unknown"
[16:32:52] <mark>	 we don't know why the canaries didn't catch it
[16:33:15] <mark>	 there's nothing about all reads ending up on s8 in there and analyzing that
[16:33:16] <mutante>	 though on the first .18 deploy gone bad we had alerts/outage for restbase/varnish/apache but not s8 db?
[16:33:21] <mark>	 so I don't know why we'd be confident that this will go well now
[16:33:57] <addshore>	 I havn't dug into the events around the train quite yet, but this config will definitely have come with .18
[16:33:59] <cdanis>	 mark: there's been a pattern multiple times in the past few months.  1) change is rolled out where the failures only happen at full scale, 2) canary checks pass because they look at a handful of seconds on a handful of servers, 3) change fails at scale, 4) the revert fails the canary checks
[16:34:26] <mark>	 yes i'll believe that
[16:34:36] <cdanis>	 (the use-TLS-call-for-eventgate change was another example)
[16:34:49] <twentyafterfour>	 cdanis: yeah that's a tough pattern to catch in general but we should be able to do a better job
[16:34:59] <addshore>	 During the config change we just tried to do which caused the mini outage, all I could see was a few timeouts in terms of the mediawiki impact
[16:34:59] <cdanis>	 twentyafterfour: yeah for sure
[16:35:28] <cdanis>	 there's another interfering problem which is a class of outage that consumes all the PHP worker threads
[16:35:30] <mark>	 addshore: yes, so... it sounds like we don't really understand that yet
[16:35:30] <twentyafterfour>	 addshore: if this was the same issue as yesterday I'd expect massive timeouts all over the place but only after a full rollout
[16:35:42] <addshore>	 twentyafterfour: this was from earlier https://logstash.wikimedia.org/goto/c9affc21521d0b19a2ff129c918496d9
[16:35:58] <cdanis>	 AIUI the canary checks are pretty short in duration, so they only catch errors that fail-fast
[16:36:15] <addshore>	 twentyafterfour: so the train would switch to reading from the new term store for ALL items, the config change we mand only read from the new term store for 10,000,000
[16:36:21] <addshore>	 ^^ thats the key difference
[16:36:30] <addshore>	 and probably the reason there were slightly less timeouts, but still timeouts
[16:37:06] <twentyafterfour>	 addshore: that's much smaller than the scale of the problems yesterday but plausible yes
[16:37:12] <mark>	 things were fully down earlier today, as far as I can see
[16:37:59] <mutante>	 actionables to make scap reverting faster on https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki but it is missing one for the "why did canary not catch it" (mentioned in the "what went poorly" section though)
[16:38:02] <addshore>	 which earlier today?
[16:38:34] <mark>	 with the change hoo merged
[16:38:39] <mark>	 i don't fully understand how these relate, yet
[16:38:48] <twentyafterfour>	 mutante: https://phabricator.wikimedia.org/T183999 is a pre-existing task which is related to the canary checks
[16:39:03] <mutante>	 twentyafterfour: ah, thanks
[16:39:54] <twentyafterfour>	 mutante: it doesn't exactly address the situation at hand though, I guess it's just tangentally related
[16:39:55] <apergos>	 a quick look at patches from addshore and ladsgroup in gerrit shows no fixes in 18 that are not in 16 so I feel better about the rollback
[16:40:08] <tarrow>	 mark: The hoo config change and the train for to 18 both changed the same setting which changed where terms were read from
[16:41:00] <mark>	 marostegui or jynus still here by chance?
[16:41:03] <addshore>	 mark: one (the hoo cfg change) changed it to the value actually defined in mw-config (10,000,000) and the other (train) changed it to the default in wikibase.git which is infinite
[16:41:05] <jynus>	 yes
[16:41:08] <marostegui>	 yep
[16:41:21] <mark>	 can you confirm the theory that the full outage was caused by everything getting blocked on s8?
[16:41:32] <tarrow>	 of course there could also be another different problem with the train rolling too
[16:41:32] <jynus>	 the last one, or the one yesterday?
[16:41:40] <mark>	 both maybe
[16:41:47] <marostegui>	 I cannot confirm that, I did have to kill logs of queries earlier today
[16:41:51] <jynus>	 because the one yesterday I cocluded I saw no evidence of that
[16:42:12] <marostegui>	 I didn't see slow queries but lots of connections
[16:42:25] <twentyafterfour>	 addshore: so the config value in question affects the size of a db query's limit clause, essentially?
[16:42:29] <bblack>	 it's a tough question, cause and effect there
[16:42:40] <addshore>	 twentyafterfour: a bit more complicated than that
[16:42:47] <jynus>	 I agree with that (lots of connections) but that also happens on any other stall
[16:42:58] <addshore>	 twentyafterfour: setting of 0, which is in wmf.16 means only wb_terms will be read, the new storage wont be at all
[16:43:04] <marostegui>	 and the only thing I saw yesterday was a slow query in commons
[16:43:06] <jynus>	 e.g. the parser issue a few days ago also caused many open conenctions
[16:43:17] <mark>	 ok
[16:43:20] <addshore>	 twentyafterfour: setting of infinite (.18 default) means only the new storage will be read, not wb_terms at all
[16:43:23] <mark>	 so we don't really seem to know why we caused a full site outage twice
[16:43:24] <jynus>	 marostegui: agree with that, but also discarded it as cause
[16:43:33] <marostegui>	 sure
[16:43:34] <addshore>	 twentyafterfour: anything between will mean both are read, some tables for some items and others for others
[16:43:57] <jynus>	 of course, please take any evaluation with a grain of sault (e.g. an outage may hide metrics)
[16:43:58] <twentyafterfour>	 mark: that's what bothers me, we don't have a definite root cause it's all just conjecture
[16:44:02] <marostegui>	 mark: Today I saw no evidence of s8 (wikidata) being down, just lots of connections
[16:44:05] <bblack>	 many connections open would probably imply that from php's perspective, it needed more connections to handle the query rate vs response time, basically
[16:44:10] <bblack>	 (expanding the pool)
[16:44:12] <marostegui>	 yeah
[16:44:21] <apergos>	 we're pretty sure about today's second outage I think; that the config setting exposed this underlying issue. perhaps not exactly sure how the underlying issue broke production
[16:44:31] <jynus>	 bblack: actually, more connections are caused last time because php connections doesn't end
[16:44:51] <jynus>	 and they don't close it as soon as they are finished with mysql
[16:44:58] <jynus>	 in hopes of it being reused
[16:45:09] <jynus>	 so it just marks "stall", but I cannot say where
[16:45:19] <jynus>	 thread stall/thread long execution
[16:45:35] <bblack>	 who marks it "stall"?
[16:45:42] <Amir1>	 addshore: we eventually want to set this to infinity, does it mean the new term store is not ready for all of the reads performance-wise?
[16:45:59] <addshore>	 Amir1: I think thats a question for not today :)
[16:46:15] <jynus>	 bblack: the lack of mysql connections being reused
[16:46:29] <jynus>	 means there is unprocessed backlog
[16:46:31] <Amir1>	 addshore: noted
[16:46:40] <marostegui>	 s8 slave that serves main and api: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1104&var-port=9104&from=now-12h&to=now as it can be seen there were no significant issues apart from the spike of connections, but the qps and rows read remained the same pretty much
[16:47:23] <addshore>	 https://grafana.wikimedia.org/d/000000316/memcache?orgId=1
[16:47:23] <jynus>	 bblack: here threads == php threads, we know there are many concurrent of those, but we cannot say what they were doing
[16:47:27] <addshore>	 looks like it isnt the db
[16:47:29] <addshore>	 its memcached
[16:48:00] <bblack>	 well
[16:48:20] <twentyafterfour>	 fwiw, I couldn't find hardly any php logs for the outage yesterday, but the few I could find seemed memcached related
[16:48:22] <addshore>	 or thats another side affect
[16:48:37] <bblack>	 if I look at that db1104 graph - a lot of things happened around the same time
[16:49:32] <addshore>	 for https://tendril.wikimedia.org/host/view/db1104.eqiad.wmnet/3306 during the config change today I see a pile of connection problems
[16:50:15] <bblack>	 query write stats dropped, qps dropped, innodb pages read spikes up, innodb read ops and bytes goes up,  etc, etc
[16:50:18] <twentyafterfour>	 so yesterday there were a lot of memcached timeouts like Memcached error for key "WANCache:t:dewiki:gadgets-definition:9:2" on server "127.0.0.1:11213": A TIMEOUT OCCURRED
[16:50:20] <marostegui>	 addshore: Correct, and that's why I had to kill them, but they were not slow
[16:50:24] <bblack>	 it's not just new connections and nothing else
[16:50:53] <jynus>	 bblack: on outage, everything gets lower
[16:50:55] <marostegui>	 bblack: queries dropped because everything was overloaded, but the DB wasn't
[16:50:55] <bblack>	 disk i/o throughput at the bottom shows a big ramp up in bytes read too
[16:51:21] <jynus>	 bblack: big???
[16:51:28] <bblack>	 if the queries were just dropping off because php was busy dying in other ways, sure, but there's increased read activity and such at the same time in innodb
[16:52:07] <bblack>	 maybe I'm looking at the recovery spike?
[16:52:54] <twentyafterfour>	 yesterday I also saw a lot of errors like this in logstash: Timed out waiting for replication to reach 171966574-171966574-2221092918,171978786-171978786-788337263
[16:53:04] <addshore>	 marostegui: interesting, not slow queries, but issues with the connections?
[16:53:08] <bblack>	 this 20 minute window seems around the conns spike:
[16:53:10] <bblack>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1104&var-port=9104&from=1581085810933&to=1581087008350
[16:53:18] <twentyafterfour>	 that's from Wikimedia\Rdbms\DatabaseMysqlBase::masterPosWait
[16:53:37] <bblack>	 the various i/o graphs all look like a bunch of read traffic, while write drops off
[16:53:41] <marostegui>	 twentyafterfour: Yeah, that's "often" happening, and it tends to be a false positive. db1104 was definitely not lagging behind
[16:53:46] <bblack>	 disk latency goes up, etc
[16:54:21] <bblack>	 while qps is going down
[16:54:34] <marostegui>	 The query killer didn't kill anything as slow or overload
[16:54:37] <jynus>	 I don't see latency being affected
[16:54:38] <twentyafterfour>	 so the problem that addshore is proposing would look like that wouldn't it? lots more read traffic and writes would suffer?
[16:54:42] <bblack>	 it feels like the kind of thing where a few giant queries are causing tons of read i/o and delays for everything else
[16:54:55] <jynus>	 "tons of read i/o" I disagree
[16:55:01] <bblack>	 jynus: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1104&var-port=9104&from=1581085810933&to=1581087008350&fullscreen&panelId=32
[16:55:12] <marostegui>	 bblack: I didn't see those, and neither did the query killer. Maybe they finished before 60 seconds (which is when the query killer kicks in)
[16:55:14] <cdanis>	 you would also expect writes to go down assuming the appservers were too busy getting hung up on doing reads, or just in general, stuck executing other things, right?
[16:55:31] <jynus>	 that latency is withing normal parameters
[16:55:32] <addshore>	 twentyafterfour: so we should remember that this is currently on group1, so whatever we saw in the spikes in theory we should be seeing there right now
[16:55:38] <marostegui>	 I also checked tendril and show no slow queries
[16:55:42] <jynus>	 it hits several seconds many times
[16:55:45] <marostegui>	 (or running at the time)
[16:55:50] <bblack>	 the graph I just linked, the reported disk read latency goes from ~60ms to ~250ms
[16:56:15] <twentyafterfour>	 addshore: when I deployed wmf.18 yesterday everything went down in a matter of a few seconds
[16:56:27] <bblack>	 at the same time, disk bytes read is spiking, so it makes sense
[16:56:32] <jynus>	 bblack: we have on those servers 1s of disk io latency under normal ciscunstances
[16:56:35] <marostegui>	 bblack: that's when the spike of connections happened, but again, those connections doesn't seem to be slow queries or anything
[16:56:42] <addshore>	 twentyafterfour: indeed, but i guess group2 gets a lot more traffic than group1
[16:56:48] <twentyafterfour>	 that's a pretty drastic change if we already had it deployed to group1 I'd expect to see some signs of it somewhere
[16:56:49] <jynus>	 "normal" as in "loaded, but not worring"
[16:56:57] <twentyafterfour>	 group2 gets about 10x more traffic
[16:57:00] <twentyafterfour>	 I think
[16:57:13] <addshore>	 twentyafterfour: also the train rolling forward is group2 reading from all items, the config change was only or 1/8th of items
[16:57:18] <bblack>	 I'm just combatting the idea that it was just open idle connections doing nothing due to a php-level stall
[16:57:36] <marostegui>	 I have gotten a bit lost now, what's the point of doing the release now? what would that fix?
[16:57:45] <bblack>	 there was a spike of connections, and with them came a spike of read traffic at every level in those graphs (innodb and disk i/o stats, etc), a spike of disk read latency, a drop in write traffic, etc
[16:57:46] <jynus>	 bblack: my point is that mysql are so sensible to load, that when there is extra latency, mw starts spitting errors 1milion times a minutes
[16:57:52] <addshore>	 twentyafterfour: the important bit is probably uncached page renders that use wikidata data more than anything else
[16:58:00] <jynus>	 and I don't see that happening
[16:58:38] <bblack>	 p99 goes over a full second too
[16:58:49] <bblack>	 on "monitoring queries latency"
[16:58:50] <jynus>	 e.g. if it takes a query 1 second to be processed, heartbeat will complain
[16:59:19] <addshore>	 marostegui: release? if you mean rollback / this backport, currently group0 and 1 are reading from the wrong tables when retrieving item labels descriptions and aliases and puting bad things in caches everywhere
[16:59:36] <bblack>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1104&var-port=9104&from=1581085810933&to=1581087008350
[16:59:46] <apergos>	 it's strictly about not serving up that bad rendered data any longer than we have to, yes?
[16:59:52] <twentyafterfour>	 marostegui: I think the the most important thing is getting to the bottom of it, that can happen on monday if need be, we just can't afford to keep causing outages to trial-and-error fixing it
[16:59:53] <apergos>	 or more and more of it, I should say
[17:00:00] <twentyafterfour>	 apergos: yes
[17:00:05] <bblack>	 ^ the patterns taken as a whole over those graphs in those time ranges, looks like a handful of really expensive read queries happened and slowed down everything else coming through there.
[17:00:10] <addshore>	 apergos: yes
[17:00:44] <addshore>	 apergos: and the 2 routes I see to that are reverting this 1 default setting in Wikibase, or roll the whole train back, either way the default gets reverted and we will be okay from our side
[17:00:45] <marostegui>	 This is how the server was looking like when I checked it (at the time of the incident) - I have recovered it from my terminal history: https://phabricator.wikimedia.org/P10348
[17:00:52] <twentyafterfour>	 bblack: that seems like a good assessment, I think
[17:01:01] <mark>	 so why do we think the backport you want to do does not cause the same issues a 3rd time?
[17:01:02] <marostegui>	 addshore: ok, thanks that helps to get myself on track again :)
[17:01:21] <marostegui>	 bblack, jynus , see my paste above
[17:01:36] <addshore>	 mark: yes It should not cause any issues, it should take us back to the pre .18 train state everywhere
[17:01:43] <jynus>	 that is a normal status, lots of connections but that doesn't impact latency
[17:01:50] <marostegui>	 jynus: that's my point :)
[17:02:04] <addshore>	 mark: or we can rollback the train, just from our side that seems larger, but we would be happy with either
[17:02:12] <mark>	 in other words, not read from the new tables at all?
[17:02:15] <jynus>	 if we have 64 running threads, that would be your stall
[17:02:19] <addshore>	 mark: yes
[17:02:22] <marostegui>	 which is not what I saw
[17:02:24] <marostegui>	 :)
[17:02:35] <mark>	 whereas the change hoo did earlier, caused _more_ reads from the new tables?
[17:02:36] <addshore>	 mark: both the backport or a rollback of train will result in clients not reading from the new tables at all
[17:02:36] <jynus>	 but they are on Sleep, which is the state they are after running a query but not closing connections
[17:03:02] <marostegui>	 And that's why we see the increase in connections
[17:03:09] <marostegui>	 And I was saying that I didn't see any long running query
[17:03:13] <jynus>	 again, the typical state we had in other outages for that is "php threads are slow/waiting for something"
[17:03:19] <addshore>	 mark: the change hoo did earlier increased reads for group2 from not reading, to reading up to item 10,000,000 and decreased reads from group0 and 1 from all items down the 10,000,000
[17:03:20] <twentyafterfour>	 one thread in that paste is "sending data": SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermInLangIdsResolver::selectTermsViaJoin  */
[17:03:37] <mark>	 i see
[17:03:44] <mark>	 and the net effect (group 2 is larger than 0/1) is an increase
[17:03:53] <apergos>	 figure the first 10 millon items are likely the most referenced too
[17:03:58] <apergos>	 on the client wikis
[17:03:59] <addshore>	 mark: roughly,
[17:04:17] <jynus>	 typical mysql stalls are things like "many selects" "many threads in waiting/locking state", etc
[17:04:18] <addshore>	 apergos: very probable
[17:04:27] <mark>	 i'm guessing
[17:04:34] <mark>	 and I want to hear other's opinions
[17:04:41] <mark>	 that making the minimal change to the pre-.18 state seems best
[17:05:00] <twentyafterfour>	 jynus: yes this wasn't a mysql stall, I think we can take your word for that
[17:05:05] <jynus>	 or connectiosn issues that would be shown on logstash, at least the leading ones untul full outage
[17:05:13] <bblack>	 https://phabricator.wikimedia.org/P10348#60136
[17:05:13] <mark>	 and that probably won't cause the issue again, but we should be ready to quickly revert immediately
[17:05:17] <mark>	 what do you all think?
[17:05:27] <bblack>	 ^ added to the paste, a version with the sleeps filtered out, to see what was left at the time doing something else
[17:05:27] <twentyafterfour>	 mark +1
[17:05:28] <jynus>	 so I don't mind, I just don't want people ofuscated on the wrong place
[17:05:37] <marostegui>	 bblack: thanks :)
[17:05:39] <jynus>	 we need debugging of the app side
[17:05:47] <jynus>	 to see where is the stall/blockage
[17:06:24] <mark>	 and then we debug all this further and try again next week
[17:06:29] <mark>	 when we understand it better
[17:06:36] <apergos>	 if it were a one line change I'd +1... given that it isn't, I'd prefer a clean rollback to a known good state
[17:06:42] <apergos>	 and again debug all further next week
[17:06:46] <addshore>	 apergos: it is a one line change
[17:06:55] <apergos>	 am I looking at the wrong patch?
[17:07:02] <apergos>	 please relink me
[17:07:16] <addshore>	 sorry, 2 line change, and a test change https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/570910/
[17:07:20] <addshore>	 and already merged on master too
[17:07:38] <addshore>	 as otherwise given the current state we would be blocking the train indefinitely
[17:07:40] <mark>	 i need to leave for a while in max 10 minutes btw
[17:08:00] <apergos>	 right
[17:08:41] <twentyafterfour>	 from my perspective, it's ok to block the train but not ok to ask a bunch of sre team to not have any time off this weekend
[17:09:00] <addshore>	 *agrees*
[17:09:25] <mark>	 i'm sorry we're being so careful, but we've had an absolutely torrid two weeks already
[17:09:37] <addshore>	 the backport or trainrollback doesn't really relate to the outages at all, it relates to the state of group0 and 1 and bad data being thrown around there, just this happens to also be related to the outages
[17:09:45] <tarrow>	 no problem; we've also had a fun week
[17:09:45] <addshore>	 mark: yup
[17:10:00] <twentyafterfour>	 mark: more than justified!
[17:10:03] <thcipriani>	 blocking the train indefinitely is bad, but blocking until Monday is not a problem.
[17:10:17] <twentyafterfour>	 +1
[17:10:36] <addshore>	 +1 my concern is more about the bad data than train blocking right now :)
[17:10:45] <apergos>	 I'm just looking at .18 to see changes that went in on wikibase since the https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/570910/
[17:10:45] <twentyafterfour>	 blocking the train until monday is practically the norm at this point
[17:11:03] <thcipriani>	 I can't tell if there's consensus around a proposal or not, but it seems when you find yourself in a hole you should quit digging. Is the option between changing config and rolling back?
[17:11:05] <apergos>	 try to make sure there's nothing we're overlooking
[17:11:14] <apergos>	 yes thcipriani that exactly
[17:11:17] <addshore>	 yes
[17:11:26] <thcipriani>	 I think I'd vote rolling back to a known safe state
[17:11:32] <twentyafterfour>	 and I think  both options are fairly equal actually
[17:11:33] <tarrow>	 thcipriani: backporting a default config in wikibase.git or rolling the train back
[17:11:38] <thcipriani>	 and coming at this with fresh eyes monday
[17:12:08] <twentyafterfour>	 yes
[17:12:29] <mark>	 anyone disagree with that plan?
[17:12:34] <jynus>	 bblack: see open connections in context, all wiki dbs had it https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=9&from=1581073929958&to=1581095529959&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
[17:12:35] <marostegui>	 fine with me
[17:12:39] <addshore>	 I could also try to make a mediawiki-config change, but that seems more risky
[17:13:11] <twentyafterfour>	 so, this is sort of a separate issue but I wonder if we should default to rolling back _all_ wikis when a group2 rollout kills prod.  I only rolled back group2 since group0/1 didn't have any noticeable issues. Should I have defaulted to roll back all?
[17:13:23] <apergos>	 can we hold that discussion (keep that thought)
[17:13:34] <apergos>	 until we have the decision and the deploy?
[17:13:42] <apergos>	 we're soooo close :-)
[17:13:44] <twentyafterfour>	 apergos: np
[17:13:48] <mark>	 let's do the rollback
[17:14:14] <twentyafterfour>	 yeah I think we have agreement on rollback. who wants to do the needful? Shall I roll all wikis to wmf.16?
[17:14:36] <apergos>	 I'll be around for the next hour paying attention, for this
[17:14:40] <mark>	 i need to leave now, but can be back in 30 minutes if I need to
[17:15:00] <twentyafterfour>	 thanks mark, hopefully you won't need to?
[17:15:05] * addshore will keep his eyes open
[17:15:06] <bblack>	 I'm around too
[17:15:07] <mutante>	 also prefer rollback to known stable state. i'll also have to go to an appointment in about half an hour and we have little coverage. i'll be around for now though
[17:15:31] <twentyafterfour>	 ok so I guess sooner is better to do the rollback right away
[17:15:43] <thcipriani>	 twentyafterfour: want to do the needful?
[17:15:53] * twentyafterfour prepares to scap
[17:15:57] <bblack>	 but yeah I'm not looking forward to this recurring all weekend, we're still recovering from all hands and a litany of incidents
[17:15:57] * addshore opens logstash etc
[17:16:17] <greg-g>	 thanks twentyafterfour
[17:17:29] <thcipriani>	 ^ thank you
[17:18:32] <XioNoX>	 around too but that's not really my area of expertise :)
[17:18:47] <mutante>	 switching to -operations
[17:37:45] <addshore>	 Thanks all! hate doing all of this at all, let alone on a Friday at 6pm+ for europeans
[18:11:14] <jynus>	 I been reading and I am not sure deployment is was the core cause, although that doesn't meant there are no code issues, specially on .18
[18:12:09] <jynus>	 there was a huge slowdown on mediawiki from 14:03 to 14:28
[18:12:25] <jynus>	 but that didn't cause a large outage
[18:13:16] <jynus>	 there was a minor slowdown 14:37 - 14:46, which impacted trafic way more
[18:14:32] <jynus>	 the slowdown was mostly on the api
[18:14:50] <jynus>	 the second affecting everything
[18:15:26] <jynus>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=POST&var-code=200&var-code=204&var-code=205&var-code=301&var-code=302&var-code=303&var-code=304&var-code=307&var-code=400&var-code=401&var-code=403&var-code=404&var-code=405&var-code=406&var-code=409&var-code=410&var-code=412&var-code=413&var-code=415&var-code=429&var-code=500&var-
[18:15:28] <jynus>	 code=503&from=1581083545663&to=1581088017159&fullscreen&panelId=46
[18:16:04] <jynus>	 the requests that timed out seem to be at random moments (curl, queries, memcache) with no clear trend