[01:27:46] marostegui (+jymus): getting lots of errors today of the form "ERROR 2013 (HY000): Lost connection to MySQL server during query" / "ERROR 2006 (HY000): MySQL server has gone away", when trying to run queries on analytics-store from stat1003 ...
[01:27:51] ..bad weather?
[07:45:01] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751204 (Marostegui) So I would like to get another pair of eyes here, as if this goes wrong, we might need to rebuild the whole server :-( There are currently 3 new disks there that were not included in...
[07:48:07] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751208 (jcrespo) Looks good, maybe rebuilding one at a time, to avoid IO exhaustion?
[07:48:55] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751209 (Marostegui) Yeah - as I said, I would only add (and rebuild) one at a time.
[07:58:04] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751213 (jcrespo) Sorry, I overlooked that and looked only at the commands.
[08:07:43] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751253 (Marostegui) No worries! Better safe than sorry :)
[08:58:46] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751329 (Marostegui) After a chat with Jaime yesterday we decided to try one more thing: (The master getting inserts during the whole process) - Slave replicating from two masters (s1 and s3) without GTID an...
[09:00:14] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2751332 (jcrespo) CirrusSearch\BuildDocument\RedirectsAndIncomingLinks::countIncomingLinks is so slow, that it is getting...
[09:00:23] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2751333 (jcrespo) p:Normal>High
[09:23:53] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751372 (jcrespo) We need to test our use-case: single master with pt-heartbeat creating master events on the master's master. Should we put secondary datacenter masters with the same domain id as the primary...
[09:40:25] I see a bunch of (but still less than 15) "Can't connect to MySQL server on '10.64.16.144'" in the exception logs
[09:40:31] That's the s5 master (db1049)
[09:40:47] jynus: ^ Is that just noise?
[09:40:56] no
[09:41:02] I have reported that issue many times
[09:41:14] too many master connections create slowdown there
[09:42:58] Makes sense :/
[09:43:19] I did a lot to reduce the count in recent months… not sure what's left
[09:44:53] Do you have a breakdown of what happens on the masters most? Maybe there are some low-hanging fruit left on our side
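One way to get that kind of breakdown is to aggregate the processlist on the master by user. A minimal sketch, not the tooling actually used here; it assumes an account with the PROCESS privilege, and the 60-second threshold is an arbitrary choice:

```sql
-- Rough sketch: who is connected to the master, how many connections per
-- user/command, and how old the longest one is (requires PROCESS privilege).
SELECT user,
       command,
       COUNT(*)  AS connections,
       MAX(time) AS longest_seconds
  FROM information_schema.PROCESSLIST
 GROUP BY user, command
 ORDER BY connections DESC;

-- The individual long-running statements (60s cutoff is arbitrary):
SELECT id, user, host, db, time, LEFT(COALESCE(info, ''), 80) AS query
  FROM information_schema.PROCESSLIST
 WHERE command <> 'Sleep' AND time > 60
 ORDER BY time DESC;
```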
[09:45:15] in fact it is not new connections
[09:45:25] it is long-running connections that are the issue
[09:45:34] ah, hm
[09:45:55] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1059&from=now-1h&to=now
[09:45:57] vs
[09:46:05] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1049&from=now-1h&to=now
[09:46:12] but compare the new connections tab
[09:46:13] Probably the dispatchers then :/
[09:46:15] enwiki
[09:46:33] has 10 times more new connections
[09:46:33] They need master connections and run up to 15m (or something around that)
[09:46:43] that's crazy
[09:47:00] Indeed
[09:47:00] and blocks the 36 available connections
[09:47:16] why does that need a master connection?
[09:47:17] We've wanted to get rid of that for ages now :/
[09:47:32] there is replication control checking replication
[09:47:33] Because we use the master for tracking which wiki has received which changes
[09:47:34] ~/hoo 11:46> They need master connections and run up to 15m (or something around that) -> ???
[09:47:39] Wow!
[09:47:49] so - there is no reason :-)
[09:48:45] jynus: Well, we dispatch changes to clients on other shards
[09:48:58] yes, that you could do from the slave
[09:49:07] And track on the master up to which change these clients have received edits from Wikidata
[09:49:21] that you could sync using the slave
[09:49:22] We could potentially also track that on the clients' databases
[09:49:53] but we need to record that "enwiki" has received changes up to 123456
[09:50:02] yes, we call that gtid
[09:50:09] and the slave has that
[09:50:24] and AFAIK there is a core function for that
[09:50:39] I'm not talking about Master -> Slave
[09:50:45] but Wikidata -> Client wikis
[09:50:53] we need to (in software) handle changes on Wikidata
[09:51:00] and act upon them in the clients
[09:51:00] ok, so revision version?
[09:51:07] kind of?
[09:51:14] why not read it from the slave?
[09:51:24] Well, we can *read* it from there
[09:51:35] but we need to track up to where we have dispatched somewhere
[09:51:58] Say we have changes 0 - 1000 and dispatch in batches of 100
[09:52:13] We will need to record that we dispatched up to 100, 200, … after each run
[09:52:16] ok, so you are using mysql as a unique sync place
[09:52:22] which is a horrible idea
[09:52:26] Yes and yes
[09:52:28] and a SPOF
[09:52:35] not mysql
[09:52:43] using the master, which is the most delicate thing
[09:53:01] I would use something else, but we could set up a dedicated mysql aside from the master for that
[09:53:23] it would be a single place, no code changes, but it would alleviate the master's load
[09:54:54] how do you coordinate changes using a mysql lock?
[09:55:13] We do the most horrible thing
[09:55:18] We set the lock in the table
[09:55:27] don't tell me you write rows?
[09:55:34] a full lock?
[09:55:35] hoo: you just described a queue
[09:55:40] a full table lock?
[09:55:51] jynus: No, we store it in a field
[09:56:09] so you amplify writes on the master, which get sent to 20 slaves
[09:56:12] (that's on the long list of stuff that should die on our side)
[09:56:18] and then only read from the master
[09:56:18] yes
[09:56:20] yes
[09:57:31] I'm afk for a bit… I'll come back to you later and maybe CC you on the tickets
[10:00:29] <_joe_> hoo: https://phabricator.wikimedia.org/T149408 slightly related :)
[10:04:07] actually, joe, it is not related. I mean, it is of interest to him, but the issue here is not the job queue handling itself, but the model of synchronization between events
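The "core function" mentioned above is likely something along the lines of MariaDB's MASTER_GTID_WAIT(). A hypothetical sketch of the "read it from the slave" idea: the dispatcher waits on a replica until a known GTID has been applied and then reads its coordination state there, instead of holding a long-running master connection. The GTID value and the table/column names are placeholders, not the actual Wikibase schema:

```sql
-- Hypothetical sketch: run against a replica, not the master.
-- Wait (up to 5 seconds) until the replica has applied the given GTID;
-- MASTER_GTID_WAIT() returns 0 once caught up, -1 on timeout.
SELECT MASTER_GTID_WAIT('0-171966669-123456', 5);

-- ...and only then read the dispatch tracking row from the replica.
-- Table and column names here are illustrative only.
SELECT chd_seen, chd_touched
  FROM wb_changes_dispatch
 WHERE chd_site = 'enwiki';
```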
[10:04:30] <_joe_> yes
[10:04:55] I have the intention to create s8
[10:04:57] <_joe_> that should be done with a queue and not on the database
[10:05:10] so that if it explodes, it explodes alone :-)
[10:05:24] <_joe_> eheh ok that's a tactic that would be effective for sure
[10:05:31] <_joe_> what do you want to move to s8?
[10:05:35] <_joe_> just wikidata?
[10:05:39] not really a queue - anything that doesn't involve 20 servers
[10:05:49] for a write
[10:06:06] in memory, preferably
[10:06:40] yes, the idea is to move wikidata to s8
[10:06:49] then maybe rebalance the rest a bit
[10:07:16] s6 is underused
[10:07:28] and s5 would be without wikidata
[10:25:11] db1073 is unusually lagged, but it is not even pooled
[10:25:31] I will probably move it around, to a place where it is needed
[10:26:36] you think it will perform better somewhere else?
[10:26:47] is it because it cannot cope with the load?
[10:26:58] no idea
[10:27:13] but a reimage would either fix it or show the issue :-)
[10:32:04] I replied on the ticket, I can take care of it if you like
[10:32:46] marostegui, if you take everything I suggest we do, you will end up like me :-)
[10:32:57] hahaha
[10:33:10] well, we don't like lagged slaves :)
[10:33:42] well, first we should think about what we should do with it
[10:34:14] probably try to rule out a hardware/config issue, otherwise it will probably lag in any other shard I guess
[10:34:21] oh db1073 is the one you compressed?
[10:34:34] oh
[10:34:38] that is bad news
[10:34:43] :(
[10:35:00] Well, we can see if the same behaviour arises in dbstore2001
[10:35:58] does it have higher CPU usage?
[10:41:10] https://ganglia.wikimedia.org/latest/graph_all_periods.php?title=cpu_idle&mreg[]=%5Ecpu_idle%24&hreg[]=%5Edb1073&aggregate=1&hl=db1073.eqiad.wmnet|MySQL%20eqiad
[10:41:13] not really
[10:41:24] I do not see an innodb difference either
[10:41:36] but let me add some graphs regarding compression failures
[10:49:54] page_size | compress_ops | compress_ops_ok | compress_time | uncompress_ops | uncompress_time
[10:50:01] 8192 | 140675338 | 139309885 | 136306 | 345211128 | 77054
[10:50:09] Blocked-on-schema-change, DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2751593 (Marostegui) db2069, db2070, dbstore2002, db2042 are done. The only pending one in codfw db2016 (the master, which will be done on Monday) ```...
[10:50:51] that looks good to me, very small pct of compression failures
[10:51:13] true
[10:51:28] might be something else then, we can see how dbstore2001 behaves
[10:51:32] because it also has the other threads
[10:51:38] so it can be an interesting test
[10:51:42] lunch :)
[11:00:47] 82910901 | 81407895 for dbstore1002, which is really good, too
[11:00:52] *2002
[11:00:55] *2001
[11:56:01] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751657 (Marostegui) I can try that in my lab too. Master 1 -> Master 2 -> MultiSource + GTID slave To be honest, I would go for a different domain_id on every host because that would make things easier fro...
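The compression counters pasted above (compress_ops vs compress_ops_ok) match the columns of information_schema.INNODB_CMP; a query along these lines, run on the compressed host, would produce those figures and express the failure rate as a percentage (the exact query used in the channel is not shown):

```sql
-- Per page size: compression attempts, attempts that succeeded on the first
-- try, and the percentage that failed (forcing a page reorganisation).
SELECT page_size,
       compress_ops,
       compress_ops_ok,
       ROUND(100 * (compress_ops - compress_ops_ok) / compress_ops, 2)
         AS failed_pct
  FROM information_schema.INNODB_CMP
 WHERE compress_ops > 0;
```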
[12:41:48] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751721 (Marostegui) I have tested these two scenarios (with the normal pt-heartbeat tool not our modified one) --- A) Master 1 (pt-heartbeat) -> Master 2 -> MultiSource + GTID slave Master 3 (pt-heartbeat)...
[12:44:49] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751725 (jcrespo) We can puppetize it (but not deploy today), and start rolling it, on misc and codfw first.
[12:46:41] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751726 (Marostegui) >>! In T146261#2751725, @jcrespo wrote: > We can puppetize it (but not deploy today), and start rolling it, on misc and codfw first. > > Maybe tracking it on a separate ticket, make this...
[12:56:01] DBA: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2751731 (Marostegui)
[12:56:44] DBA: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2751731 (Marostegui)
[13:42:04] DBA: Prepare for mariadb 10.1 - https://phabricator.wikimedia.org/T149422#2751899 (jcrespo)
[13:42:50] DBA: Prepare for mariadb 10.1 - https://phabricator.wikimedia.org/T149422#2751914 (jcrespo)
[13:42:52] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2751913 (jcrespo)
[13:43:44] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2465396 (jcrespo)
[13:43:46] DBA, Labs, Labs-Infrastructure: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2751915 (jcrespo)
[13:44:34] DBA, Labs, Labs-Infrastructure, Epic, Tracking: Labs databases rearchitecture (tracking) - https://phabricator.wikimedia.org/T140788#2751919 (jcrespo)
[13:44:36] DBA, Labs, Labs-Infrastructure: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2546917 (jcrespo)
[14:00:00] DBA: Prepare for mariadb 10.1 - https://phabricator.wikimedia.org/T149422#2751969 (jcrespo) a:jcrespo
[14:17:14] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2752031 (jcrespo)
[14:17:18] DBA, Labs, Labs-Infrastructure: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2752030 (jcrespo)
[14:35:14] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2752066 (EBernhardson) I'll re-put together the stuff that queries this out of elasticsearch instead of mysql. It's comple...
[14:37:22] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2752075 (jcrespo) I do not think this is unfixable, I just need the time, which I probably will have soon.
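For the gtid_domain_id work in T149418 and the multi-source scenarios above, a rough sketch of the idea: give each master its own GTID domain so a multi-source replica can track one GTID position per stream. Domain id values, hostnames and credentials below are placeholders; the real values would be puppetised per host/shard:

```sql
-- On each master (different placeholder value per master, e.g. s1 vs s3):
SET GLOBAL gtid_domain_id = 1;
SELECT @@gtid_domain_id, @@server_id;   -- sanity check

-- On the multi-source replica (hostnames/credentials are placeholders):
CHANGE MASTER 's1' TO
  MASTER_HOST='s1-master.example', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_USE_GTID=slave_pos;
CHANGE MASTER 's3' TO
  MASTER_HOST='s3-master.example', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_USE_GTID=slave_pos;
START ALL SLAVES;
SHOW ALL SLAVES STATUS\G
```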
[14:42:53] https://jira.mariadb.org/browse/MDEV-11101 I think I will fix this for wmf packages
[16:22:18] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2752354 (EBernhardson) To get an idea of what the latencies in elasticsearch look like i pulled a histogram from our histo...
[16:54:47] DBA, Operations: dbtree broken - https://phabricator.wikimedia.org/T149357#2752434 (Aklapper)
[22:26:35] DBA, Operations, ops-codfw: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2753248 (RobH)