[01:27:46] marostegui (+jymus): getting lots of errors today of the form "ERROR 2013 (HY000): Lost connection to MySQL server during query" / "ERROR 2006 (HY000): MySQL server has gone away", when trying to run queries on analytics-store from stat1003 ...
[01:27:51] ..bad weather?
[07:45:01] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751204 (Marostegui) So I would like to get another pair of eyes here, as if this goes wrong, we might need to rebuild the whole server :-( There are currently 3 new disks there that were not included in...
[07:48:07] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751208 (jcrespo) Looks good, maybe rebuilding one at a time, to avoid IO exhaustion?
[07:48:55] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751209 (Marostegui) Yeah - as I said, I would only add (and rebuild) one at a time.
[07:58:04] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751213 (jcrespo) Sorry, I overlooked that and looked only at the commands.
[08:07:43] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751253 (Marostegui) No worries! Better safe than sorry :)
[08:58:46] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751329 (Marostegui) After a chat with Jaime yesterday we decided to try one more thing: (The master getting inserts during the whole process) - Slave replicating from two masters (s1 and s3) without GTID an...
[09:00:14] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2751332 (jcrespo) CirrusSearch\BuildDocument\RedirectsAndIncomingLinks::countIncomingLinks is so slow, that it is getting...
[09:00:23] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2751333 (jcrespo) p:Normal>High
[09:23:53] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751372 (jcrespo) We need to test our use-case: single master with pt-heartbeat creating master events on the master's master. Should we put secondary datacenter masters with the same domain id as the primary...
[09:40:25] I see a bunch of (but still less than 15) "Can't connect to MySQL server on '10.64.16.144'" in the exception logs
[09:40:31] That's the s5 master (db1049)
[09:40:47] jynus: ^ Is that just noise?
[09:40:56] no
[09:41:02] I have reported that issue many times
[09:41:14] too many master connections create slowdown there
[09:42:58] Makes sense :/
[09:43:19] I did a lot to reduce the count in recent months… not sure what's left
[09:44:53] Do you have a breakdown of what happens on the masters most? Maybe there are some low-hanging fruit left on our side
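One way to get that kind of breakdown is to aggregate the processlist on the master by user. A minimal sketch, not the tooling actually used here; it assumes an account with the PROCESS privilege, and the 60-second threshold is an arbitrary choice:

```sql
-- Rough sketch: who is connected to the master, how many connections per
-- user/command, and how old the longest one is (requires PROCESS privilege).
SELECT user,
       command,
       COUNT(*)  AS connections,
       MAX(time) AS longest_seconds
  FROM information_schema.PROCESSLIST
 GROUP BY user, command
 ORDER BY connections DESC;

-- The individual long-running statements (60s cutoff is arbitrary):
SELECT id, user, host, db, time, LEFT(COALESCE(info, ''), 80) AS query
  FROM information_schema.PROCESSLIST
 WHERE command <> 'Sleep' AND time > 60
 ORDER BY time DESC;
```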
[09:45:15] in fact it is not new connections
[09:45:25] it is long-running connections that are the issue
[09:45:34] ah, hm
[09:45:55] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1059&from=now-1h&to=now
[09:45:57] vs
[09:46:05] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1049&from=now-1h&to=now
[09:46:12] but compare the new connections tab
[09:46:13] Probably the dispatchers then :/
[09:46:15] enwiki
[09:46:33] has 10 times more new connections
[09:46:33] They need master connections and run up to 15m (or something around that)
[09:46:43] that's crazy
[09:47:00] Indeed
[09:47:00] and blocks the 36 available connections
[09:47:16] why does that need a master connection?
[09:47:17] We've wanted to get rid of that for ages now :/
[09:47:32] there is replication control checking replication
[09:47:33] Because we use the master for tracking which wiki has received which changes
[09:47:34] ~/hoo 11:46> They need master connections and run up to 15m (or something around that) -> ???
[09:47:39] Wow!
[09:47:49] so - there is no reason :-)
[09:48:45] jynus: Well, we dispatch changes to clients on other shards
[09:48:58] yes, that you could do from the slave
[09:49:07] And track on the master up to which change these clients have received edits from Wikidata
[09:49:21] that you could sync using the slave
[09:49:22] We could potentially also track that on the clients' databases
[09:49:53] but we need to record that "enwiki" has received changes up to 123456
[09:50:02] yes, we call that gtid
[09:50:09] and the slave has that
[09:50:24] and AFAIK there is a core function for that
[09:50:39] I'm not talking about Master -> Slave
[09:50:45] but Wikidata -> Client wikis
[09:50:53] we need to (in software) handle changes on Wikidata
[09:51:00] and act upon them in the clients
[09:51:00] ok, so revision version?
[09:51:07] kind of?
[09:51:14] why not read it from the slave?
[09:51:24] Well, we can *read* it from there
[09:51:35] but we need to track up to where we have dispatched somewhere
[09:51:58] Say we have changes 0 - 1000 and dispatch in batches of 100
[09:52:13] We will need to record that we dispatched up to 100, 200, … after each run
[09:52:16] ok, so you are using mysql as a unique sync place
[09:52:22] which is a horrible idea
[09:52:26] Yes and yes
[09:52:28] and a SPOF
[09:52:35] not mysql
[09:52:43] using the master, which is the most delicate thing
[09:53:01] I would use something else, but we could set up a dedicated mysql aside from the master for that
[09:53:23] it would be a single place, no code changes, but it would alleviate the master's load
[09:54:54] how do you coordinate changes using a mysql lock?
[09:55:13] We do the most horrible thing
[09:55:18] We set the lock in the table
[09:55:27] don't tell me you write rows?
[09:55:34] a full lock?
[09:55:35] hoo: you just described a queue
[09:55:40] a full table lock?
[09:55:51] jynus: No, we store it in a field
[09:56:09] so you amplify writes on the master, which get sent to 20 slaves
[09:56:12] (that's on the long list of stuff that should die on our side)
[09:56:18] and then only read from the master
[09:56:18] yes
[09:56:20] yes
[09:57:31] I'm afk for a bit… I'll come back to you later and maybe CC you on the tickets
[10:00:29] <_joe_> hoo: https://phabricator.wikimedia.org/T149408 slightly related :)
[10:04:07] actually, joe, it is not related. I mean, it is of interest to him, but the issue here is not the job queue handling itself, but the model of synchronization between events
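The "core function" mentioned above is likely something along the lines of MariaDB's MASTER_GTID_WAIT(). A hypothetical sketch of the "read it from the slave" idea: the dispatcher waits on a replica until a known GTID has been applied and then reads its coordination state there, instead of holding a long-running master connection. The GTID value and the table/column names are placeholders, not the actual Wikibase schema:

```sql
-- Hypothetical sketch: run against a replica, not the master.
-- Wait (up to 5 seconds) until the replica has applied the given GTID;
-- MASTER_GTID_WAIT() returns 0 once caught up, -1 on timeout.
SELECT MASTER_GTID_WAIT('0-171966669-123456', 5);

-- ...and only then read the dispatch tracking row from the replica.
-- Table and column names here are illustrative only.
SELECT chd_seen, chd_touched
  FROM wb_changes_dispatch
 WHERE chd_site = 'enwiki';
```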
[10:04:30] <_joe_> yes
[10:04:55] I have the intention to create s8
[10:04:57] <_joe_> that should be done with a queue and not on the database
[10:05:10] so that if it explodes, it explodes alone :-)
[10:05:24] <_joe_> eheh ok that's a tactic that would be effective for sure
[10:05:31] <_joe_> what do you want to move to s8?
[10:05:35] <_joe_> just wikidata?
[10:05:39] not really a queue - anything that doesn't involve 20 servers
[10:05:49] for a write
[10:06:06] in memory, preferably
[10:06:40] yes, the idea is to move wikidata to s8
[10:06:49] then maybe rebalance the rest a bit
[10:07:16] s6 is underused
[10:07:28] and s5 would be without wikidata
[10:25:11] db1073 is unusually lagged, but it is not even pooled
[10:25:31] I will probably move it around, to a place where it is needed
[10:26:36] you think it will perform better somewhere else?
[10:26:47] is it because it cannot cope with the load?
[10:26:58] no idea
[10:27:13] but a reimage would either fix it or show the issue :-)
[10:32:04] I replied on the ticket, I can take care of it if you like
[10:32:46] marostegui, if you take everything I suggest we do, you will end up like me :-)
[10:32:57] hahaha
[10:33:10] well, we don't like lagged slaves :)
[10:33:42] well, first we should think about what we should do with it
[10:34:14] probably try to rule out a hardware/config issue, otherwise it will probably lag in any other shard I guess
[10:34:21] oh db1073 is the one you compressed?
[10:34:34] oh
[10:34:38] that is bad news
[10:34:43] :(
[10:35:00] Well, we can see if the same behaviour arises in dbstore2001
[10:35:58] does it have higher CPU usage?
[10:41:10] https://ganglia.wikimedia.org/latest/graph_all_periods.php?title=cpu_idle&mreg[]=%5Ecpu_idle%24&hreg[]=%5Edb1073&aggregate=1&hl=db1073.eqiad.wmnet|MySQL%20eqiad
[10:41:13] not really
[10:41:24] I do not see an innodb difference either
[10:41:36] but let me add some graphs regarding compression failures
[10:49:54] page_size | compress_ops | compress_ops_ok | compress_time | uncompress_ops | uncompress_time
[10:50:01] 8192 | 140675338 | 139309885 | 136306 | 345211128 | 77054
[10:50:09] Blocked-on-schema-change, DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2751593 (Marostegui) db2069, db2070, dbstore2002, db2042 are done. The only pending one in codfw db2016 (the master, which will be done on Monday) ```...
[10:50:51] that looks good to me, very small pct of compression failures
[10:51:13] true
[10:51:28] might be something else then, we can see how dbstore2001 behaves
[10:51:32] because it also has the other threads
[10:51:38] so it can be an interesting test
[10:51:42] lunch :)
[11:00:47] 82910901 | 81407895 for dbstore1002, which is really good, too
[11:00:52] *2002
[11:00:55] *2001
[11:56:01] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751657 (Marostegui) I can try that in my lab too. Master 1 -> Master 2 -> MultiSource + GTID slave To be honest, I would go for a different domain_id on every host because that would make things easier fro...
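The compression counters pasted above (compress_ops vs compress_ops_ok) match the columns of information_schema.INNODB_CMP; a query along these lines, run on the compressed host, would produce those figures and express the failure rate as a percentage (the exact query used in the channel is not shown):

```sql
-- Per page size: compression attempts, attempts that succeeded on the first
-- try, and the percentage that failed (forcing a page reorganisation).
SELECT page_size,
       compress_ops,
       compress_ops_ok,
       ROUND(100 * (compress_ops - compress_ops_ok) / compress_ops, 2)
         AS failed_pct
  FROM information_schema.INNODB_CMP
 WHERE compress_ops > 0;
```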
[12:41:48] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751721 (Marostegui) I have tested these two scenarios (with the normal pt-heartbeat tool not our modified one) --- A) Master 1 (pt-heartbeat) -> Master 2 -> MultiSource + GTID slave Master 3 (pt-heartbeat)...
[12:44:49] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751725 (jcrespo) We can puppetize it (but not deploy today), and start rolling it, on misc and codfw first.
[12:46:41] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2751726 (Marostegui) >>! In T146261#2751725, @jcrespo wrote: > We can puppetize it (but not deploy today), and start rolling it, on misc and codfw first. > > Maybe tracking it on a separate ticket, make this...
[12:56:01] DBA: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2751731 (Marostegui)
[12:56:44] DBA: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2751731 (Marostegui)
[13:42:04] DBA: Prepare for mariadb 10.1 - https://phabricator.wikimedia.org/T149422#2751899 (jcrespo)
[13:42:50] DBA: Prepare for mariadb 10.1 - https://phabricator.wikimedia.org/T149422#2751914 (jcrespo)
[13:42:52] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2751913 (jcrespo)
[13:43:44] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial setup and provision of labsdb1009, labsdb1010 and labsdb1011 - https://phabricator.wikimedia.org/T140452#2465396 (jcrespo)
[13:43:46] DBA, Labs, Labs-Infrastructure: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2751915 (jcrespo)
[13:44:34] DBA, Labs, Labs-Infrastructure, Epic, Tracking: Labs databases rearchitecture (tracking) - https://phabricator.wikimedia.org/T140788#2751919 (jcrespo)
[13:44:36] DBA, Labs, Labs-Infrastructure: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#2546917 (jcrespo)
[14:00:00] DBA: Prepare for mariadb 10.1 - https://phabricator.wikimedia.org/T149422#2751969 (jcrespo) a:jcrespo
[14:17:14] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2752031 (jcrespo)
[14:17:18] DBA, Labs, Labs-Infrastructure: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2752030 (jcrespo)
[14:35:14] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2752066 (EBernhardson) I'll re-put together the stuff that queries this out of elasticsearch instead of mysql. It's comple...
[14:37:22] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2752075 (jcrespo) I do not think this is unfixable, I just need the time, which I probably will have soon.
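For the gtid_domain_id work in T149418 and the multi-source scenarios above, a rough sketch of the idea: give each master its own GTID domain so a multi-source replica can track one GTID position per stream. Domain id values, hostnames and credentials below are placeholders; the real values would be puppetised per host/shard:

```sql
-- On each master (different placeholder value per master, e.g. s1 vs s3):
SET GLOBAL gtid_domain_id = 1;
SELECT @@gtid_domain_id, @@server_id;   -- sanity check

-- On the multi-source replica (hostnames/credentials are placeholders):
CHANGE MASTER 's1' TO
  MASTER_HOST='s1-master.example', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_USE_GTID=slave_pos;
CHANGE MASTER 's3' TO
  MASTER_HOST='s3-master.example', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_USE_GTID=slave_pos;
START ALL SLAVES;
SHOW ALL SLAVES STATUS\G
```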
[14:42:53] https://jira.mariadb.org/browse/MDEV-11101 I think I will fix this for wmf packages
[16:22:18] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932#2752354 (EBernhardson) To get an idea of what the latencies in elasticsearch look like i pulled a histogram from our histo...
[16:54:47] DBA, Operations: dbtree broken - https://phabricator.wikimedia.org/T149357#2752434 (Aklapper)
[22:26:35] DBA, Operations, ops-codfw: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2753248 (RobH)