[07:45:09] DBA, Operations, ops-eqiad: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (jcrespo) Open→Resolved No differences found on s3, s2 tables between source backups and production. Issue fixed.
[08:14:28] Hi DBAs!
[08:15:09] dbA today
[08:15:39] Vacation? :D
[08:15:43] so only defcon4 and above today :-)
[08:16:08] We tried going back to 6 million yesterday, then rolled back to 2 million due to the same increase in connections and degraded worst response time.
[08:16:46] After looking back in the evening I realized that this pattern only happened on the most pooled server, and all the other replicas seemed to behave just fine
[08:17:28] I wonder if, as you said a few days ago, the load might need shifting around given the new queries
[08:17:44] we already shifted it as much as we could
[08:18:32] Aaah, okay
[08:19:11] I'm guessing the weights are resource based?
[08:19:33] the other hosts handle api, and other requests
[08:20:36] and is that decided by the "group" passed through the code in mediawiki?
[08:20:50] or, rather, the entry point?
[08:22:24] sorry, I need to check why db1101:s8 has been producing 200K errors since 5am
[08:22:53] ack!
[08:23:48] https://usercontent.irccloud-cdn.com/file/dgGcvYzk/image.png
[08:23:56] i see a spike in db performance for one of our queries
[08:24:13] there is a 26622 query running there
[08:24:15] *Second
[08:24:23] SpecialFewestRevisions::reallyDoQuery
[08:24:28] O_o
[08:28:59] still a lot of lag
[08:29:49] dbperf errors since 5am https://logstash.wikimedia.org/goto/442974b248eab0759127d0b9a1a412cc
[08:29:58] for the terms table related queries
[08:30:46] that query caused a lot of purge lag: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1101&var-port=13318&fullscreen&panelId=11&from=1582770635506&to=1582792235506
[08:31:05] I will wait for that to go away and see how it evolves
[08:32:19] I wonder if we will be able to make this terms table normalization work at all now :/ it's not looking good
[08:32:39] but I am not sure the root cause is fixed, there is an increase in table scans
[08:32:53] still going on
[08:35:29] ok, going down now, we may have recovered
[08:35:55] Interesting, it looks like that query may also have caused dispatching to slow down, causing the dispatching-lag-based max lag to reach 5 and thus the edit rate to slow, https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=1582770915673&to=1582792515673
[08:36:24] well, it made that lag 1-2 seconds, which means all threads going to that host stall for 1-2 seconds
[08:36:25] it also started between 5:10 and 5:20 and just recovered
[08:36:40] ack, okay
[08:36:54] because the wait happens for >0 seconds, but things don't get depooled until they get to 5
[08:37:17] and in many cases, performance issues cause it to be held at 1-2 seconds
[08:37:27] in this case due to that query
[08:37:36] but the same thing happens with the other deployment
[08:38:02] and yes, the db performance logstash messages are also disappearing now :)
[08:40:42] jynus: https://phabricator.wikimedia.org/T238199
[08:41:23] I was writing there, not the first time it happened
[08:41:43] sadly, as tables get bigger (relevant here), queries that read all rows get slower
[08:42:21] yup, i'll try to raise it with our product people and see what we want to do (probably just turn it off)
[08:42:29] nothing to do with wikidata
[08:42:37] it was a wink to the other issue
[08:42:44] that is pure mw special pages
[08:42:57] just happens that wikidata has the most pages/etc.
[08:43:25] yupp
[08:44:02] also telling you that on days when there is reduced DBA availability, I may ignore you
[08:44:09] because fires
[08:44:16] (but not on purpose)
[08:44:45] but feel free to ping me on tickets and/or schedule a meeting
[08:46:34] I really don't think you should focus on that issue T238199 right now
[08:46:35] T238199: SpecialFewestRevisions::reallyDoQuery takes more than 9h to run - https://phabricator.wikimedia.org/T238199
[08:46:41] * addshore will not :)
[08:46:52] * addshore is focusing on this terms thing
[08:47:21] there is an extra wikidata server that may be available tomorrow
[08:47:37] we should reevaluate then
[08:47:45] That is an interesting fact
[08:47:54] but tell me, what percentage of items is migrated?
[08:48:24] migrated in terms of data being written, we are at about 50 million out of 80 million
[08:48:36] no, in terms of utilization
[08:48:51] in terms of reading, wikidata.org has been reading up to 10 million items for months, clients currently only read up to 2 million
[08:49:07] whenever we go to 6 million we see these connection issues on the replica with the most weight
[08:49:27] is batch write migration ongoing?
[08:49:55] one thing to note is that when we go to 6 million we have crossed the halfway point in terms of read queries; at that point more queries are hitting the new tables than the old tables
[08:50:09] as the smaller Q numbers have a much wider use than the larger Q numbers
[08:50:23] yes, I counted on that
[08:50:33] do you know about the write thing?
[08:50:46] I am asking because if we are around 75%
[08:50:54] maybe we should wait for it to finish
[08:51:00] batch write migration is indeed ongoing
[08:51:12] as that creates a tax on writes
[08:51:16] which causes lag
[08:51:26] which is the #1 cause of performance penalty
[08:51:31] okay
[08:51:39] did you see this last unrelated issue?
[08:51:43] that was on one server
[08:52:24] the issue happens on all servers when lag starts to happen due to general traffic
[08:52:35] it just tips over on one first
[08:53:14] I am not saying that will solve things, but if we are past a threshold, maybe we can go all the way (minus the write-only-to-new part)
[08:53:30] Yes, replication lag got a bit spiky
[08:53:30] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1582742236847&to=1582754266357&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104
[08:53:44] the other thing to consider is that we got all the inconveniences of the migration
[08:54:00] okay, I might experiment today then, and pause the batch write migration, and try Q6 million again for a period
[08:54:01] but not yet the main advantage, which is dropping the monolithic table
[08:54:18] addshore: that would be one way to test it
[08:54:29] also the new server
[08:54:32] but the main thing here
[08:54:43] yes, we are looking forward to being able to stop writing to the old table, but we ideally wanted to make sure we could get past this halfway mark without too much pain first
[08:54:51] is to know why; if you know why, any problem can be solved
[08:55:09] yupp, after what you have said the why sounds like it might be the writes!
[08:55:16] thanks for the chat! will report back with findings!
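
For reference on the max lag mechanism discussed above (replicas held at 1-2 seconds of lag, depooling only at 5, bot edit rates slowing down): MediaWiki API clients are expected to send a maxlag parameter so the servers can refuse writes while replication lag is high. The following is a minimal, illustrative Python sketch of such a client; the wikidata.org endpoint, the requests library and the function names are assumptions for the example, not the actual dispatcher or bot code.

    import time
    import requests

    API = "https://www.wikidata.org/w/api.php"  # example endpoint for illustration

    def api_post(session, params, max_retries=5):
        """POST to the MediaWiki API with maxlag=5, backing off while replicas lag.

        If replication lag on a pooled replica exceeds maxlag, the API rejects
        the request with error code "maxlag" and suggests a wait via the
        Retry-After header, which is what slows bot edit rates when a replica
        is struggling.
        """
        payload = dict(params, maxlag=5, format="json")
        for _ in range(max_retries):
            resp = session.post(API, data=payload)
            data = resp.json()
            if data.get("error", {}).get("code") == "maxlag":
                wait = int(resp.headers.get("Retry-After", 5))  # default 5s
                time.sleep(wait)
                continue
            return data
        raise RuntimeError("replicas stayed lagged; giving up")

    # usage: api_post(requests.Session(), {"action": "query", "meta": "siteinfo"})
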
[08:55:17] I am not saying it is
[08:55:21] I am asking questions
[08:55:30] please don't think I have all the answers :-D
[08:55:44] you might think you don't, but you often do ;)
[08:55:58] there is a bit of wishful thinking there
[08:56:01] but
[08:56:26] writes create locks, which create stalls, which create lag
[08:56:31] so it could be
[08:56:45] we can also enable the slow log
[08:56:55] and get an accurate picture of what is going on
[08:57:08] to evaluate what is slow and why
[08:57:41] and we also have 1 more patch to merge and 1 more patch to write, splitting up some transactions that can be made smaller
[08:57:42] It is always: detect symptoms -> debug -> mitigate -> iterate :-)
[08:57:59] but you always have to keep in mind what the problem is
[08:58:05] I might come back asking for the slow log once I've tried these other 2/3 things :)
[08:58:05] splitting things is ok
[08:58:27] but one has to be clear about the advantages and the whys
[08:58:49] as I said to amir yesterday, in some cases splitting may only move the problem to the app layer
[08:59:29] it is ok to go slow on planning, is what I am saying
[09:00:31] Yup! Amir is on vacation from today, but indeed, small steps, evaluate the impact, and re-measure etc :)
[09:01:00] especially root cause analysis as in "big picture", not just "X query is slow"
[09:01:14] ah, now queries use more CPU because X
[09:01:21] or read more rows because Y
[09:01:35] or writes are stalling each other because Z
[09:01:53] we have debugging tools for that, but not enabled all the time
[09:56:29] jynus: FYI backup1001 has had puppet disabled for a week, it got removed from PuppetDB, hence it's reported in the Netbox report.
[09:57:56] enabled
[10:04:19] thx
[10:21:54] DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (jcrespo)
[11:42:55] FYI paused the item term bulk migration at Q50 million, will try having reads up to Q6 million again in a few hours (I'm heading out momentarily though)
[11:43:34] cool, ping me when done
[11:44:12] will do
[12:21:16] going to increase the reads again now and monitor what happens (back to 6 million)
[12:29:09] no immediate spike in connections this time (reading from Q6 million) now that the migration script has stopped, I'll leave it there for a bit and see how it develops
[12:34:46] will test at 8 million as well
[12:41:50] 8 million also seems fine, I'll leave it at this level and monitor
[12:47:03] actually
[12:47:09] let me know before...
[12:47:13] I think I am late
[12:47:19] I wanted to try something
[12:47:28] which was warming up the new tables
[12:47:38] to see if it was just a cache issue
[12:48:01] and we are just trying to read 1TB of cold data from disk
[12:48:04] well, the new tables (up to 10 million) should already be getting rad from a bit (from wikidata.org)
[12:48:12] *read
[12:48:24] clearly not with the same pattern
[12:48:34] or otherwise there wouldn't be issues :-D
[12:48:55] let me try it before the next push
[12:48:57] well, the # of reqs from all clients is drastically more than just from wikidata.org
[12:48:59] ack!
[12:49:08] but I need to know the involved tables
[12:49:21] could you tell me the table names and the range that would be "exposed"
[12:49:29] e.g. normally that would be the primary key
[12:49:35] so, at 8 million now, and I see the process list slightly increased, and more processes spending some time in "sending data", but all appear to be within normal operating ranges currently
[12:49:36] * a PK range
[12:50:07] https://usercontent.irccloud-cdn.com/file/dhR4QcIW/image.png
[12:50:21] let's check the master
[12:50:30] and the main replica
[12:51:41] that is: master https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104
[12:52:16] and main replica: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104
[12:52:23] migration is still paused currently,
[12:52:38] I can see it is much less qps overall on all servers
[12:54:21] however, the main traffic pattern changed at 8:30, I am going to guess based on edit patterns
[12:55:05] yup, 8:30 was you killing that query, causing the edit rate on wikidata to increase to normal levels
[12:55:43] worst response time decreased a bit :)
[12:56:21] open connections literally just spiked though..
[12:58:39] it is mostly DatabaseTermInLangIdsResolver::selectTermsViaJoin
[12:58:58] yupp
[12:59:30] but the question is, is it intrinsically slow, or is it because of cold caches?
[12:59:47] that is why I wanted to warm up the tables first, to rule that out
[13:00:04] Okay, well, I am getting a haircut in 15 mins, so I can revert back to Q2 million
[13:00:18] and in an hour or so we can warm the caches and try again to 8 million?
[13:00:23] yes
[13:00:27] great!
[13:00:27] but before you go
[13:00:30] yup
[13:00:34] could you point me to the tables to warm up
[13:00:38] yup!
[13:00:39] is it the ones from that query?
[13:00:47] I'll give you a full list shortly
[13:00:51] between 2 and 6 million
[13:00:53] ok
[13:04:04] so the tables would be:
[13:04:06] wbt_item_terms
[13:04:06] wbt_term_in_lang
[13:04:06] wbt_text_in_lang
[13:04:06] wbt_type
[13:04:06] wbt_text
[13:05:18] If you can choose to warm up parts of tables, then I could give you a range based on the PK for wbt_item_terms
[13:05:32] but the other tables would just have to be warmed up in their entirety
[13:07:07] okay, back at reading up to Q2 million now
[13:09:42] * addshore is off for ~ 1 hour now
[14:05:21] * addshore is back
[14:06:13] jynus: if you're ready to warm some caches, I'm ready to go to 8 million again :)
[14:06:36] sorry, I just saw the list of tables now
[14:06:50] ack, no rush :)
[14:07:04] what is the range?
[14:07:38] wbit_id 2M to 8M?
[14:07:52] hmmm, nope, wbit_item_id 2M to 8M
[14:07:59] I see
[14:08:12] is wbit_item_id the access method too?
[14:08:22] primary access method?
[14:08:35] Yes, one of the primary access methods
[14:09:05] ok, if I select by it, it will warm up the PK too
[14:11:20] I will only select 1 million rows at a time
[14:11:27] ack!
[14:12:31] the other tables are accessed by PKs?
[14:12:52] should I join them the same why the join query did?
[14:13:16] *way
[14:13:42] could you help me build a coherent query, you may know the field names better
[14:13:47] right now I am running:
[14:13:57] select * FROM wbt_item_terms wbit_item_id WHERE wbit_item_id BETWEEN 2000000 AND 6000000 ORDER BY wbit_item_id LIMIT X000000, 1000000;
[14:14:21] Yes! *finds a link*
[14:14:46] just need the ON conditions with the table fields
[14:15:00] I put some handy dandy queries in https://doc.wikimedia.org/Wikibase/master/php/md_docs_storage_terms.html let me form one into something with a range
[14:15:30] https://www.irccloud.com/pastebin/xaD5pLA9/
[14:15:32] yeah, I think I can just reuse the first one
[14:15:44] thanks
[14:15:54] let me limit the range first
[14:15:58] ack!
[14:16:27] so basically the cache warming is just doing some selects of the bits of the tables you're going to be reading from? :)
[14:16:28] the thing is, if there is only 1 of these long-running queries
[14:16:43] it should not cause many issues, but maybe if 500 run at the same time
[14:16:46] it may
[14:20:22] so there are 100 million rows in that range
[14:20:35] just on the base table
[14:21:23] tip (only if you want to hear it)
[14:21:37] i do :)
[14:21:38] BETWEEN is more elegant when possible
[14:21:43] ack!
[14:23:13] and if you are on the mysql command line and want to test speed but don't care about data
[14:23:26] you can execute on the mysql client: pager cat > /dev/null
[14:23:38] and data will be discarded (client side only)
[14:23:54] dd may be more efficient, haven't tested it
[14:25:07] thinking about this db design again now, seeing how it is performing in the wild, I'd probably change 1 thing about it :P
[14:25:24] I hope we can actually use it without having to change that 1 thing :)
[14:25:25] uh, that is a lot of data, I selected only a range of 100000 and it is taking quite some time
[14:25:36] mysql?
[14:25:40] :-D
[14:25:42] hahaha
[14:25:48] the data?
[14:26:33] I think I would split wbt_item_terms into 3 tables, one per term type, and get rid of the term types table.
[14:26:44] Moving a tiny bit of lifting into the app and away from the db
[14:27:11] would that be meaningful, aren't most items 1 type?
[14:27:21] the type is the term type.
[14:27:28] label, alias, description
[14:28:00] I am not agreeing or disagreeing, I don't have the schema in mind
[14:28:04] :)
[14:32:53] the thing is, with a first thinning, all other optimizations would be way easier
[14:34:03] yup
[14:34:05] I've reduced the batches to 100000
[14:34:09] ack
[14:34:10] ids
[14:35:34] which is around 1M rows * number of tables, assuming a 1:1 ration
[14:35:40] *ratio
[14:39:32] see this: that's me: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&fullscreen&panelId=20&from=1582803562162&to=1582814362162
[14:39:51] we are scratching disk
[14:40:04] many many data reads :D
[14:40:24] data reads == buffer pool miss
[14:41:05] so I was right, there is a cold cache; whether that is the cause or not, we'll see
[14:44:26] looking at when I made the config change earlier there is an increase in the data reads, also indicating the cold cache
[14:44:49] it takes around 1 minute per 100K ids of progress, let me know how far you want to reach
[14:45:08] 8 million would be great
[14:45:21] which funnily is one of the things we want to avoid, wb_terms taking a lot of cache size
[14:45:21] if this is a process I can do / babysit, I'm more than happy to take it off your hands
[14:45:32] not as much the disk it takes
[14:45:56] memory (or lack of it) is what makes things slow, not so much disk by itself
[14:46:03] sure
[14:47:30] P10541
[14:47:34] https://phabricator.wikimedia.org/P10541
[14:47:52] you may have to adapt it to sql.php or whatever
[14:47:58] I am running 2 passes
[14:48:08] but don't go overboard
[14:48:27] let me stop it at 3 million
[14:48:34] okay!
[14:48:39] so you have a milestone
[14:50:14] for some reason, the further down, the more time it takes
[14:50:28] could it be more overall rows connected?
[14:50:51] Or that the lower Q rows are more likely to already be in the cache
[14:51:08] * addshore tries to find the equivalent of mysql.py
[14:51:11] I've stopped pass one
[14:51:25] I think it is sql.sh or sql.py on debug/deploy
[14:51:32] something like that
[14:51:41] I know it exists, I don't know the details
[14:52:04] if after that, db1126 is ok, but the others get overloaded
[14:52:09] it is the cache
[14:52:09] looks like it is sql wikidatawiki --host db1126
[14:52:43] 2nd pass goes around 5x faster
[14:52:55] but make sure you make at least 2 passes over the same data
[14:53:06] okay!
[14:53:10] and that not a lot of time passes between that and the deploy
[14:53:18] as they will get cold again
[14:54:09] I think it is worth spending a bit of time on this, at least as a debugging issue
[14:54:19] yup
[14:54:36] I think I'll warm up 1 million at a time now and then switch the reads on for that segment until I get to 8 million
[14:55:21] I've stopped everything now
[14:55:30] Great, I'll start mine shortly :)
[14:56:18] check also: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&fullscreen&panelId=13&from=1582813795368&to=1582815252224
[14:57:13] that 0.02% could be the reason to tip over resource limits
[14:58:21] ack!
[15:16:37] DBA, Operations: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (jcrespo) I will let @Marostegui put it back to 100% and do the full revert and finishing touches + resolv.
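
For reference, a minimal sketch of the warm-up approach described above: read the term-store rows for a range of item ids in batches of 100000, make at least two passes, and throw the results away so the only effect is pulling (and decompressing) the pages into the InnoDB buffer pool. The actual script was P10541, which is not reproduced here; the pymysql client, connection details and join columns (taken from the term-store docs page linked earlier) are assumptions for the example.

    import pymysql

    # Assumed connection details; in practice this ran via the mysql client /
    # `sql wikidatawiki --host db1126` wrapper against the main replica.
    conn = pymysql.connect(host="db1126.eqiad.wmnet", database="wikidatawiki",
                           read_default_file="~/.my.cnf")

    # Join the tables the read path uses; wbt_type is only a handful of rows,
    # so it needs no dedicated warming.
    WARM_SQL = """
    SELECT wbit_id, wbtl_id, wbxl_id, wbx_text
    FROM wbt_item_terms
    JOIN wbt_term_in_lang ON wbtl_id = wbit_term_in_lang_id
    JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
    JOIN wbt_text ON wbx_id = wbxl_text_id
    WHERE wbit_item_id BETWEEN %s AND %s
    """

    def warm(start_id, end_id, batch=100_000, passes=2):
        """Read every row for item ids in [start_id, end_id], in batches,
        and discard the results; the point is only to warm the buffer pool."""
        with conn.cursor() as cur:
            for _ in range(passes):          # the second pass should run much faster
                lo = start_id
                while lo <= end_id:
                    hi = min(lo + batch - 1, end_id)
                    cur.execute(WARM_SQL, (lo, hi))
                    cur.fetchall()           # fetch and throw the data away
                    lo = hi + 1

    warm(2_000_000, 8_000_000)
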
[15:35:54] ran through to 4 million once, now running a second time, then will start reading up to 4 million, and then repeat etc
[15:40:31] jynus: so I saw a similar increased-processes-etc pattern during the cache warm for 2-4 million, https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1582814597914&to=1582817769859&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&fullscreen&panelId=37
[15:41:10] I wonder if I will also see that on the second pass. Right now the second pass seems to be going at a more consistent and slightly faster speed, which is encouraging
[15:55:13] generally looking pretty good right now, nearly at the end of the second pass over 2-4 and no spiky connections etc and fewer data reads from disk!
[15:57:48] well, the real test will be the deployment
[15:57:57] this is just 1 or 2 connections
[15:59:30] yup
[15:59:46] It's going to take a little while to warm the cache up to 8 million, but I think I'll fit it in today
[16:15:47] I temporarily added the "wikidata events" annotations to the main sql dashboard too, so that I can overlay them on the main replica. I fully intend to remove them once this migration pain is over
[16:16:59] main sql?
[16:17:05] you mean the one called mysql?
[16:17:12] https://grafana.wikimedia.org/d/000000273/mysql
[16:17:13] yup
[16:17:15] np
[16:17:35] adding period annotations for the cache warms as well as the deployments
[16:17:37] which channel do you use?
[16:17:43] as in, source?
[16:17:49] we may keep it but make it more general
[16:17:56] source is grafana with the wikidata tag currently
[16:18:11] ultimately the annotations are being created on https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms
[16:18:16] ok, not important now
[16:18:31] as in, what we will do with those in the future
[16:19:22] curious question, if this cache warming seems to be important, what happens in the case of a DC failover?
[16:19:50] well, first, there is actually a warmup when we do a db failover
[16:20:14] but normally replication takes care of it automatically when in normal usage
[16:20:30] ack
[16:20:34] plus we will not have the wbterms table taking 1TB of data
[16:20:37] :D
[16:41:48] changing time for "|pv > /dev/null" at the end is a bit more useful and less boring
[16:51:37] So I think there is one underlying reason why this may be happening
[16:51:44] which caching will help
[16:52:30] there is no io limitation, the main issue is cpu/load exhaustion
[16:53:12] this was completed this week: https://phabricator.wikimedia.org/T232446
[16:53:43] which increases cpu usage but saves space (because otherwise, we would run out of space on wikidata dbs)
[16:54:03] that is, at least, a contributing factor, I believe
[16:54:16] compression is slightly less performant and more cpu intensive
[16:54:33] but it was either that or running out of disk space
[16:54:45] so, we just need more machines
[16:55:18] which is ok, better than having to throw away the existing ones
[17:00:53] the good news is that it is super easy to test - put one of the hosts with uncompressed tables as the main load
[17:21:14] so, at 6 million now, which is where things initially got hairy
[17:25:06] my best guess is to monitor the first line at https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1126&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=1582813491235&to=1582824291235
[17:32:53] Uncommitted dbctl configuration changes - check dbctl config diff is alerting
[17:33:06] but not cumin1001, because there are no pending changes on either
[17:35:39] addshore: I think I am now fairly sure it is increased cpu usage due to compression
[17:35:57] because it happened much worse on db1126 than on other hosts, which are not yet compressed
[17:36:06] gotcha
[17:36:17] good news: nothing you can do about it
[17:36:22] bad news: nothing you can do about it
[17:36:24] yes, I didn't really see any issue on any of the other hosts :P
[17:36:26] xD
[17:36:28] the solution is to buy more machines
[17:36:36] but it is not really correlating with the deploy
[17:36:42] well, it is
[17:36:54] but indirectly - "the new tables are now compressed"
[17:37:09] I remember thinking of a crazy idea last night, like, why don't we replicate the wikidata terms onto all shards, mwhahahahaaaa
[17:37:12] db1087 will help
[17:37:20] and then we will buy more
[17:37:36] the issue is we are compressing in the first place
[17:37:39] due to wbterms
[17:37:45] or we would run out of space
[17:37:56] so it all comes back to the same original issue
[17:37:59] Are there any metrics for the # of queries to s8 that come from sites other than wikidata.org?
[17:38:08] I wonder what the split actually is in demand
[17:38:38] not really, but the data is possible to get
[17:38:44] it is for the errors
[17:38:50] So is the compression probably only a CPU hit when reading from disk? and when in memory it's less of an issue?
[17:38:56] but we don't log normal queries because it would be impossible
[17:39:06] not disk
[17:39:12] I think the caching will help
[17:39:20] because it decompresses the row into memory
[17:39:27] gotcha
[17:39:38] but it has to read the disk compressed (but see, iops is not crazy)
[17:39:46] I've had quite the DB journey over the past week
[17:39:48] however, compare cpu with enwiki
[17:40:22] the top enwiki host is db1118
[17:40:46] * addshore looks
[17:40:46] which is at 35% utilization
[17:40:57] vs db1126, at 60-80%
[17:41:12] load 6 vs load 20
[17:41:21] ah yes
[17:41:39] the hw is not optimized for the kind of wikidata workloads + compression format
[17:41:46] but also not for the size of the db :-D
[17:41:56] so we have to compromise on something
[17:42:03] https://usercontent.irccloud-cdn.com/file/7LwsGaAG/image.png
[17:42:05] more machines + compression should be
[17:42:09] ^^ is that the point of compression? :P
[17:42:24] I can check
[17:42:41] I'm looking forward to the db being smaller
[17:42:45] Compress db1126 - T232446 Tue, Feb 4, 08:15
[17:42:46] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[17:42:57] finished around the 7th
[17:43:04] yup, that lines up then
[17:43:23] to be fair, compression on enwiki doesn't create so much load
[17:43:35] so it was something that we couldn't anticipate
[17:43:47] software also contributed - it changed the kind of workload
[17:44:03] for something "different", not necessarily bad
[17:44:25] so, new host tomorrow? With that in mind, and the fact that we are now stable at 6 million, I might leave it here
[17:44:33] for now
[17:44:38] new host, monitoring cpu saturation
[17:44:48] and purchase more machines
[17:44:53] and also hopefully
[17:45:00] Still need to try to turn writing back on for the migration
[17:45:06] after we are on the new system
[17:45:10] also the requirements will lower
[17:45:34] because writing to both also has a cost on all hosts
[17:45:43] (compression to write)
[17:46:03] the thing is, once we have pin-pointed the underlying issue
[17:46:11] there are many possibilities
[17:46:39] sometimes it's hw, sometimes it's sw, and I would like to be on top of everything, but it is difficult
[17:46:59] also we have to take some shortcuts - e.g. more testing on compression
[17:47:27] but it was the combination of compression + new workload, which is difficult to foresee
[17:48:00] having high cpu load is also good news because until now it was almost underutilized
[17:48:24] is 6M ok?
[17:48:59] we may not need warmup if we distribute cpu resources better
[17:49:08] we'll see
[17:49:31] yeah, 6m seems good right now, even with cache warming of 6-8m running in the background
[17:50:34] note it is lots of things: traffic + writing to both + batch writing, some of those will eventually go away
[17:50:41] yup
[17:50:42] plus natural growth
[17:50:53] we grow infrastructure in full servers
[17:50:56] I'm going to try turning the batch writing back on now, batch size of 100, and see how it goes
[17:51:18] and it is easy to not notice growing resource usage
[17:52:12] + it happened during reduced availability due to db1087 being under maintenance
[17:52:43] just started the batch writes again
[17:53:09] it is always a bit of juggling :-D
[17:53:49] we can also improve monitoring by setting an alarm for dbs at cpu >80 or 85
[17:53:57] or a certain cpu load
[17:54:29] I see db1126 going over 80 sometimes currently
[17:54:55] well, it is around that area when it starts to cause latency impact
[17:55:18] but it is not exactly 80 or 85
[17:55:27] it's somewhere beyond that
[17:56:47] it is also the first time I think we have been able to saturate the cpu with a db
[17:56:54] :D
[17:57:05] normally io or connections come first
[17:57:28] yay utilizing resources!! :P
[17:57:31] but it makes sense if we still have a 99.99 hit ration
[17:57:55] *ratio but we have to decompress data with a lot of throughput
[18:00:56] having said that, optimization of the code didn't hurt
[18:01:11] and probably there are many places to work on that side, too :-D
[18:01:45] I think manuel told me that before compression, wb_terms was at 2.5TB
[18:08:23] I wrote up a little comment from today https://phabricator.wikimedia.org/T219123#5924185
[18:08:53] 2.5TB oh noes, how much disk is even left?
[18:09:04] I hope these new tables are much smaller in comparison
[18:11:18] yep, we didn't choose compression, we were forced
[18:11:24] yupp
[18:11:36] we knew there was a latency toll
[18:11:45] any idea when the new host would be pooled tomorrow? I guess I'll need to cache warm that one too?
[18:12:07] It's a shame this migration overran, otherwise perhaps you wouldn't have needed to
[18:13:12] you have to work with what you can :-D
[18:13:33] if things are stable, I will leave for the day
[18:13:49] yup!
[18:13:59] I'll also be leaving in ~ 1 hour
[18:14:07] we will repool db1087 tomorrow, see if it improves things
[18:14:16] but that means we will have no redundancy
[18:14:24] if anything bad seems to happen with s8 or that replica, step 1 would be rollback to 2 million
[18:14:37] I'll try to keep an eye out, but you guys get pages and I don't ;)
[18:14:54] cache will help a bit by pre-decompressing rows and putting them into memory
[18:15:08] it was not worthless
[18:18:03] DBA, Core Platform Team, GlobalUsage, StructuredDataOnCommons: Normalize globalimagelinks table - https://phabricator.wikimedia.org/T241053 (eprodromou) @CCicalese_WMF and I agree that this makes sense for a future initiative.
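
For reference on the buffer pool figures mentioned above ("data reads == buffer pool miss", the 99.99 hit ratio): a small illustrative Python sketch that computes the InnoDB buffer pool hit ratio from the standard global status counters, which is roughly what the linked Grafana panels plot. The host name, credentials file and pymysql client are assumptions for the example.

    import pymysql

    def buffer_pool_hit_ratio(host="db1126.eqiad.wmnet"):
        """Compute the InnoDB buffer pool hit ratio from global status counters.

        Innodb_buffer_pool_read_requests counts logical page reads; the subset
        that had to go to disk is Innodb_buffer_pool_reads (the "data reads" /
        misses that the warm-up passes were trying to drive down).
        """
        conn = pymysql.connect(host=host, read_default_file="~/.my.cnf")
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('Innodb_buffer_pool_read_requests', 'Innodb_buffer_pool_reads')"
            )
            status = {name: int(value) for name, value in cur.fetchall()}
        logical_reads = status["Innodb_buffer_pool_read_requests"]
        misses = status["Innodb_buffer_pool_reads"]
        return 1 - misses / logical_reads if logical_reads else 1.0

    print(f"hit ratio: {buffer_pool_hit_ratio():.4%}")
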