[07:48:33] morning
[07:49:54] hi
[07:54:34] I saw the crashed and killed DBs... anything urgent to do on them?
[07:54:55] could you give a general look to both hosts
[07:55:21] sure
[07:55:22] I only did the most obvious things, but examining the error log, replication issues, etc. would be welcome
[07:55:45] then close once there is nothing more obvious to do
[07:56:01] in general, try to see a cause, e.g. OOM
[07:56:07] or if they need an upgrade, etc.
[07:58:02] ok
[09:06:31] do we still have the 5-minute tendril stats from yesterday somewhere? Or are they already aggregated?
[09:08:11] <_joe_> jynus, volans I am going to run the updateCollation.php script in beta in a few
[09:08:20] <_joe_> is that ok with you? I'll start with some small wiki
[09:09:59] in beta, no issue
[09:10:17] I assume the index has already been created there automatically
[09:10:35] but the schema change on production is still ongoing
[09:11:03] <_joe_> jynus: I know, but I want to test the procedure in beta for once :P
[09:11:05] there were *lots* of category links on commons
[09:11:51] so many that updateCollation may take weeks to run
[09:13:24] <_joe_> let's hope not
[09:14:00] it may not need to run there
[09:14:21] only on wikis with special collations (i.e. non-English)
[09:22:37] <_joe_> jynus: running it in beta, it does about 1K rows/sec
[09:25:51] there are 200 million of those on commons
[09:45:10] volans, re: db1034 can you check the role of that server, whether some config should be changed, etc?
[09:45:24] maybe check some queries running there
[09:45:53] already checked all queries from the last 24h, nothing special in the slow queries before the kill, only after
[09:46:19] I'm looking at the memory config, it's the watchlist/recent/etc... host of s7
[09:47:38] mmm, interesting, maybe check the partitioning
[09:47:45] of logging and user
[09:54:41] I've checked all the db10* hosts and, apart from db1011 (tendril) and sanitarium which have swap full as always, only db1044, 47 and 10 are low on swap. I might take a look at those too later
[09:55:16] what do you mean by low on swap?
[09:55:27] low on free swap
[09:55:34] ah, ok
[09:55:44] 47 is analytics, expected too
[09:55:58] actually sanitarium only since May 20th morning
[09:56:58] that could be the schema changes + recent reboot
[10:00:22] so only db1044 to check
[10:10:06] <_joe_> jynus: I was thinking we might want to test the updateCollation script on some large prod wiki where the schema change has been applied
[10:10:21] <_joe_> so that we can resolve https://phabricator.wikimedia.org/T58041 or go back to fixing things
[10:10:35] <_joe_> jynus: which one would you suggest?
[10:10:59] in theory, it has been applied to s1, s2, s4, s5 and s6
[10:11:16] but I have not yet verified that it has been applied to all servers and all wikis there
[10:12:02] <_joe_> it would be useful to pick one and verify that that is the case (I can do it, btw, no need for additional work for you)
[10:12:13] <_joe_> one == one wiki
[10:12:49] you have the dblists on mediawiki-config
[10:12:56] <_joe_> yes
[10:13:05] you can use the mysql_shard grain from neodymium ;)
[10:13:54] s2 is what I usually choose for starting changes
[10:14:10] <_joe_> any wiki is ok? I would try frwiki since it was reported to be incredibly slow
[10:14:11] not tiny wikis, not large ones, none of the special ones
[10:14:14] <_joe_> ok
[10:14:23] <_joe_> so s2 instead... itwiki IIRC is there
[10:14:27] (special like commons or wikidatawiki)
[10:14:37] <_joe_> it's of comparable size to frwiki, too
[10:15:11] <_joe_> or well, something smaller first
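
For reference, the updateCollation.php run being discussed is a stock MediaWiki maintenance script, normally started from the maintenance host with mwscript. A minimal sketch of such an invocation, assuming itwiki purely as an illustrative s2 wiki (the log never settles on one) and a screen session as mentioned later at 10:56:

    # Hypothetical invocation; itwiki is only an example wiki, not the one actually chosen.
    # Running inside screen keeps the long job alive if the SSH session drops.
    screen -S updateCollation
    mwscript updateCollation.php --wiki=itwiki
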
[10:15:43] mysql -e "SHOW CREATE TABLE categorylinks\G" | grep would be enough
[10:16:17] _joe_: sudo salt -G 'mysql_shard:s2' cmd.run 'mysql --defaults-file=/root/.my.cnf --batch --skip-column-names -e "SHOW CREATE TABLE foo.bar\G" | grep baz'
[10:16:32] <_joe_> volans: thanks
[10:16:33] that is ok
[10:16:35] with appropriate foo,bar,baz :D
[10:17:38] after you get the output from the matching ones it takes 15 seconds to give you back the prompt, but that's salt with -G or -C ;)
[10:18:44] <_joe_> volans: don't I know
[10:24:33] I think db2010 crashed in the middle of the installation
[10:24:55] any error on the console?
[10:25:17] just garbage, but ssh from the installer worked
[10:25:45] I did not retrieve the logs, just restarted the installation; I'll only debug if it fails again
[10:26:25] ok
[10:34:17] it seems to be going faster than before
[10:56:03] <_joe_> jynus: if you need to stop the updateCollation script by any chance, it's in a screen on terbium under my user
[10:56:19] it is ok
[11:18:10] copying data back to db2010
[11:49:45] db2010 is now using SSL and GTID, and will soon be in sync with its master
[11:49:55] \o/
[12:11:28] I am rethinking --skip-slave-start: it is very useful (assuming GTID is there) on server crash, given our lack of perfect redundancy
[12:12:01] especially for non-critical/small services
[12:14:52] this might also mean having a server back in prod with issues?
[12:15:30] yes, which is what we have now, but lagged
[12:16:06] ideally we would want to avoid that, but the reality is that we do not have enough redundancy yet
[12:27:15] it depends on the crash, mysql might not start automatically anyway
[12:28:01] actually, I would be ok with that; the problem is that mysql does restart
[12:33:16] yeah, but that's under our control: start/no-start, skip-slave/no-skip-slave :)
[12:33:33] no, it is actually not
[12:33:45] (in the current state)
[12:33:57] on crash, mysqld_safe restarts the server
[12:34:41] which is ok, assuming only InnoDB is used and GTID is enabled
[12:35:03] or MyISAM is used and there is not much we can do about it (labs)
[12:35:58] I am not saying I like that, but in the current state of things, maybe it is better to keep starting the server and the replication manually
[12:36:42] until we have better redundancy, better pooling mechanisms, better mediawiki support, and more stuff to fix things properly
[12:36:48] *staff
[15:23:33] jynus: would you see any problem with setting thread_pool_max_threads to 2000 on slaves too? (where we didn't lower the thread_pool_stall_limit)
[15:28:53] well, do you think it will help?
[15:30:32] I have nothing against it, I just want to know why you want to change it
[15:30:49] right now it's just not set and has the default of 500, which I think could be low with a max_connections of 5000, but no, I don't have a strong argument for it
[15:31:32] it's the only difference left between the production and production-es my.cnf, and I was thinking about why they should be different
[15:31:45] so you have a strong argument against not setting it: to uniformize, and the fact that percona has a default value of 10000?
[15:32:03] or better, on what basis they should be different
[15:32:26] the percona one is 100k but their implementation is different IIRC
[15:32:33] I am cool with that, just enable it slowly
[15:33:05] as in, test today on a single "large load" server, then start extending it
[15:33:21] check for differences in connection errors, etc.
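
A minimal sketch of how that rollout could look on a single replica, based on the conversation above; the host is the db1072 mentioned later, the value is the proposed 2000, and the SET GLOBAL form assumes MariaDB, where thread_pool_max_threads can be changed at runtime:

    # On one "large load" replica first (db1072 is the host used later in the log):
    mysql --defaults-file=/root/.my.cnf -e "SET GLOBAL thread_pool_max_threads = 2000;"
    # Confirm the running value, then watch connection errors before extending it:
    mysql --defaults-file=/root/.my.cnf -e "SHOW GLOBAL VARIABLES LIKE 'thread_pool%';"
    # To survive a restart it would also need to go into the production my.cnf template:
    #   [mysqld]
    #   thread_pool_max_threads = 2000
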
[15:33:52] ok, btw we got a high error rate on db1071 just around the first SWAT merge of today....
[15:34:22] (it could well be unrelated)
[15:34:43] s5 it seems
[15:36:30] an insert and update peak; db1071 has the highest load on s5, but I don't see a peak in selects or connections
[15:37:05] if you filter with "-labswiki and -rpc", it tells a different story
[15:38:36] I suppose the updateCollation script could be ongoing
[15:39:17] yep
[17:16:30] we replied almost at the same time to T71222 :)
[17:16:31] T71222: list=logevents slow for users with last log action long time ago - https://phabricator.wikimedia.org/T71222
[17:18:03] yeah, you with more useful information
[17:18:35] it all comes down to T132416
[17:18:35] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[17:18:42] plus some mariadb bugs
[17:18:46] yeah!
[17:18:57] I still don't get why mariadb gets confused and doesn't use the right one
[17:19:02] but those differences, in some cases, are useful
[17:19:20] format=json might be useful here ;)
[17:19:33] it is just a mess to debug each individual case now
[17:19:54] anomie made a calculation and there are 2 million possible parse trees for API queries
[17:20:04] not 2 million different queries
[17:20:15] 2 million different digests
[17:20:43] and if you vary the version slightly, some start working and others stop
[17:21:04] I think the main fix is to approach it systematically
[17:21:19] yeah, with so many options it's impossible to optimize them all
[17:21:43] yeah, but if we knew cases 1000-2000 are bad, we could just disable them
[17:22:05] also, we have a lot of variability
[17:22:18] think of something like "pages that use a particular template"
[17:22:28] the typical case is 1-100
[17:22:52] yes, the digest is not enough, the same digest could use different indexes based on the data
[17:22:57] but there are templates like {{cc-by-sa-4.0}} that may be used 1 million times on commons
[17:23:36] it is an interesting challenge, and if I had the time, I would like to work on it
[17:23:46] but it is optimization vs. keeping things up
[17:24:11] do you know how "easy" it is to generate the digest from the code? if there is a way
[17:24:26] yes, anomie created an algorithm
[17:24:39] let me search for the ticket
[17:25:01] or you mean, in general?
[17:25:17] that's good, we could automate the EXPLAIN of them, save them in a tool and run that test with different indexes and/or when queries change in the code
[17:25:27] that was the plan :-)
[17:25:52] :-)
[17:26:10] on db1072, all good with thread_pool_max_threads; do you prefer that I revert the change for the night?
[17:26:19] no
[17:26:27] ok
[17:26:31] updating the ticket
[17:26:43] precisely, better to detect it now than when 100% of the resources are like that
[17:27:51] the problem is also that some servers work well for API requests, others for recentchanges/logging
[17:28:05] I do not know what we do for recentchanges requests on the API :-)
[17:29:52] volans, https://phabricator.wikimedia.org/T101502#1471866
[17:30:44] ok, thanks
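
On the idea of automating the EXPLAINs across the cluster (and the format=json hint above), a minimal sketch reusing the salt -G pattern shown earlier in the log; the shard, schema and query below are placeholders for whatever API query is under investigation, not the actual queries behind T71222/T132416:

    # Placeholder query: substitute the real API query being debugged.
    QUERY='EXPLAIN FORMAT=JSON SELECT log_id FROM enwiki.logging WHERE log_user = 1 ORDER BY log_timestamp DESC LIMIT 50'
    sudo salt -G 'mysql_shard:s1' cmd.run \
      "mysql --defaults-file=/root/.my.cnf --batch -e \"$QUERY\""
    # Diffing the chosen index per host is what would expose drifts like T132416.
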
[20:12:34] jynus: labsdb1008 just went down... I'm taking a look
[20:12:49] mmmm
[20:13:04] no ping from bast1001, mgmt console works
[20:13:08] I cannot ssh
[20:13:22] network down?
[20:13:37] or just crashed
[20:14:09] good question, I'm checking the HP iLO commands... I'm not familiar with them
[20:16:58] jynus: if it's the network, it's the only one in that rack; unless you suggest anything else to try, I guess a power cycle is the only option atm
[20:17:08] from mgmt: power: server power is currently: On
[20:17:27] salt doesn't respond either
[20:17:40] what do you get from the serial console?
[20:18:29] actually the login screen... mmmh
[20:18:46] and it responds... I mean, if I type root it asks me for the password
[20:19:33] could be network then... do you want me to ask Chris if he can take a quick look?
[20:20:06] soft reboot it
[20:20:15] volans: I think that was me... I killed it on the switch
[20:20:21] ah!
[20:20:23] ok then
[20:20:27] that is good news
[20:20:27] I am updating for the DBs... fixing now
[20:20:39] no problem, if you do that
[20:20:45] you are welcome!
[20:20:48] ahhhh good to know ...
[20:20:57] not in production yet
[20:21:19] others would have been more complex
[20:21:34] and it being a new one, I was worried we had a hardware issue
[20:21:36] so that is one of the best ones to disconnect :-)
[20:21:41] same
[20:22:03] look at es2019, we had a couple of issues
[20:23:12] cmjohnson1: I'm in, seems all fine, replication should auto-restart in a few seconds
[20:23:44] volans: should be back up
[20:23:54] thanks!
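
The kind of check behind "replication should auto-restart in a few seconds" would be a quick SHOW SLAVE STATUS once the switch port is back; a minimal sketch, with the grep fields chosen here only as an illustration:

    # On labsdb1008: confirm both replication threads are running and the lag is draining.
    mysql --defaults-file=/root/.my.cnf -e "SHOW SLAVE STATUS\G" \
      | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_.*Error'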