[01:14:48] +1'd T105843#1479238
[03:38:42] dbstore1002 is quiet; quickly doing the upgrade
[12:56:27] jynus: what was the question about static index stats?
[12:56:52] also, no hurry. will be around for an hour or so
[12:58:13] Oh, I happen to be around
[12:58:35] it is just that I was having lunch, but someone pinged me
[12:58:52] :)
[12:59:01] (not you, I mean)
[13:00:24] this is not critical: any input on table statistics handling for T106682?
[13:00:37] just reading that now
[13:00:50] I think you mentioned regenerating them on dbstore then copying to other hosts?
[13:01:21] I want to avoid STRAIGHT_JOIN / FORCE if possible
[13:01:47] the ticket itself is not so frequent or critical
[13:02:16] ah, more STRAIGHT_JOIN / FORCE edge cases. not surprised at all that SpecialWhatLinksHere::showIndirectLinks causes trouble again
[13:02:43] yes, it only happens on templates with many transclusions
[13:02:46] so, I tried using engine-independent index stats, mainly on enwiki
[13:03:00] successfully?
[13:03:12] they do help eliminate edge cases, but where STRAIGHT_JOIN / FORCE remain in some queries, they are prevented from taking effect
[13:03:27] so, I doubt they will help here
[13:04:20] so, you'd recommend S_J, then (with all the time we have left) investigate further
[13:04:31] possibly the query rewriting code in 10.0.16 might help in the interim
[13:04:37] * springle finds example
[13:04:52] oh, I did not know about that
[13:05:11] that would be very helpful for another query issue with FORCE
[13:05:36] is the plugin and backend support backported?
[13:05:38] it's my hack. not pretty, but handy when FORCE or S_J breaks just one or two wikis
[13:06:08] ok, just knowing that, you have already helped me a lot
[13:06:23] see db1065 /srv/sqldata/rewrite.rules
[13:06:35] each rule is four lines
[13:06:47] there is actually another more imminent issue
[13:06:49] two regex matches (first must be the mediawiki comment string)
[13:07:00] that I want to ask you about
[13:07:03] one pattern to rewrite, and one replacement
[13:07:05] ok
[13:07:26] db1035, I f* it up and restarted it on saturday
[13:07:51] no user issue, load balancing worked and all
[13:07:52] always a fun way to spend your weekend :)
[13:08:09] I thought I repooled it too soon
[13:08:22] but after days, jobs fail
[13:08:32] I depooled it again and opened every single table
[13:08:44] for statistics/table descriptors
[13:09:11] but even depooled it is having write issues (lag) with replication
[13:09:37] because it is s3, I assume too-many-objects issues
[13:10:31] but this is even with innodb_flush_log_at_trx_commit=0 and no binlog sync
[13:10:37] what went wrong on the weekend?
[13:10:55] you mean why I restarted it?
[13:11:06] yes
[13:11:08] or the errors I am getting?
[13:11:11] both
[13:11:11] my mistake
[13:11:16] just for context
[13:11:24] (mistakes happen plenty)
[13:11:30] no reason, I restarted the wrong server
[13:12:34] even with reduced load (400 vs 40), it is the top host failing right now
[13:13:10] all from /rpc/RunJobs.php
[13:13:23] user/api connections are fine
[13:13:56] so only jobrunner clients?
[13:14:10] yes
[13:14:36] even at top load, no user connection failed
[13:14:54] very strange
[13:14:55] so maybe configuration-related, not warmup-related
[13:15:02] that helps me
[13:16:11] will tendril have a history of variables?
[13:16:32] maybe something was changed dynamically
[13:18:22] run this from db1011 to compare variables http://aerosuidae.net/paste/2f115/55b62d46
[13:18:33] as for variable history... hmmm, thinking
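The paste linked above is not reproduced in the log; a minimal stand-in for that kind of variable comparison, assuming MariaDB 10.0 and nothing about what the paste actually contains, is to dump the global variables of db1035 and db1044 in a sorted, diff-friendly form and compare the two outputs:

    -- Run on each host (e.g. from db1011 via the mysql client) and diff the results.
    -- Differences in server_id, log file names and the buffer-pool dump/load settings
    -- are expected; anything else is a candidate explanation for the jobrunner errors.
    SELECT VARIABLE_NAME, VARIABLE_VALUE
    FROM information_schema.GLOBAL_VARIABLES
    ORDER BY VARIABLE_NAME;

Diffing the two dumps is enough to spot a setting that was changed dynamically, which is the scenario raised at 13:16.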
[13:19:07] oh, I changed all variables to those of 1044
[13:19:12] first thing I did
[13:19:17] ah ok
[13:19:23] what needed changing?
[13:19:34] it is in the logs up here
[13:19:39] flush tables
[13:19:43] sync binlog
[13:20:16] and connect_timeout
[13:20:36] the strange thing is that when only 1044 is pooled, no errors
[13:20:44] (even if it has way more load)
[13:21:19] strange that connect_timeout is 5 on db1044 too, but probably unrelated
[13:21:36] since db1035 still shows errors with =5
[13:22:32] yes
[13:22:52] and I can confirm that the only relevant differences (apart from server id and log names)
[13:23:09] are the dump_buffer_pool_at_shutdown/start settings
[13:26:21] to be fair, I did not expect you to come up with a solution right away
[13:26:34] just maybe if it had happened to you in the past
[13:29:59] consistently higher %sys cpu on db1035 from before the restart http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=cpu_system&mreg[]=%5Ecpu_system%24&hreg[]=%5Edb10%2835%7C44%29&aggregate=1&hl=db1044.eqiad.wmnet|MySQL%20eqiad,db1035.eqiad.wmnet|MySQL%20eqiad
[13:32:24] dunno yet what that means, if anything. %iowait looks odd, but maybe spiky just due to reloading data or s3's many tables? http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=cpu_wio&mreg[]=%5Ecpu_wio%24&hreg[]=%5Edb10%2835%7C44%29&aggregate=1&hl=db1044.eqiad.wmnet|MySQL%20eqiad,db1035.eqiad.wmnet|MySQL%20eqiad
[13:32:47] I do not see those, low cpu when depooled, unless you are pointing me to something else
[13:33:45] So, it is summarized on the SAL, but the sequence of events is:
[13:34:24] restart - stop slave - depool - start slave - repool low - repool high - depool today - repool low today
[13:35:15] spikes on depool due to the start slave + lag
[13:35:33] and recently due to me forcing innodb stats by opening all tables
[13:36:50] was merely observing that db1035 shows different load to db1044 from before the problem, which might be a useful data point for figuring out why it cannot get back up to speed
[13:36:53] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=cpu_system&mreg[]=^cpu_system%24&hreg[]=^db10%2835|44%29&aggregate=1&hl=db1044.eqiad.wmnet|MySQL+eqiad%2Cdb1035.eqiad.wmnet|MySQL+eqiad
[13:37:03] oh, you meant before
[13:37:11] just gathering information, not saying a cause yet
[13:37:18] yes, of course
[13:37:49] but since the restart, lag spikes are way worse on db1035 than db1044, even while depooled
[13:38:08] that pointed me to write issues
[13:38:12] I am unsure if your forcing stats and opening all tables could result in different %iowait for an entire day http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=cpu_wio&mreg[]=^cpu_wio%24&hreg[]=^db10%2835|44%29&aggregate=1&hl=db1044.eqiad.wmnet|MySQL+eqiad%2Cdb1035.eqiad.wmnet|MySQL+eqiad
[13:38:17] yes
[13:39:05] that is me depooling db1035
[13:39:47] getting a bit higher in the end after repooling with 10% less load
[13:40:48] and those spikes, yes, I think they are my processes (I started them twice)
[13:41:10] the question is that the errors are connections failing
[13:41:24] not queries being killed because of too much lag
[13:41:36] (what I thought at first)
[13:41:58] so maybe something pool- or permissions-related
[13:42:23] look at tendril, connections failing
[13:43:54] it could even be network, but that would be weirder: only one class of connections would fail?
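A hedged sketch of server-side checks that can narrow down "connections failing" for just one client class (the jobrunners hitting /rpc/RunJobs.php), assuming MariaDB 10.0; none of these commands appear in the log itself:

    -- Counters for failed or interrupted connections: if Aborted_connects climbs
    -- while only jobrunner errors are reported, the failures are at connect time.
    SHOW GLOBAL STATUS LIKE 'Aborted_c%';           -- Aborted_clients, Aborted_connects
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW GLOBAL STATUS LIKE 'Max_used_connections';

    -- Limits that can reject one class of clients while others keep working:
    SHOW GLOBAL VARIABLES LIKE 'max_connections';
    SHOW GLOBAL VARIABLES LIKE 'max_connect_errors';
    SHOW GLOBAL VARIABLES LIKE 'max_user_connections';

    -- If a jobrunner host had accumulated too many failed connection attempts it
    -- would be blocked outright; clearing the host cache rules that out.
    -- FLUSH HOSTS;

If the counters stay flat while the jobrunners keep reporting errors, that points back at the network or the client-side connection handling rather than mysqld itself, which matches the suspicion voiced at 13:43.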
[13:46:39] memory buffers are higher than before / than on 1044
[13:48:39] BTW, the reason I was on 1035 was that the less-than-10%-disk-space alarm went off
[13:49:16] yes, I saw T106847
[13:59:18] I will find it, and when I do, it will be a good thing, because it is happening in a controlled environment
[14:00:58] it could be like dbstore1002, just something related to table statistics
[14:06:57] yeah, this is a strange one. I guess we need to trigger the aborted connects again, set log_warnings=2 just to see if mysqld knows anything useful (probably not), then tcpdump
[14:07:01] * springle shrugs
[14:08:26] I am looking at the relay log, at least to try to see some pattern for the lag
[14:14:42] bah. oom killer just fired on tendril
[14:14:57] I know... needs a slave
[14:28:26] huh, db1044 doesn't have LVM set up
[14:28:32] consistency fail
[14:30:52] ha
[14:31:17] well, all those should be replaced next...
[14:34:45] swap in / swap out: 706 21620
[14:34:53] free
[14:35:14] this is clearly not well dimensioned
[14:35:17] or
[14:35:32] something is making it not well dimensioned
[16:27:51] ok
[16:28:40] I am starting to think that the db1035 errors were amplified by the reboot, but not directly related to it
[16:29:05] I am seeing the patterns on other shards now
[19:46:58] I've created T107072
[19:48:04] I've set db1035 to what I think is the best option possible: pooled, but at weight 40 (vs 400 for db1044)
[19:48:16] let's see how that evolves
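A small sketch of the diagnostic step proposed at 14:06 above (raising log_warnings so mysqld records aborted connections), assuming MariaDB 10.0; the tcpdump half of that plan is not shown:

    -- With log_warnings > 1, mysqld writes "Aborted connection ..." lines to the
    -- error log, including user, host and the reason the connection died.
    SET GLOBAL log_warnings = 2;

    -- ...wait for the /rpc/RunJobs.php errors to recur, inspect the error log,
    -- then restore the default verbosity:
    SET GLOBAL log_warnings = 1;

This is cheap enough to leave enabled while db1035 sits repooled at weight 40, so the failing jobrunner connections should still recur often enough to be captured.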