[08:02:02] jynus: I think we need to update the reference on Tendril for T106303, do you know what to rename, or is it easier to remove and re-add?
[08:02:02] T106303: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303
[08:02:56] drop and recreate will be easier
[08:03:38] ok, I'll take care of it
[08:04:09] tendril-host-disable tendril-host-drop tendril-host-add tendril-host-enable
[08:04:33] T136136#2325624 issues with the master too?
[08:04:33] T136136: db1056 BBU Failed - https://phabricator.wikimedia.org/T136136
[08:56:31] jynus: are the grants generated by mariadb::grants applied automatically by puppet? on labservices1002 the file is there but it looks like it was not applied
[08:57:41] looking at the puppet code I don't see anywhere they are applied, so probably not (on purpose, of course)
[08:58:52] no, they are not
[08:59:16] there is a mess with how grants are handled now, that is the best option for now
[09:00:19] hey, 6 days without tendril crashing, that is an improvement
[09:00:29] ok, then should I apply only the tendril one, to be on the safe side?
[09:00:49] ok with that
[09:09:09] and btw the watchdog one is not even in the grant files :(
[09:09:57] there must be a mismatch - new servers are supposed to use tendril as the user
[09:11:13] mmmh, then it could be the config of my tendril on neodymium, let me check
[09:11:31] I think old ones used watchdog
[09:11:51] yes, it was that; I probably had the old one, it gets the user from the my.cnf
[09:12:46] it is one of those things that I will have to solve at some point, but tendril may disappear first
[09:27:56] for the unification of the production and production-es my.cnf files, the last parameter is max_conn; I suggest adding it as a param on mariadb::config, passing the value directly instead of more mysterious high/low/etc. values. Thoughts?
[09:28:16] what is the value of max conn?
[09:29:06] 5k for s and 10k for es
[09:29:11] s* and es*
[09:32:01] put everything to 10K
[09:32:32] that way we collapse app servers and it is handled by nodepool
[09:33:36] I think some old DBs could go OOM if we really reach that number of conns
[09:33:55] we will get rid of old dbs
[09:34:25] unhandled connections take only like 15K or so
[09:34:58] it is concurrent running connections that can cause that, and we have that limited to 40 or so
[09:35:05] ok, I was thinking of the old ones, but >1050, those that will remain for specific roles
[09:35:28] also, I am setting up a watchdog for that
[09:36:00] ok
[10:41:57] <_joe_> sooo, I have an interesting question for you people. It turns out that 81 of the 94 wikis that will need me to run updateCollation are in s3
[10:42:31] <_joe_> any chance we can run the script on two wikis on the same shard at the same time?
[10:42:51] well, given that there are 850/900 wikis there, it is not surprising
[10:43:13] <_joe_> also, say I want to do a count(*) on categorylinks on a few wikis, where should I do it?
[10:43:21] for small wikis yes
[10:43:48] I would test how much concurrency we can get for medium-sized ones
[10:44:00] do not do a count
[10:44:17] do a show table status, that will be faster for what we need
[10:44:19] <_joe_> what should I do instead?
[10:44:26] <_joe_> heh right
[10:44:51] I send people that do bad queries to dbstore1002
[10:44:58] <_joe_> jynus: any codfw slave would be ok?
[10:45:00] _joe_, go to dbstore1002 :-)
[10:45:09] <_joe_> ok, dbstore it is :P
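The COUNT(*) versus SHOW TABLE STATUS advice above can be illustrated with a minimal sketch (assuming a session on dbstore1002 with the relevant wiki database selected; categorylinks is the table named in the conversation):

    -- Approximate row count: SHOW TABLE STATUS reads the stored statistics and
    -- returns immediately; the Rows column is only an estimate, but that is
    -- enough for sizing an updateCollation run.
    SHOW TABLE STATUS LIKE 'categorylinks';

    -- Exact row count: COUNT(*) scans the whole table and can take a long time
    -- on tables as large as categorylinks, which is why it should not be run
    -- against production slaves.
    SELECT COUNT(*) FROM categorylinks;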
[10:45:39] <_joe_> jynus: btw, neither dewiki nor enwiki nor commons need to be run
[10:46:19] that is good news
[10:46:35] commons category links is one of the largest tables everywhere
[10:47:12] the problem with s3 is that it does not usually have a lot of load
[10:47:20] but sometimes the issue is concurrency
[10:47:34] remember the wikidatawiki bug, the one that created a thread per wiki?
[10:48:09] <_joe_> what's the wikidatawiki bug?
[10:48:22] <_joe_> oh, right, now i remember
[10:48:25] I thought you remembered, it was an example
[10:48:28] the problem with a server with so many wikis is concurrency
[10:48:47] so it can happen, let's just see how much
[10:48:55] <_joe_> yes, i will come up with a plan
[10:49:04] <_joe_> and let's see how it goes :P
[10:49:19] <_joe_> ofc I'll ask you for feedback
[11:00:47] <_joe_> is dbstore a labs host?
[11:00:59] <_joe_> as in, using labs authentication?
[11:04:15] no, it is fully production
[11:04:23] it just has no mediawiki traffic
[11:04:29] <_joe_> yes, just a scp config fart
[11:09:35] BTW, dbstore is good for statistics, but it (still) uses TokuDB, so do not trust it for things like query plans or latencies
[11:10:00] if you need production-like hosts, go to codfw
[11:12:30] _joe_, I am looking at the script, I have things for that semi-prepared for next time
[11:16:18] <_joe_> jynus: so, all the 81 wikis in s3 collectively have more or less the rows of svwiki
[11:16:22] <_joe_> which is in s2
[11:16:28] <_joe_> so s2 >> s3 :P
[11:17:45] that usually means that running them serially would not be an issue
[11:18:12] <_joe_> and s6 >> s2, as frwiki has 33M entries
[11:18:13] do you have an estimate?
[11:18:22] <_joe_> ongoing
[11:18:24] <_joe_> :P
[11:18:32] are you counting?
[11:18:41] <_joe_> we averaged around 1.5M records/hour when I ran my test on ptwiki
[11:18:58] <_joe_> so I'd expect frwiki to take around one day to run
[11:19:54] that is ok, that is what a schema change normally takes
[11:20:03] I am not worried about the change
[11:20:26] I wasn't worried before, they only asked me for help with optimization
[11:20:42] because the script was well written
[11:20:55] (the problem is when it is not)
[13:55:09] lock_wait_timeout scares me (and yes, I know I suggested it)
[13:56:35] but after discovering that some processes take minutes to execute, I wonder if we would break something
[13:57:19] I do not like @processcount at all
[13:57:51] we had lots of issues with small sizes in the past
[13:58:14] what do you mean?
[13:58:18] for processcount
[13:58:27] pool_size
[13:58:38] it already defaults to the number of processors
[13:59:02] no, it's set to 32 on production and to 40 on production-es right now
[13:59:05] and with values of 16 or 32, we had a lot of thread/connection issues
[13:59:22] 16 then
[13:59:47] some 29
[13:59:50] *20
[14:00:19] see https://phabricator.wikimedia.org/rOPUPb4113805e638bfaa0794d93543d0a2b89a6bf24c
[14:00:57] even 32 may be low
[14:01:24] yes, that's why the patch sets the minimum to 32 on machines with few cores and the number of cores on the others; we can increase it
[14:02:13] db1065 saturation
[14:04:24] quite loaded
[14:04:38] ApiQueryContributors::execute as usual
[14:04:49] nothing to see there
[14:05:24] raid/bbu/disks all good
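The lock_wait_timeout worry and the db1065 check above come down to the same question: what is running right now, and for how long? A minimal sketch of that check (assuming direct SQL access to the host; the 60-second threshold is just an illustrative value):

    -- The metadata/table lock wait timeout: a schema change will give up after
    -- waiting this long for a lock held by another session.
    SHOW GLOBAL VARIABLES LIKE 'lock_wait_timeout';

    -- Long-running statements that would hold those locks or saturate the host;
    -- anything running for minutes makes a small lock_wait_timeout risky.
    SELECT id, user, time, state, LEFT(info, 80) AS query
    FROM information_schema.processlist
    WHERE command <> 'Sleep' AND time > 60
    ORDER BY time DESC;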
[16:28:47] so for the my.cnf... what do you think could be a good value for the pool_size?
[16:29:13] thread or buffer?
[16:29:42] thread :)
[16:29:56] it was ambiguous
[16:30:19] trying 40 for the largest servers?
[16:30:53] not sure about the smallest ones, but not too small
[16:31:52] "largest" based on?
[16:32:04] because db1078 for example has 16 cores
[16:33:51] my code was just trying to reproduce the current behaviour, but I might have made the wrong assumption
[17:05:18] the current code sets it to 32 on all hosts with 32 or fewer cores, and to the number of cores otherwise
[17:08:03] "how many concurrent threads can be run at the same time"
[17:08:22] that's the definition, I know :)
[17:08:29] which I know is difficult to say without testing
[17:08:42] hard to give a proper answer
[17:08:43] but smaller values gave us problems in the past
[17:08:57] ^that is what I know
[17:09:58] and not only throughput is important, also connections breaking because of pileups
[17:10:16] in some cases it takes >3 seconds to connect
[17:11:58] right now I would leave it as it is, then put a TODO: Needs tuning
[17:13:42] ok, but given that I don't have the shard on the mariadb::config, I thought this could be a way to get the same result and scale automatically in case of new hardware with lots of cores
[17:14:11] yes, and I agree with the idea
[17:14:52] but experience tells me "this created issues before, danger"
[17:15:18] it wasn't meant to be the final answer for thread_pool_size
[17:15:26] the idea would be to test the 3 or 4 different servers we have and do a throughput test
[17:15:45] sure, but wasn't it going to be 16 for the largest servers?
[17:17:21] if @processcount <= 32 ==> use 32; else use @processcount
[17:17:32] the minimum will be 32
[17:17:47] where is the change? I cannot find it
[17:18:00] also for a hypothetical single-core machine, a bit crazy probably, but we don't have those
[17:18:04] https://gerrit.wikimedia.org/r/#/c/286858/4/templates/mariadb/production.my.cnf.erb
[17:18:10] it's your change...
[17:18:32] I cherry-picked yours and started from there... so git/gerrit kept the same CR you sent :)
[17:18:51] ok
[17:18:57] did it change?
[17:19:21] not in the last few hours
[17:19:41] because I read it as pool = @processcount
[17:19:43] sorry
[17:19:52] did I write that?
[17:19:59] lol
[17:20:05] no I did
[17:20:10] ah, ok
[17:20:17] the whole if/else block
[17:20:19] so I am ok with how it is
[17:20:35] I misread it as thread_pool_size = <%= @processcount %>
[17:20:40] sorry
[17:20:47] didn't see the if
[17:20:51] no problem :)
[17:20:52] +1
[17:21:13] the diff is to blame
[17:21:33] and I told you before I am blind
[17:21:57] I was saying that @processcount just as it is could be too small
[17:22:13] but with the minimum of 32, I am ok
[17:23:07] that's why I was not understanding your worries, but I didn't explain myself well enough :)
[17:23:17] if you have the time, change a couple of comments: the "except external storage" one
[17:23:23] on line 2
[17:23:44] oh yeah I forgot that
[17:23:50] and change "controled by cluster control" to "controlled by orchestration" (TM)
[17:25:58] lol, by Skynet
[17:35:46] updated, gotta go right now, ttyl
[18:09:20] bye
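For reference, a minimal sketch of how the values debated above can be inspected on a running host (assuming a MariaDB server with the thread pool enabled; the numbers in the comments are the ones from the conversation, not taken from the final template):

    -- The thread pool only applies when thread_handling is 'pool-of-threads'.
    SHOW GLOBAL VARIABLES LIKE 'thread_handling';

    -- The setting being tuned: the patch derives it as 32 on hosts with 32 or
    -- fewer cores, and as the number of cores otherwise.
    SHOW GLOBAL VARIABLES LIKE 'thread_pool_size';

    -- The max_conn value proposed earlier for unification (10K everywhere),
    -- compared with how many connections are actually open.
    SHOW GLOBAL VARIABLES LIKE 'max_connections';
    SHOW GLOBAL STATUS LIKE 'Threads_connected';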