[08:05:28] I am going to restart x1-master on eqiad to apply the new cert
[08:06:27] ok, but if codfw doesn't have it
[08:06:30] it will not work
[08:06:43] you have to CHANGE MASTER TO using the old ones for now
[08:07:09] and you need the multiple-ca in puppet
[08:07:18] to allow codfw to replicate from it after the switch
[08:07:48] jynus: ^^^
[08:07:50] no
[08:08:10] I will do it raw, then fix codfw on failover
[08:09:42] what do you mean fix? restart?
[08:10:00] or change master to?
[08:10:20] both
[08:11:04] you need to deploy the puppet cert on the codfw master too then
[08:11:12] to be able to change master to using the new ones
[08:11:46] and you can fix it just after the eqiad restart, without adding steps to the failover
[08:11:57] *switchover
[08:11:59] yes, no hurry
[08:12:29] it is x1-master on eqiad that I will not be able to restart
[08:14:31] put codfw master replication in downtime and stop the slave to avoid pages
[08:14:38] I already did
[08:14:41] :)
[08:20:19] is it expected to have queries running for more than 1h on the vslow servers?
[08:23:50] we kill those queries after 24 hours
[08:24:32] ok
[08:24:42] not us, HHVM
[08:25:02] vslow, surprisingly, is for very slow queries
[08:26:17] yeah, but for a web environment I didn't know how slow is considered too slow to not be normal... :)
[08:26:36] 300 seconds
[08:29:41] for the es1/es2 thread limit bump, do you want a puppet patch or do we do it manually just for the period of the switchover?
[08:29:58] puppet
[08:30:12] ok, on it
[08:30:20] unless we see a regression, it will stay there
[09:07:23] so, here is the thing-
[09:07:34] there are 2 possibilities
[09:08:42] either mysql with no skip-slave-start loads the wrong server certs (not only the certs used for replication, but also those used to authenticate clients)
[09:09:03] or that server has issues
[09:09:21] and from master.info
[09:09:25] it is using the new ones
[09:09:35] but for replication it is ok
[09:09:44] how? db2009 has the old one
[09:09:46] but what was failing was the connection from another server
[09:09:48] should fail
[09:09:59] no, manually connecting with the new ones
[09:10:03] not automatically
[09:10:56] doing exactly the same change master later, it worked
[09:11:00] after the restart
[09:12:02] quite strange
[09:39:40] so, to summarize: eqiad -> codfw new cert, codfw -> eqiad old cert
[09:41:25] ok, we can fix it next week by restarting the codfw master
[09:44:22] yes
[09:44:38] I will not touch es at all
[09:47:04] I have a puppet patch almost ready
[09:47:46] I will prepare one for dns, but that is 0 priority
[09:48:14] but I like to do ssh x1-master[tab] and stuff
[09:48:45] +!
[09:48:47] +1
[09:54:49] * volans loves the mariadb submodule: https://gerrit.wikimedia.org/r/#/c/284662/
[10:18:16] I'm checking with the compiler, not sure that syntax works out of the box
[10:24:36] see my comment in ops, the other parts are ok
[11:18:19] so, I have a better idea: a script where if you run "mysql" it uses a local.my.cnf with socket/no ssl, and if "mysql" has parameters, it uses ssl and the default config
[11:19:15] what if we run salt with many other options?
[11:19:38] salt "host" cmd.run "mysql --batch --... -e 'long query'"
[11:19:38] we have to make sure we disable ssl manually, if needed
[11:19:51] that is why I added it to my scripts
[11:20:01] --defaults-file=...
[11:20:38] basically the idea is to make things easy for newbies
[11:21:16] if they need more complex things, they should know about ssl options, config, etc.
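The wrapper idea floated above (a bare "mysql" using a local config with socket and no TLS, anything with parameters falling back to TLS and the normal defaults) could look roughly like the shell sketch below. The function body, config paths and option names are illustrative assumptions, not the actual aliases or scripts referenced in the conversation.

    # Hypothetical sketch of the discussed wrapper: plain "mysql" connects
    # over the local socket without TLS; any invocation with arguments uses
    # the regular defaults file with TLS enabled. Paths are placeholders.
    mysql() {
        if [ "$#" -eq 0 ]; then
            command mysql --defaults-file=/root/local.my.cnf --skip-ssl
        else
            command mysql --defaults-file=/root/.my.cnf --ssl "$@"
        fi
    }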
[11:22:07] adding the full path and default configuration files is something we should always do for proper scripts
[11:22:35] we can still do the host thing
[11:23:02] but it is complex, because maybe someone has host=localhost in a .my.cnf
[11:23:41] yes
[11:24:01] look, I am the first one to get problems with that (more complex)
[11:24:11] but that is the price to pay for TLS
[11:24:26] and that is why I sudo with my own alias :-)
[11:24:56] that I'm copying :D
[11:24:57] "my" saves me 3 letters + --defaults-file=bla bla
[11:25:15] I also have mysqlbinlog disabling ssl until the bug is fixed
[11:25:22] and with the full path
[11:25:33] feel free to copy, and share if you improve those
[11:25:51] btw about the mysql prompt, sometimes it is useful to have the hostname too in addition to the shard
[11:26:00] just as a general wishlist :)
[11:26:05] it should be there (the local one)
[11:26:14] it cannot be the server one
[11:26:25] because the prompt is a client thing
[11:26:32] or you mean as a script
[11:26:34] ?
[11:27:13] if yes +1
[11:27:33] I would also like a timestamp, but it takes too much space
[11:27:46] the old dba used to have a full line of prompt
[11:29:49] the prompt in the [mysql] section of root/.my.cnf
[11:30:28] yes, but that needs to be scripted
[11:30:44] the host there refers to the client's host otherwise
[11:31:09] going to lunch, will be back for the es changes
[11:31:27] ok
[11:31:37] then also the shard could be mistaken ;)
[11:32:17] yes, it assumes that you only connect to localhost
[11:33:03] under that assumption the hostname will be correct too :D but yeah, it could be confusing
[12:25:16] let's merge the es stuff and apply it?
[12:26:51] yes
[12:29:56] while you are merging, let me review the weights, too
[12:30:05] ok
[12:30:27] merge done, preparing the hot change
[12:32:22] not sure whether to keep the weights or modify them
[12:32:42] I will check the graphs
[12:34:45] sudo salt 'es101*' cmd.run 'mysql --defaults-file=/root/.my.cnf -A -BN -e"SELECT @@global.thread_pool_size"'
[12:34:50] returns 32 for all
[12:35:02] go ahead
[12:35:20] probably it is the other thing that you discovered, anyway
[12:35:35] ?
[12:35:54] the max threads created
[12:37:03] I've noticed something I had not before
[12:37:54] what? es101* done, do we want to do es201* after the switchover?
[12:38:16] the number of threads connecting is way more in es3 than in es2
[12:38:55] to the point that almost no errors were returned for es2
[12:39:11] so it is either key-dependent
[12:39:15] on a single key
[12:39:23] or server-dependent
[12:39:43] how are the keys spread between es2 and es3?
[12:39:55] probably key, because it happened both on master and slave
[12:40:05] it is md5 or some other hash
[12:40:28] but maybe perf is right and it only happened for e.g. the main page
[12:41:43] is the main page visited so much more than all the others? doesn't make too much sense to me
[12:41:57] but it would make no sense for it to be more than one key, because the difference is huge (a spike of less than 2000 vs 80000)
[12:42:43] it doesn't necessarily have to be the main page
[12:42:45] are you multiplying the es2 ones by 2/3?
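For context on the thread checks above, a fleet-wide inspection of the pool settings and of how many threads have actually been created could be done along these lines. This is only a sketch in the same style as the salt command quoted above; the extra names queried (thread_handling, Threads_created) are standard MariaDB variable/status names, but the exact check used at the time is not shown in the log.

    # Sketch: compare the configured pool size with the thread-handling model
    # and the Threads_created counter on the es hosts. The host pattern is
    # reused from the command above; everything else is illustrative.
    sudo salt 'es101*' cmd.run 'mysql --defaults-file=/root/.my.cnf -A -BN -e"SELECT @@global.thread_handling, @@global.thread_pool_size; SHOW GLOBAL STATUS LIKE \"Threads_created\""'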
[12:43:07] maybe a javascript that is loaded on every page, but it is actually wiki content
[12:43:11] or the CSS
[12:43:16] those are wiki pages too
[12:43:17] sorry, I mean 3/2
[12:43:30] the difference is still huge
[12:43:56] even taking into account that it has one server less
[12:44:14] https://tendril.wikimedia.org/host/view/es2017.codfw.wmnet/3306
[12:44:27] vs https://tendril.wikimedia.org/host/view/es2016.codfw.wmnet/3306
[12:44:48] look at threads_connected + aborted clients
[12:45:05] the spike is real, do not get me wrong
[12:45:14] but way larger on es3
[12:45:43] com_select are the same: 2.2M vs 1.5M, the expected 3/2 factor
[12:45:54] it doesn't matter
[12:46:04] selects were not executed if they could not connect
[12:46:20] you have to add the ones that connected to the ones that tried and failed
[12:46:58] what I see is no replication lag (which is what I initially wanted to see)
[12:47:23] so I propose setting the weight equal on all 3 (6) servers
[12:47:59] (temporarily for the switchover) to have equal capacity for connections
[12:48:14] innodb_mutex_spin_rounds spiked only on es3
[12:48:21] not sure if cause or effect
[12:48:29] +1
[12:48:32] yeah, "InnoDB activity"
[12:48:38] but not in the other
[12:48:39] does not say much
[12:49:00] also Innodb_s_lock_os_waits
[12:49:10] it means more selects, in theory more on the same keys due to locking, but it could be many things
[12:49:31] those things are nice when we are fine tuning, and sadly we are not yet there
[12:50:15] but it would confirm the "hot key" scenario
[12:50:18] FYI: https://phabricator.wikimedia.org/T133265#2226987 es201* can wait until after for me, but I can do them now if you want
[12:50:36] which means the memcache thing should help
[12:50:57] great!
[12:51:14] this should help too
[12:51:24] I just think that it is not the root issue
[12:51:50] we can do es2* later
[12:52:04] surely not, the spike is way too much for just that
[12:52:28] it's true that a new connection wants a new thread, but something got locked there
[12:52:33] to justify the surge
[12:52:47] I think perf's analysis was fair
[12:53:06] it is just that I didn't realize this until now
[13:26:13] jynus: I never merged stuff on the DNS repo... what is the procedure?
[13:26:36] I can do that
[13:26:42] it is not time sensitive
[13:26:47] ok, thanks
[13:26:48] but it is on the dns page
[13:27:26] merge, then run a script that merges locally and does checks before reloading the dns servers
[13:27:45] ok, I'll take a look, but better not to do it for the first time today :D
[13:27:52] brb
[13:31:09] when I kill/start heartbeat could you keep an eye on icinga to see if it likes it?
[13:32:40] yes
[13:33:06] I am thinking whether we should change parsercache to read only
[13:33:19] together with the others?
[13:33:26] if they are in read-only during the cache warmup
[13:33:41] could that cause issues?
[13:41:51] after how much time does icinga complain about puppet being stopped?
[13:41:56] *disabled
[13:42:18] 30 minutes or an hour
[13:42:34] that might not be enough... :(
[13:42:46] I think it is a warning
[13:42:48] only
[15:11:59] good news, no es* tweaks needed
[15:12:07] but we will keep those
[15:15:18] jynus: see in -ops
[15:20:32] volans: the new DIMM arrived
[15:20:35] for db1065
[15:21:14] cmjohnson1: great! but too late to shut down now, we just switched back :D
[15:21:32] we can schedule it next week I guess
[15:21:56] okay, let's do that... I only have so many days before I have to send the part back
[15:22:40] cmjohnson1: if it's urgent you can schedule it with jynus, maybe he can do it before me
[15:22:54] sure
[15:22:57] I can do that
[15:23:05] just not today
[15:23:08] it's not incredibly urgent... the sooner the better
[15:23:15] https://phabricator.wikimedia.org/T133250
[15:23:17] let's go for Monday
[15:23:19] is the task ^
[15:23:28] jynus let's do it Tuesday
[15:23:32] if that
[15:23:34] better
[15:23:37] cool
[15:23:44] around this time?
[15:23:47] yes
[15:23:50] great
[15:24:01] tuesday is
[18:01:47] jynus: FYI on tendril there are ~200 QPS on db2070 and db2048...
[18:02:39] the second would be the new backups
[18:02:46] or something
[18:03:04] 2070 I do not know
[18:03:19] ok, I'll take a look; labsdb1001 space instead?
[18:04:08] nah, you already filed the ticket
[21:14:11] I suppose the dbstore disk space warnings are known and low-prio, right?
[21:14:58] yes paravoid, if you want to have a cleaner icinga I'll ack it (non-sticky)
[21:15:47] paravoid, there is not much to do but https://phabricator.wikimedia.org/T131363
[21:16:40] jynus: I've opened a couple of tasks in the Next section of the DBA board that I think need attention tomorrow
[21:17:01] tomorrow is a bit optimistic
[21:17:01] fyi in case you missed them in the spam of me cleaning up a bunch of tasks too :)
[21:17:31] yeah, if they're low-prio and staying around for a while, let's ack them
[21:17:47] they're the last outstanding alerts right now, btw :)
[21:20:12] no, fermium is complaining
[21:20:31] ehehe mail queue
[21:20:35] you call it, paravoid :)
[21:46:11] replag on s2 and s3 in Labs looks pretty bad -- https://tools.wmflabs.org/replag/ -- known issue?
[21:46:37] err, s2 and s4
[21:49:08] bd808: on which server is this page checking the lag?
[21:50:03] it hits each server that meta_p on s7.labsdb reports as the home server for the slice
[21:50:09] https://tools.wmflabs.org/replag/?source
[21:55:05] jynus: apparently labsdb1001 went out of space before, when nagios was complaining it was short on space :( some of the replicas stopped
[21:56:03] I'll restart it but cannot guarantee complete data coherence
[21:59:23] I already did that
[21:59:43] they are stopped, s2-s7, starting now
[22:00:02] I restarted s1
[22:00:02] thanks for looking, folks. I know it's been a long long day for you
[22:00:15] the others are not used
[22:00:24] only s1?
[22:00:36] lower priority
[22:00:58] there are 667G now, can I restart them?
[22:01:07] bd808: then your tool is checking something else
[22:01:08] yes, s1 is now up to date
[22:01:43] because on labsdb1001 all were stopped but s1, which is in sync
[22:02:42] no, those checks are correct
[22:02:48] hmmm... it should be checking the heartbeat_p table on each shard by hostname based on dns
[22:02:50] they check the in-use shard
[22:02:59] yes, it does
[22:03:01] so why does it complain only about s2 and s4?
[22:03:14] because the others are on labsdb1003
[22:03:48] hence lower priority
[22:05:15] ok
[22:06:48] it will happen again if people do not clean up their databases
[22:08:11] jynus: is there an easy way we can make a tool that shows the top N user databases by size?
[22:08:26] so I have something to point at and whine about
[22:14:24] <_joe_> bd808: du -sh ?
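The replag checks described above read per-shard lag from the labs replicas; a minimal sketch of that kind of query is shown below. The heartbeat_p.heartbeat view and its columns (shard, lag) are assumed from the usual labsdb setup rather than quoted from this conversation, and s2.labsdb is just an example of the per-shard DNS names mentioned above.

    # Sketch of a per-shard lag check against a labs replica, in the spirit
    # of the replag tool discussed above. View and column names are
    # assumptions about the labsdb heartbeat_p setup, not quoted from the log.
    mysql -h s2.labsdb -BN -e 'SELECT shard, lag FROM heartbeat_p.heartbeat'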
[22:14:29] <_joe_> :P
[22:14:37] --max-depth=1
[22:14:48] <_joe_> du -sh * does that
[22:15:04] <_joe_> -s is --max-depth=1 :)
[22:15:25] isn't -s --summary?
[22:15:29] <_joe_> anyways, I'm going to bed :P
[22:15:40] <_joe_> jynus: yes sorry, too late
[22:15:51] so -s is max-depth=0
[22:15:55] <_joe_> yes
[22:16:05] <_joe_> you have to be in the mysql dir :P
[22:16:06] but you use * to see it for each subdir
[22:16:26] du -csh /srv/sqldata/* | grep G :-P
[22:16:28] <_joe_> well, all the file entries too ofc
[22:16:41] <_joe_> volans: grep T :P
[22:16:51] 0/10 will not buy
[22:16:56] depends :D
[22:17:04] bd808: I assume without shell access, right?
[22:17:12] ideally, yes
[22:17:44] I guess I could create some monitoring script that is cron'd on the db servers by puppet
[22:18:00] SELECT table_schema, sum( data_length + index_length ) / 1024 / 1024 as "Size in MB" FROM information_schema.TABLES where table_schema regexp '^s[0-9]' GROUP BY table_schema;
[22:18:00] send the patch, I will review it
[22:18:07] and similar for the other format
[22:18:09] no, please, volans
[22:18:14] :-P
[22:18:25] we do not want that on 40,000 tables
[22:18:35] db size per service group is something I'd like to track generally for T129630
[22:18:35] T129630: Collect and display basic metrics for all tools (service groups) - https://phabricator.wikimedia.org/T129630
[22:19:14] jynus: it should be only on user ones
[22:19:18] not all wikis
[22:19:18] ok
[22:19:31] 32 rows in set (0.25 sec)
[22:19:39] ok ok
[22:20:01] although... s51187__xtools_tmp had 10000 tables, now down to ~2k I think
[22:20:19] meanwhile, I cannot netcat into db1041
[22:20:31] and yes, I opened a port
[22:20:34] which port?
[22:20:35] :D
[22:21:47] I don't see anything listening
[22:23:01] I removed it already
[22:23:07] it got dropped by ferm
[22:23:21] Apr 21 22:08:53 db1041 kernel: [10992211.800514] iptables-dropped: IN=eth0 OUT= MAC=78:2b:cb:03:6f:71:5c:5e:ab:3d:87:c2:08:00 SRC=10.64.32.20 DST=10.64.16.30 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=3066 DF PROTO=TCP SPT=48665 DPT=4444 WINDOW=29200 RES=0x00 SYN URGP=0
[22:25:51] from bast1001 it should work
[22:27:48] it was my rule, in the wrong order
[22:28:03] it was logging and dropping before reaching it
[22:28:17] tcp connections work fine, it is mysql-only
[22:28:17] -A instead of -I
[22:28:27] the other way round
[22:28:45] with -I it is the one you wanted :D
[22:29:11] what did you put there for testing?
[22:29:30] a while true nc listening
[22:29:31] ?
[22:29:56] yes
[22:30:05] at the same time, did the same for mysql
[22:30:15] mysql stalls, nc does not
[22:31:31] what happened to the tendril graph for db1041?
[22:32:07] I wiped it
[22:32:31] * volans just lost a heartbeat...
[22:32:58] tell me when you do those things :)
[22:33:56] for the nc test... were you sending back something?
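The nc test being asked about here was the ad-hoc check described just above: a listener on an arbitrary port on the database host, compared against a MySQL connection on 3306, to separate network problems from MySQL-level ones. A rough reconstruction, with placeholder hostnames, is:

    # Rough reconstruction of the test: raw TCP to an arbitrary port versus
    # a mysql connection. Port 4444 matches the ferm log quoted above; the
    # FQDN is a placeholder, not quoted from the log.
    # On db1041:
    while true; do nc -l -p 4444; done       # drop -p if the nc variant rejects it
    # From another host (e.g. a bastion):
    echo hello | nc db1041.eqiad.wmnet 4444  # raw TCP: responds promptly
    mysql -h db1041.eqiad.wmnet -e 'SELECT 1'  # mysql on 3306: stalls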
[22:34:08] no
[22:34:08] because the slowness was not in the 3-way handshake
[22:34:22] but in the packets sent back
[22:34:38] I send forward
[22:34:54] obviously the acks return
[22:34:58] but not data
[22:35:09] we could try with python -m SimpleHTTPServer in a directory with a test file and download that
[22:35:23] no, I think this is proof enough
[22:35:33] I can test it the other way, too
[22:40:39] seems that it also lost the connection with its master db2029 ~1h ago
[22:40:44] 160421 21:35:13 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
[22:40:48] then reconnected
[22:40:58] no, that was me restarting the slave
[22:41:17] ok
[23:26:06] it is the pool/thread handling, almost sure
[23:26:26] waiting for a free thread?
[23:26:39] free in the sense of CPU free
[23:29:32] https://phabricator.wikimedia.org/T133309#2229489
[23:29:43] it does not happen on 3307
[23:30:09] either there is a strange firewall rule that only affects 3306
[23:30:16] saw it just now; do you want to try one thread per connection?
[23:30:27] cannot in a hot way
[23:30:27] or just tweak the thread pool
[23:30:36] that was the first thing I tried
[23:43:19] I don't get why it was not having those issues 1 week ago
[23:43:28] mysql was not even restarted there
[23:58:43] * volans going to bed, for reference db1033 has one-thread-per-connection :(
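On the thread-handling discussion that closes the log: switching between the thread pool and one-thread-per-connection cannot be done "in a hot way" because thread_handling is not a dynamic variable, while the pool size itself can be changed at runtime. A minimal sketch, with example values only:

    # Inspect the current model and pool size (both are MariaDB variables):
    mysql -e 'SELECT @@global.thread_handling, @@global.thread_pool_size'
    # The pool size can be tweaked hot (the value here is only an example):
    mysql -e 'SET GLOBAL thread_pool_size = 32'
    # Changing the model itself needs a config change plus a restart, e.g.:
    #   [mysqld]
    #   thread_handling = one-thread-per-connection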