[06:50:12] <_joe_> jynus, volans I prepared reverts for all changes we made during the switchover [06:50:26] <_joe_> the only one I could not revert is https://gerrit.wikimedia.org/r/#/c/284144/ [06:50:39] <_joe_> for which your scrutiny is probably needed [06:50:55] <_joe_> writing to ops@ [07:03:32] _joe_: that doesn't need to be reverted, we will merge https://gerrit.wikimedia.org/r/#/c/283771/ today and then for tomorrow we just need to swap the $master true=>false and false=>true, we are failovering all old eqiad masters today [07:08:14] I hate aNag for android when it get stuck refreshing in the middle of the night, seems to not have a timeout :( [07:08:28] I'll check/open a bug for them... [07:09:03] * volans getting a shower, be back soon ready for the failovering [07:11:19] I answered on the mail,_joe_, I didn't see the message here first [07:12:44] I've answered there too :) [07:12:55] didn't saw your reply when hit send [07:14:06] jynus: for T133122 ok for the insert ignore, but given the ID is an auto_increment, they are not misaligned there due to the missing rows of tonight? Following inserted rows have IDs that are in your dump [07:14:06] T133122: Backfill data into db1065 and db1066 - https://phabricator.wikimedia.org/T133122 [07:14:26] I can do a quick check getting the IDs from the dump and checking [07:14:45] ids should be correct as they are sent on STATEMENT based replication [07:15:03] if they were not, we have a really bad problem [07:17:00] how being an insert...select? [07:17:14] well, the insert-select was not sent [07:17:22] subequent insert were [07:17:45] mysql -BN -h db1052.eqiad.wmnet enwiki -e "SELECT max(rc_id) FROM recentchanges"; mysql -BN -h db1052.eqiad.wmnet enwiki -e "SELECT max(rc_id) FROM recentchanges"; [07:17:45] 817480596 [07:17:45] 817480596 [07:17:48] true, you mean all the others from usual softwar are properly inserted [07:18:12] change the second 65 for 52 [07:18:16] same results [07:18:27] that doesn't mean that they are not dangerous [07:18:56] 2 consecuitive insert selects with different data, and we are screwed [07:19:13] yes [07:19:24] specially if we have different locking patterns [07:20:10] I had problems in the past with that and using pt-table-sync/pt-online-schema change [07:20:37] I will review and deploy the change, ok? [07:20:45] The certs one [07:20:59] ok, I didn't had time to review the puppet compiler diffs, link in the CR [07:21:17] I'll get a shower and be back [07:24:23] I'm ready to merge the patches to enable base::firewall on the eqiad mariadb::core's whenever works for you, just give a "go" so that I don't interfere with the other maintenance work you're doing [07:26:23] actually, mortiz, we should merge it with the patches we need to do or apply them now and we rebase [07:26:47] probably the second will be easier to apply and debug [07:26:59] can we do it really now? [07:33:55] sure, let's do it now [07:34:01] any particular order you'd like? [07:35:05] same I reviewed [07:35:12] first one is already rebase [07:35:31] lets go to ops- [07:36:43] ok [08:04:56] jynus: for s2, do you want to failover db1018 to db1024 too? 
If you do want I need to add few lines on puppet [08:07:53] I do not know [08:08:06] whatever it works [08:08:46] whatever is your goal for this week, if you want db1024 as master I'll make it a master :) [08:09:29] I do not want db1024 as a master, the goal is have new TLS certs deployed [08:09:42] if that requires db1024 as the master, so be it [08:10:03] whatever is less dangerous [08:10:30] if we restart db1018 after merging puppet we'll have the new ones [08:12:40] then that is ok [09:07:56] hey jynus ! Just wondering if you could take a look at https://phabricator.wikimedia.org/T130067#2133384 and reply to it? :) [09:08:17] I'm not sure if them ebing together is possible / would even make sense [09:08:56] actually, this may be one of the changes to potentially do now [09:09:27] but it probably will not fit, and will have to wait until next failover [09:09:33] that is why I said 6 month [09:09:52] :D [09:09:58] changing a primary key is not easy, once it has a primary key, all other changes will not get blocked [09:10:04] and do not need a faiover [09:10:19] it is the PK what block everything [09:10:29] you can paste this there if you want [09:10:30] yup :) thought so! [09:10:33] (hope it is clear) [09:10:33] awesome, will do! [09:13:51] also regarding clearing a users entries from the watchlist table, even with the wl_id in place deleting in batches of 1000 would still be prefered? [09:16:01] yes, although the PK will make the change faster/less troublesome for replication [09:16:35] if we had a proper framework, we should implement a "decaying time" window [09:17:34] run 1000 updates on PK- if it takes way less than 1 second, double it; if it takes more, half it [09:18:19] okay! [09:18:23] also, once we have mariadb 10 slaves, index alter will be transparent [09:18:38] *masters, I mean [09:18:57] and in general, things like alters will be faster [09:19:19] I just cannor promise to be done this time (it will probably won't) [09:19:29] (the specific alter) [09:21:54] yup, okay! :) [09:22:56] volans, you are going to laugh [09:23:12] but now we have 2 more rows on api servers than on the other hosts [09:23:13] tell me [09:23:24] rotfl! told you :-P [09:23:26] the import went smothly [09:23:37] the count is ok: 2914 [09:23:50] but the other servers now have 2912 [09:23:57] so they have deleted 2 rows [09:24:29] compared to before [09:24:37] AFAIK those are deleted after like 2~3 months, but maybe you can "undo" a recent change? [09:24:53] I do not think so [09:25:01] rcs I think are only there for a month [09:25:52] I mean that those of yesterday should not get deleted from what I know [09:25:55] but it's very little [09:26:20] (my knowledge of the underlying application logic involved_ [09:29:34] <_joe_> hey, I just put down https://etherpad.wikimedia.org/p/eqiad-switchback, whenever you have time, phase 6 needs heavy editing by you guys [09:30:10] yes, we know [09:30:22] <_joe_> you knew I did that? 
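A sketch of the batched watchlist cleanup with the "decaying time window" jynus describes at [09:13:51]-[09:17:34]: delete about 1000 rows at a time on the PK, double the batch when it finishes well under a second, halve it when it runs long. The host, the wl_user value and the 500 ms / 1 s thresholds below are illustrative assumptions, not anything agreed in the conversation.

```bash
# Adaptive batched delete on the watchlist PK, per [09:17:34].
HOST=db1052.eqiad.wmnet   # illustrative host
DB=enwiki
USER_ID=12345             # hypothetical wl_user being cleared
BATCH=1000

while :; do
    START=$(date +%s%N)
    ROWS=$(mysql -BN -h "$HOST" "$DB" -e \
        "DELETE FROM watchlist WHERE wl_user = ${USER_ID} ORDER BY wl_id LIMIT ${BATCH}; SELECT ROW_COUNT();")
    ROWS=${ROWS:-0}
    ELAPSED_MS=$(( ($(date +%s%N) - START) / 1000000 ))
    [ "$ROWS" -eq 0 ] && break                            # nothing left to delete
    if [ "$ELAPSED_MS" -lt 500 ]; then
        BATCH=$(( BATCH * 2 ))                            # well under a second: grow
    elif [ "$ELAPSED_MS" -gt 1000 ]; then
        BATCH=$(( BATCH / 2 > 100 ? BATCH / 2 : 100 ))    # too slow: shrink, floor at 100
    fi
    sleep 0.5                                             # give replication room to breathe
done
```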
[09:30:24] <_joe_> :P [09:30:37] 817302324 817301655 are missing [09:30:48] I will delete before something else breaks [09:30:58] <_joe_> oh you weren't respondig to me but to volans, ok :) [09:31:22] _joe_: I would add in parentheses who has to do it a given command to, thanks, I'll take a look at it [09:31:26] jynus: go ahead [09:31:39] we are in the middle of something not very important, _joe_ just breaking eqiad database servers [09:31:47] *masters [09:31:54] <_joe_> jynus: yeah I didn't need an answer [09:32:03] <_joe_> that was just a notification [09:32:59] :-P [09:33:20] so those could be hidden edits [09:33:31] e.g. deleted articles, or made them hidden [09:33:48] in which case they could have or not broken data consistency [09:34:19] at this point we should consider those tainted, given that mediawiki tends to do unsafe statements [09:35:51] if there are unsafe statements all can be tainted... [09:36:08] yesterday script was a one-off to fix the issue during the switchover [09:37:35] so if the dump of lines in that timeframe is the same between 65/66 and the other slaves/master I'll tend to assume they are tainted as they were before :) [09:38:25] my assumption is that we don't have other INSERT ... SELECT from recentchanges into other tables [09:38:39] if we have... than they have wrong data [09:40:54] https://phabricator.wikimedia.org/T133122#2222196 [09:43:25] doing a diff only one row different [09:43:59] a field 0 and the other 1 [09:44:02] id 817303249 [09:45:22] no, wait... [09:45:56] yes [09:46:09] yes rc_patrolled [09:49:25] so summary: kill the query? good call [09:49:46] but better not skipping- creating the index or whatever [09:49:57] if you didn't have the time, is ok, leave it lagging [09:51:46] agree, was 1am and probably not my best call, I was worried that would have lagged too much, blocked us today for maintenance stuff and don't recover by tomorrow for the switchback [09:52:03] and yes creating the index is probably the best thing [09:52:09] it was eqiad == not real traffic [09:52:17] hoping it will not affect existing query plans [09:52:28] it eqiad was primary, yes, you did well [09:52:37] as it would have impacted availability [09:53:00] the actual summary is that we need to fix schema drifts [09:53:06] and we need help for that [09:53:15] I am reviewing the change [09:53:36] ok thanks [09:54:21] no touching yet the 5.5 masters, right? [09:54:32] <_joe_> btw, a few people on VP:T are complaining about db issues https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#More_than_usually_buggy [09:54:36] not at all [09:55:10] _joe_, I saw those [09:55:23] the problem is that mediawiki db errors are usually not db-related [09:55:23] <_joe_> ok, I wasn't sure [09:55:28] <_joe_> ehe [09:55:56] <_joe_> when in doubt, blame the db [09:56:35] I mean, if a row gets locked, it is a mediawiki logic- of course I can help, but the fix would be on mediawiki queries [09:57:06] it just need long-term profiling of mediawiki [10:01:08] ok let's do it [10:02:26] ok [10:03:23] running puppet on db1057 to test it [10:03:28] I mean, verify it [10:03:40] not merged yet [10:03:44] wanted you to be around [10:03:57] right... 
I was ahead :) [10:04:05] done now [10:04:12] db changes are safe [10:04:20] because they only write to the file [10:04:25] mostly [10:04:36] I just want to avoid the spam in the channel if something is wrong [10:05:43] db1057 and db1064 (different type of changes) run smoothly [10:07:18] I trust the merges [10:07:37] it is the replication channels that could fail on restart/change master [10:08:51] so for the large wikis (s1, s4, s5), I am thinking of preserving at least an API node if they have the old (.15/.16) query rewrite plugin in use [10:08:52] binlog_format is MIXED on all designated masters [10:09:21] I can change it to STATEMENT now on all [10:10:18] e.g. https://phabricator.wikimedia.org/P2934 [10:10:49] yes [10:10:53] we could have any issue with delayed dbstore1001? do we plan to change it tomorrow continuing to replicate from old masters for the next 24h? [10:11:03] now or before topology change, whatever you prefer [10:11:15] but before puting labs as a slave [10:11:30] I would pause that for now [10:11:42] as in, leave it in third tier [10:11:53] we can mange it after the failover [10:12:05] ok [10:12:12] or if we have the time [10:12:47] so what are you starting with? [10:14:02] s7? [10:14:38] ok, I will start with s2 [10:14:57] jynus: I tested requests to enwiki (s1), itwiki (s2) commons (s4), dewiki (s5), frwiki (s6) and rowiki (s7) from an eqiad appserver (mw1150), went all well. no idea how to test the es1 shard, though [10:15:08] and dawiki for s3 [10:15:39] jynus: s2 you need to do a manual CHANGE MASTER TO to empty the old certs [10:16:05] moritzm, I think that is more than enough [10:16:09] ok [10:16:11] on db1018 only of course [10:16:12] thank you very much for your help [10:16:25] yes, I saw you changed the designated master [10:16:37] ? [10:16:52] I left db1018 that was already a 10 master [10:16:54] sure, let me know if I can help with anything else during today's window [10:17:13] * jynus says, this younglings, I was killing eqiad servers before you were born! [10:17:57] * jynus you know, all this parameters, master, ssl, etc. I created those, there were not parameters at all before! [10:18:07] * jynus those were good times! [10:19:05] lol [10:19:15] it needs restart anyway for the "internal" cert [10:19:24] for its slaves [10:20:10] I think I will do that and silence all replication alerts [10:20:45] db2017 -> db1018 replica is using old certs, db2017 has the special CA cert that accept both old and new, see show slave status on db1018 [10:21:27] and db2017, that is using the old cert for connecting to db1018 but was restarted with the new ones [10:21:31] yes, but we want to restart the master if possible to use only the new one, and so its slaves [10:21:54] yes, you need to do a stop slave on db2017 too and change master [10:22:04] the slaves will have to use the old one unless db1018 is restarted [10:22:23] to reset the SSL parameters set manually to the path of old cert [10:22:50] for replication that works, but that requires master restart- and we did not restart db1018 [10:23:01] you did it on a diferent host [10:23:34] I think we are saying two different things :) [10:23:39] no [10:23:51] I know what you are saying [10:24:07] but that will work on all shards execept s2 [10:24:24] no, I'm saying that s2 has special steps, exactly for that [10:24:57] including a restart of db1018 [10:25:12] of course [10:25:18] ok :-) [10:25:19] see in the etherpad: # Ensure there are not custom certs on [10:25:26] then we agree! 
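For reference, the binlog_format change discussed at [10:08:52]-[10:09:21] boils down to something like the sketch below. The host list is a placeholder for the designated eqiad masters, and SET GLOBAL only affects connections opened after the change.

```bash
# Check binlog_format on each designated master and move it to STATEMENT if
# needed, per [10:09:21]. Host names are placeholders.
for host in db1018 db1057; do
    current=$(mysql -BN -h "${host}.eqiad.wmnet" -e "SELECT @@GLOBAL.binlog_format")
    echo "${host}: ${current}"
    if [ "$current" != "STATEMENT" ]; then
        mysql -h "${host}.eqiad.wmnet" -e "SET GLOBAL binlog_format = 'STATEMENT'"
    fi
done
```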
[10:25:28] :-) [10:25:31] :) [10:25:44] if we have the time, we can restart the others to get rid of the dual cert [10:25:50] but only if we have the time [10:25:57] not now [10:26:12] each one to its task! [10:26:14] and I suggest to stop also the replica from db2017 before restarting given that you have to change master there too [10:26:19] ok [10:26:37] I didn't changed the binlog_format to all, so let's do one by one [10:47:26] jynus: so full upgrade or just mariadb package? [10:52:10] if you are restarting mysql, or need to upgrade, do it of all packages, it worked well for me on other host, and there are some pending upgrades we could not do [10:52:31] so restart the host too... [10:52:47] for the kernel? [10:52:50] again, only on those where a simple change master is not enough [10:53:20] I think it is ok to restart, just do it one by one because there is a change they won't come back [10:53:25] *Chance [10:53:48] on eqiad slaves we need to restart mysql on all if you want them to work out of the box, otherwise just a stop slave change master to to use the new cert; start slave will work too [10:54:10] I thought we want to restart all of them to upgrade them too, but maybe I get it wrong [10:54:13] what I would do is do that first [10:54:26] (only the change master) [10:54:38] then, if there is time, the full package [10:54:42] ok [10:54:50] but TLS is the priority [10:55:12] the question is that older ones will require restart [10:55:23] (of mysql), for upgrade [10:55:38] or SSL will not work at all [10:56:05] of course [10:57:51] BTW, that is expected but even if the default cert is changed, replication still uses the old one [10:58:08] openssl support is from >= 10.0.16 or >= 10.0.22? [10:58:15] 22 and 23 only [10:58:27] what do you mean? 
(line above) [10:58:35] >= 10.0.22 [10:58:40] no the one before :) [10:58:47] no [10:58:52] in fact [10:59:01] it may be 10.0.22-2 [10:59:10] and not in 10.0.22-1 [10:59:18] due to a compilation problem [10:59:21] ok, I'll check that [10:59:32] there should not be 22-1s [10:59:41] but just in case [10:59:44] my what do you mean was for "that is expected but even if the default cert is changed, replication still uses the old one" [10:59:57] sorry [11:00:22] I have to run CHANGE MASTER on db1019, for example [11:00:29] *db1018, sorry [11:00:47] even if the global configuration is now the right one [11:00:57] to force the new certs [11:01:10] just to remove the old set value for the CA [11:01:25] yes, on the master.info / SHOW SLAVE STATUS [11:01:43] it was a heads up to check it [11:01:51] but it may be s2-only [11:02:17] just MASTER_SSL = 1 on the others should work well [11:02:29] it shouldn't be needed even on s2 [11:02:59] you should set to empty Master_SSL_CA_File, Master_SSL_Cert and Master_SSL_Key and if was restarted it will get the new ones [11:03:05] from my.cnf [11:03:07] no [11:03:14] ah [11:03:15] yes [11:03:17] mm [11:03:19] not sure [11:03:22] I think yes [11:03:35] but I have force it, just to be sure [11:04:19] I think in some cases, CHANGE MASTER may use the old values instead of the default ones [11:04:37] if they were not set manually should not [11:04:38] better safe than sorry, and we can run in again at any time [11:04:58] ^ [11:05:24] it is the codfw -> eqiad replication after all, we can change it later at any time [11:06:15] I am going to not touch the s2 slaves yet (as this topology should be already right) [11:06:21] and going for another master [11:06:54] ok, the designated master on the other shards should not need any action beside binlog_format STATEMENT [11:07:03] only the slaves needs work [11:07:06] yes [11:07:14] as in you are right [11:09:01] error reconnecting to master 'repl@db1018.eqiad.wmnet:3306' - retry-time: 60 retries: 86400 message: SSL connection error: error:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown ca on db2017, not sure if expected or something wrong [11:09:25] expected [11:09:30] it is using the old certs [11:09:31] where? db1018 or db2018? [11:09:34] db207? [11:09:36] 2017 [11:09:41] ^that one [11:10:05] so executing change master there too [11:10:24] yes is expected you didn't change master to remove the old certs values [11:10:31] Master_SSL_Cert and Master_SSL_Key [11:10:42] set them to '' [11:10:50] should I try now wihout- exactly [11:11:02] doing that [11:11:23] that's what I was trying to say before, probably I didn't explain myself :) [11:13:58] CHANGE MASTER TO MASTER_SSL_CERT='', MASTER_SSL_KEY=''; -- to be precise [11:14:20] I confirm that works: MASTER_SSL = 1, MASTER_SSL_CA='', MASTER_SSL_CERT='', MASTER_SSL_KEY=''; [11:14:37] with the 1 to force SSL [11:14:43] was alredy 1 [11:15:19] I cannot remember, but it is one of those things I prefer to set always, just to be sure [11:15:36] I just look at Master_SSL_Allowed :) [11:19:03] I will be taking a break now and continue with another shard later [11:19:27] ok [11:19:58] s2 is done except within-datacenter traffic [11:27:28] * volans lunch [12:05:16] * volans back [12:25:16] I will start with s3 now [12:26:00] ok jynus, a 5.5 slave of a 10 will work? 
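The cert reset volans confirms at [11:13:58]-[11:14:37], spelled out for db2017 (the host it was applied to above); the STOP/START SLAVE wrapping is the usual one, nothing beyond what the conversation states.

```bash
# Clear the manually-set TLS paths so replication falls back to the certs in
# my.cnf, keeping SSL forced on, per [11:14:20].
mysql -h db2017.codfw.wmnet -e "
    STOP SLAVE;
    CHANGE MASTER TO
        MASTER_SSL      = 1,
        MASTER_SSL_CA   = '',
        MASTER_SSL_CERT = '',
        MASTER_SSL_KEY  = '';
    START SLAVE;"
# Verify: Master_SSL_Allowed should be Yes and the CA/Cert/Key fields empty.
mysql -h db2017.codfw.wmnet -e "SHOW SLAVE STATUS\G" | grep 'Master_SSL'
```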
[12:26:23] "yes" [12:26:46] "ok" :) [12:26:52] I've done it for us, and we are actively doing it, for example on x1 [12:27:12] or, you know, on every single s* shard [12:29:37] true since yesterday :) [12:32:39] nice (s7)! https://tendril.wikimedia.org/tree [12:33:02] s7 completed just now [12:33:34] and db1033 is running :-) [12:33:52] eheheh [12:34:00] I've put the commands executed on the etherpad [12:34:08] I added the last 2 at the bottom [12:34:12] to complete the switch [12:36:22] I'll go with s6 then [12:41:51] did you change codfw's master? [12:42:07] or are they now a circle of 3? [12:43:44] codfw masters were already replicating from designated master [12:43:51] I'm stupid [12:44:12] * volans too, went to check ... [12:44:12] sorry [12:45:03] this is one of the thigs that unders normal circusntances would be easy [12:45:13] but with so much legacy, it gets confusing [12:45:27] yeah [12:45:27] I am glad I have you here [12:48:06] thanks, I'm glad to be of some help [12:50:46] s6 done [12:54:31] doing s5 [12:59:06] s5 done [12:59:51] doing s4 [13:03:57] s3 done [13:04:13] I suppose that leave s1 [13:04:17] *leaves [13:04:24] yep, last one [13:05:02] and last one with MIXED [13:08:22] s4 done [13:10:46] jynus: are you doing s1 or I do it? [13:11:05] I am [13:11:16] ok [13:11:22] next, I would, either check es/x1 [13:11:27] or do restarts [13:11:40] whatever you think will me more urgent [13:12:25] es/x1 do not have yet "designated" [13:12:58] so they would need puppet + s2-like treatment [13:13:14] no STATEMENT, though [13:14:46] I would like to take care of db1070/71/65 for the heating with chris, given we need to shutdown I'll also do the upgrade there [13:14:54] sure [13:15:04] let me do s1 quicky, then [13:15:42] as those will be down, upgrade them when I finish [13:16:31] sure I need to find Chris before :) [13:20:51] T105135 can probably be resolved now :) [13:20:52] T105135: Implement mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135 [13:21:20] :-) [13:21:30] I predict around 20 tickets [13:21:47] will be closed or now closed very quickly [13:21:59] you are very lucky to see an actual change [13:22:14] usually this kind of changes take months to prepare [13:22:25] well, literally it took years to upgrade to 10.0 [13:22:26] yeah [13:22:42] I upgraded a good chunk of those from 5.5 [13:22:52] slaves used to be on 5.5 tooª [13:22:54] ! [13:24:23] one s1 slave is lagging, blocking the change [13:24:30] well, several, one is lagging more [13:25:09] all 0 now [13:25:17] no, it is a lie [13:25:27] check db1047's graph [13:26:00] tendril does not yet use pt-heartbeat, and the lag it shows when it shows 1 number is a bit of random or something [13:26:02] db1047 is always a bit on the edge [13:26:13] it is not a production db [13:26:22] it is in reality an analytics-slave [13:26:22] multisource [13:26:25] yep [13:26:36] plus you remember hardware issues, locking,etc [13:26:41] *ay [13:27:33] I may leave it there [13:27:36] for now [13:29:19] ok [13:30:53] do we have slaves to restart on eqiad that we cannot do when active due to weight/specific role? 
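The check behind the exchange at [12:41:51]-[12:43:51] (codfw masters were already replicating from the designated eqiad masters) is just a SHOW SLAVE STATUS. Shown for db2017, the only codfw master named in this log; the same one-liner applies per shard.

```bash
# Confirm who the codfw master replicates from and that both threads are running.
mysql -h db2017.codfw.wmnet -e "SHOW SLAVE STATUS\G" \
    | grep -E 'Master_Host|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
```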
[13:30:59] so we can give priority to them [13:32:10] in general no [13:32:22] as in, there should not be single slaves that are SPOF [13:32:23] but [13:32:31] there are some more difficult than others [13:33:16] in particular, the ones with more load (db107[23] on s1) and the ones with role rc,logger that are only 1 [13:33:55] such as db1019 [13:34:07] new ones (70s) I guess are already updated recently [13:34:14] and db1026 [13:34:36] not really, as they were more loaded, it was more difficult to depool them [13:34:45] use the version for that [13:34:59] if it is 22/23 do not touch them [13:35:24] but if you mean >db1073, yes, those are already in jessie [13:35:33] not sure about the cert, though [13:35:52] for s3 they have it [13:36:02] but the new one? [13:36:19] oh, yeah [13:36:24] 21:04 logmsgbot: volans@tin Synchronized wmf-config/db-eqiad.php: Repool new db1075,1077,1078 after TLS upgrade on s3 - T111654 (duration: 00m 36s) [13:36:24] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [13:36:25] we can change it dynamically [13:36:27] from SAL [13:36:46] great [13:37:23] I would focus on what I said before: large weight and old servers [13:37:28] both things [13:37:33] at the same time [13:37:43] (old versions) [13:38:11] db1072 and db1073 [13:39:41] do you want to check those while I give a look at x1 [13:39:43] ? [13:40:48] remember the dump_now/dump_at_shutdown-preciselly the older ones may be the ones where newer configuration may not have taken effect [13:41:05] yes of course [13:41:09] ok I'll start with them [13:41:19] I know I should have change them dynamically [13:41:43] but as usual, time + risk of changing something on all servers at the same time [13:42:49] CR on your way [13:47:22] cannot see it [13:47:33] do you trust db2019, them, should I pool it? 
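The triage jynus suggests at [13:34:45]-[13:35:24] (restart only what is still below the TLS-capable build, 10.0.22-2 or later per [10:58:35]-[10:59:10]) can be checked quickly; the host list just copies the candidates named above.

```bash
# Which of the hard-to-depool hosts still run an old mariadb build?
for host in db1072 db1073 db1019 db1026; do
    echo -n "${host}: "
    mysql -BN -h "${host}.eqiad.wmnet" -e "SELECT @@version"
done
```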
[13:47:40] *es2019, I mean [13:49:17] in all my checks I didn't find a diff, and checking also with ori seems that the logic I've used is correct [13:49:34] then I will repool it [13:49:35] so I should have checked the right blob_ids, although I cannot be sure 100% about it [13:49:43] it is ok [13:49:59] cr is https://gerrit.wikimedia.org/r/#/c/284455/ I was having a conflict :) [13:50:01] in fact, the other reason why we depooled it is because a random crash [13:50:14] which hasn't happened since [13:50:40] true and was without load when crashed, just replica [13:50:45] like all the rest of the time [13:50:50] depool the old masters [13:51:02] we do not want them receiving reads [13:51:11] unless we really need them [13:51:43] ok, I'll tweak a bit the weights then [13:52:10] for a couple of shards the designated master was doing vslow/dump [13:52:20] mmm [13:52:25] I do not like that much [13:52:55] because those have crazy fragmentation [13:53:05] but I am complaining without a real alternative [13:53:11] s4,s5 [13:53:25] I would have done the same [13:54:06] I think you have done it :) in the sense the almost all of them where the ones chosen for the first TLS replication [13:54:17] yep [13:54:33] :-) [13:55:04] but if we had 1-2 more weeks we could have reimaged them with jessie [13:55:12] during the swithover [13:55:29] I'd suggest the next switchover to be a week, if possible [13:55:43] if not too distrupting for deployments [13:55:55] do not tell me, tell faidon/mark- or write it on the etherpad [13:56:04] or a good test to deploy while on codfw too :D [13:56:20] I would have needed more time to do schema changes, too [13:56:31] that is why I asked for the next one [13:56:41] what's up? [13:57:02] we want longer switchover :) [13:57:02] sorry, I didn't know you were on the channel, pinged you without wanting it [13:57:31] you should add [next time], volans [13:57:55] true :) was a general thought not related to this one [14:07:11] for s6 I cannot depool old master , is the only one together with db1061 and a bit of db1022 that is API [14:07:44] ok [14:08:43] and for s7 I dunno the weight [14:08:55] if the 2 existing ones can handle it [14:14:14] API. rc and dump usually have little load [14:14:23] the problem is that they have spikes [14:14:35] so we isolate *from* them [14:14:41] rather than isolate them [14:14:41] there is not API in s7 :) [14:14:53] (dedicated api) [14:15:01] so it get's shared on all other servers [14:15:14] but yes, maybe we should add some [14:15:38] it is difficult to say something without having some actual traffic [14:17:14] sent second attempt ) [14:17:16] :) [14:17:18] Re: we may be able to triplicate capacity on es1* by increasing the weight of the master [14:17:43] so we have 10000 available connections on each of the 3 servers, including hte master [14:18:05] could be wnough [14:18:09] *enough [14:18:27] we could also increase the "36" pool of connections to a bit more [14:25:57] 36? 
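A quick way to confirm a depooled old master really stopped taking reads, per [13:50:50]-[13:51:11]; the host is a placeholder and the 'wiki%' pattern assumes the wikiuser/wikiadmin account naming that shows up elsewhere in this log.

```bash
HOST=db1052.eqiad.wmnet   # placeholder for one of the depooled old masters
mysql -h "$HOST" -e \
    "SELECT USER, COUNT(*) AS connections
       FROM information_schema.PROCESSLIST
      WHERE USER LIKE 'wiki%'
      GROUP BY USER"
```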
[14:26:29] pool-of-connections concurrency [14:26:51] pool-size or something like that [14:26:56] so for overheating server I'll them with chris tonight, he'll be available starting from 7:30pm our time [14:27:42] ok [14:27:49] then I will continue with x1 [14:28:20] we can increse the thread_pool_max_threads, is 500 now [14:28:42] from mariadb 10.1 the default moved to 1000 [14:29:12] thanks for the +1, I'm merging and syncing db-eqiad.php then [14:29:18] we can, I just do not think it is a reason for the bottleneck- it was reaching max_connections [14:29:41] I think I saw that too, let me check [14:30:31] 14:23:23 [ERROR] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation. [14:31:59] is that 17? [14:32:57] 2018 for sure [14:33:39] let me check 17 [14:34:15] too [14:34:21] it could be that 17 got overloaded, was automaticly depooled, then es2018 failed too [14:34:22] much more than 18 [14:34:32] 18 has it only ~3 times [14:34:35] the ratio was 3:1 in weight [14:34:54] let's change it then [14:35:13] on all or just eqiad ones? [14:35:18] on all [14:35:34] rand(1000,10000) [14:35:46] that's my suggested value :) [14:35:54] not too large [14:36:06] it still should be lower than max_connections [14:36:07] 2k? [14:36:42] max conn is 5000 [14:37:01] also thread_pool_min_threads ? [14:37:50] I don't find it [14:37:55] https://mariadb.com/kb/en/mariadb/thread-pool-in-mariadb/ [14:38:02] in the server :) [14:38:27] windows only :D [14:38:36] ha [14:38:49] thread_pool_idle_timeout is kinda the equivalent on linux apparently [14:38:57] it's 60 [14:39:08] it is ok [14:40:11] I would set max_connections to 10000 (because it worked for me on API servers) [14:40:36] threads to 2k(?) [14:41:11] not sure about thread_pool_size, up to 40, like the number of cores? [14:41:42] in percona 5.5.35-33.0 they put the default value of thread_pool_max_threads to 100k!!! [14:41:49] I'm reading to see why [14:41:59] it is a different implementation [14:42:04] I think [14:42:12] it was also in beta at that time [14:42:14] could be a bug [14:44:29] ok for number of cores = 40 [14:44:34] for thread_pool_size [14:45:02] https://gerrit.wikimedia.org/r/#/c/284462/ [14:45:49] go ahead, I'll rebase and merge after [15:01:41] * volans taking a break, be back for the meeting (and have scheduled work later too) [15:55:47] * volans back [16:59:46] volans: hi! [17:00:03] hello :) [17:00:51] so, I'm preparing those three to get ready for you cmjohnson [17:01:09] okay [17:01:54] how much downtime do you need? [17:01:58] (for icinga) [17:02:34] jynus: if you agree for db1065/70/71 given we need to shutdown them I'll go with full upgrade [17:02:44] yep [17:02:50] STOP SLAVE [17:03:02] then you can do whatever you want :-) [17:03:06] :) [17:04:05] volans 10mins per server max [17:05:28] ok [17:06:03] jynus: I can see some connections there... [17:06:05] can you like upgrade the kernel before? [17:06:45] u:wikiadmin db:wikidatawiki (db1070) [17:07:34] and tendril show some QPS on db1065 too but I cannot see open connections [17:07:41] strange [17:07:43] yes [17:07:51] some activity is normal [17:07:57] 74 QPS? 
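The values settled on at [14:40:11]-[14:45:02] for the es* thread pool, as a runtime sketch: max_connections and thread_pool_max_threads can be raised with SET GLOBAL, while the thread_pool_size = 40 agreed at [14:44:29] is part of the puppet CR at [14:45:02]. The host list is a partial, assumed one.

```bash
for host in es1015 es1019; do   # assumed subset of the es1* hosts
    mysql -h "${host}.eqiad.wmnet" -e "
        SET GLOBAL max_connections         = 10000;
        SET GLOBAL thread_pool_max_threads = 2000;
        SHOW GLOBAL VARIABLES LIKE 'thread_pool%';"
done
```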
[17:08:03] due to LVS doing http requests [17:08:10] ah for the api [17:08:12] but it is like 3 QPS [17:08:36] maybe start the others [17:08:45] I will try to see where this comes from [17:08:47] both 65 and 66 have ~74 QPS (both are API) [17:08:49] ok thanks [17:09:02] for db1070 I should be worried by the 2 connections from wikiadmin? [17:09:06] ah snapshot1006.eqiad.wmnet [17:09:13] it is the backups [17:09:25] I suppose it is dump? [17:09:41] if it is, deploy dump to another server, then depool and just restart [17:09:51] dumps were not migrated [17:09:54] 70/71 no, regular slave [17:10:04] 65 is API [17:10:08] problem is ariel is not here [17:10:27] it could be wikidata dumps [17:11:02] ok I'll start from 71 then :) 1QPS and no connections [17:11:05] and if you say why did no tell me about this, I learned at the same time than you- now [17:11:14] lol :) [17:12:15] /usr/bin/php5 /srv/mediawiki/php-1.27.0-wmf.21/../multiversion/MWScript.php fetchText.php --wiki wikidatawiki [17:12:23] there is definitelly a dump going on here [17:12:31] why it is not on a dump host I do not know [17:12:53] great, do we have an ETA? if ~1-2h I can wait if cmjohnson is available later too [17:13:04] if you are sure it is not a dump host, it can be killed [17:13:20] let me check was not before my last change [17:13:22] that is what ariel told me, never wait because those will retry automatically [17:14:18] no is not dump and was not dump before my change of the masters [17:14:35] yes it is doing the dump of wikidata on that host [17:14:50] maybe at some point we restarted the dump host and it failover to other [17:14:55] * volans starting upgrading db1071 in the meanwhile [17:15:01] just stop [17:15:13] no issue with that, but maybe a bug [17:15:29] ok db1070 too [17:15:40] for 65 I'll wait for your ok [17:17:04] jynus: do you want upgraded kernel or new kernel? (upgrade vs dist-upgrade) [17:18:15] new kernel [17:18:38] 4.4 or 16(?) for trusty [17:19:00] do it on all, dumps have failovers, not worries [17:19:00] mmmh seems to install 3.13.0-85 [17:19:15] it is ok [17:20:15] cmjohnson: sorry for the delay, db1071 will be ready in a couple of minutes for you [17:20:21] ok [17:20:43] do you take care of shutting it down? [17:20:55] i can...or just do a shutdown when you're ready [17:21:02] and ping me [17:24:25] cmjohnson: shutting down now db1071 [17:24:40] once dead go ahead, I'll get ready the next one [17:24:52] okay [17:28:11] volans: cpu1 thermal paste is nearly non-existent [17:28:28] that can explain the overheating then :) [17:35:55] volans: db1071 powering up now [17:36:23] cmjohnson: great, I'll wait it come up before shutting down db1070, just in case didn't liked the upgrade [17:37:13] so you think the missing thermal paste is likely the cause of overheating? [17:37:28] we'll find out in a few days [17:37:33] :) [17:37:54] it is likely...we've had a rash of servers with thermal paste issues [17:37:57] pinging now [17:38:14] I'm in [17:39:41] cmjohnson: db1070 shutting down now... all yours once dead [17:39:46] cool [17:40:38] jynus: did you had a chance to look at db1065 for the unexpected QPS? [17:41:22] snapshot1006 is complaining now [17:41:34] so it must have harcoded ip or something [17:41:42] is not retrying? 
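The per-host maintenance sequence used for db1071/db1070 around [17:02:50]-[17:24:25], roughly: stop replication, make sure only system and monitoring sessions are left, then stop mysql and hand the box over. The query and the commented shutdown commands below are a sketch, not the exact steps run.

```bash
HOST=db1071.eqiad.wmnet
mysql -h "$HOST" -e "STOP SLAVE"
# Anything still connected besides replication and our own session?
mysql -h "$HOST" -e \
    "SELECT USER, HOST, DB, TIME, LEFT(INFO, 60) AS query
       FROM information_schema.PROCESSLIST
      WHERE COMMAND != 'Sleep'
        AND USER NOT IN ('system user', 'repl')
        AND ID != CONNECTION_ID()"
# If that comes back empty:
#   sudo service mysql stop && sudo shutdown -h now
```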
[17:41:57] it is, that is causing the errors [17:42:01] :-) [17:42:17] so will stop complaining when completed :D but yes it's a bug [17:42:59] 65 is the LVS pings [17:43:10] for some reason it is pinging always the same host [17:43:13] but it isok [17:43:27] ok, so I can proceed with it too [17:43:30] it should ping another, and even if it doesnt, we do not cate [17:43:39] it is pasive [17:44:41] ok [17:44:56] I'm putting SSL=1 when restarting [17:45:03] great [17:45:38] some of those may not have proper ssl support [17:45:51] also after the upgrade? [17:45:55] mmm [17:46:03] we should have not upgraded the APIs [17:46:07] but too late [17:46:18] it is ok, it should happen at some point [17:46:28] 1070 is not api [17:46:44] what about the others? [17:46:49] sorry 1070/1071 are not api [17:46:55] did you upgraded them? [17:46:55] 1065 I didn't upgraded it yet [17:46:56] 65 [17:47:00] ok, then wait [17:47:03] for that [17:47:04] ok [17:47:08] the other, no issue [17:47:23] there is like bazillion of issues with query optimization on apis [17:47:34] and I do not want to touch them [17:47:34] I can also just shutdown it for chris and then restart it [17:47:37] no upgrades at all [17:47:42] yes, not issues [17:47:53] you can actually upgrade it [17:47:59] except mariadb [17:48:04] SSL works with Ubuntu package [17:48:26] to do that I need to manually create a file in /etc/apt to pin wmf-maria10 package to a specific one [17:48:29] yes, but not openssl, yasl [17:48:44] which may not be compatible with out configuration [17:48:53] leave it as is, it will be easier [17:49:06] we can do it later at any time [17:49:24] ok, from ldd I see libssl.so.1.0.0 [17:49:26] I just do not know if it will rstart [17:49:38] if it works, great [17:49:38] ok leaving as it was [17:49:45] as you want for now is working [17:50:02] maybe the 5.5 was compiled differently than older 10s [17:50:16] "it's complicated" [17:51:53] yeah :) [17:52:17] I am searching the ticket(S) so you can say WTF! [17:52:21] volans powering up [17:52:44] cmjohnson: great, thanks, last one ready in a couple of minutes [17:52:53] how bad whas this one? [17:53:13] cpu1 looked thin but cpu2 was fine [17:53:29] makes sense...all of those servers were part of the same batch [17:54:01] yeah [17:55:02] so it is https://phabricator.wikimedia.org/T64615 plus another in which we reach the conclusion that there are 2 million possibilities of query types [17:55:34] not 2M different query, 2M different query plans, and it is almost impossible to serve all of those [17:55:47] ehehe ok, I'll read it later [17:55:55] yes, not the time [17:56:05] to upgrade db1065 I'm doing: apt-mark hold wmf-mariadb10 [17:56:43] [simulation] The following packages have been kept back: wmf-mariadb10 [17:56:51] it's ok if dependencies get upgraded? [17:57:20] yes [17:57:44] ok, I'll remove the hold after [17:57:54] you can leave it, it is ok [17:59:07] cmjohnson: another 3 min sorry :) [18:00:02] you can blame me [18:05:09] no worries..i have other things to do [18:05:17] shutting down now [18:05:47] x1-eqiad has a new master [18:05:49] cmjohnson: all yours once dead [18:05:54] jynus: great! [18:06:02] 10.0 now [18:06:29] where needs to be updated? (configurations) [18:06:50] I am doing mediawiki now [18:06:50] cool [18:07:04] db1070 back running, so I guess snapshot should be happy again [18:09:25] cofirming the errors from the LVS or mediawiki watchdog (/wiki/Special:BlankPage) [18:10:34] they are not going to anothe rhost? 
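The "upgrade everything except mariadb" approach used for db1065 at [17:56:05]-[17:57:54], written out; the simulate-first step is an extra precaution, not something stated above.

```bash
sudo apt-mark hold wmf-mariadb10      # keep the mariadb build pinned
sudo apt-get update
sudo apt-get -s dist-upgrade          # simulate first and review what would change
sudo apt-get -y dist-upgrade
sudo apt-mark unhold wmf-mariadb10    # drop the hold once done
```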
[18:11:01] matbe the load balancing tests here first like round robin? [18:11:11] maybe it fails and it tests another [18:11:20] we only see the mysql side of things [18:11:31] so, it is difficult to say [18:11:36] I'll ask someone else to have a look, if possible [18:12:07] I do not think it is worth it, but ok [18:12:26] I still see errors to db1070 one minute ago [18:12:36] is it lagging? [18:13:25] it was [18:13:27] in sync now [18:19:03] volans powering up now [18:19:23] cmjohnson: great thanks a lot for all the help [18:19:34] no problem...sorry about the timing [18:19:45] no problem [18:22:23] what do you see critical that is left about the maintenance? [18:24:13] apart the Db we talked about earlier (1019, 1026 and 70s), unless you want to do some schema change, nothing much [18:24:28] maybe reboot/upgrade a master? [18:25:07] cmjohnson: still not pinging db1065 :( [18:25:17] k [18:26:50] masters too make sense [18:28:32] I am going to deploy puppet changes for the certs of the masters, and restart them, performance impact should be low [18:28:48] unlike some of the slaves [18:29:01] you mean old masters? [18:29:08] new eqiad masters [18:29:20] they have already the new certs [18:29:40] do not they have double certs? [18:30:00] no, only s2 codfw master have double CA [18:30:12] let me review puppet, I have like a complete mess [18:30:16] and db1018 was having it manually, not in my.cnf [18:30:17] sorry about that [18:30:32] oh, yes [18:30:36] no prob is confusing and you were on cvacation while I changed that [18:30:41] because we coulf restart those [18:30:50] because they used not to be the masters [18:30:54] sorry [18:31:07] I was mixing old and new masters [18:31:18] old masters are not a priority [18:31:33] they are not fully upgraded though, if you want some "fancy" new thing, but usually is better to upgrade slave first :) [18:31:46] fully as in version? [18:32:12] OS upgrades [18:32:23] mmm [18:32:29] mariadb should be 22 on all ubuntu and 23 on jessie [18:32:39] there are some kernel things that were pending [18:32:51] but low impact [18:33:26] actually, now we will be able to deploy gtid [18:33:45] and that will simplify swithover, including master one [18:34:28] + InnoDB replica tables which means slave crash == fully consistent [18:38:51] the problem with watchlist is not only time (that will take some time) it is also testing- if I convert now 800 wikis and it fails, m*rk will cut my throat [18:39:02] I promised to do only "safe" maintenance [18:39:13] agree! [18:39:33] 65 is back [18:39:45] I was thinking more indexes but with masters on 10 should be doable also online [18:39:46] yes and no! [18:39:51] there is a bad DIMM [18:39:54] oh [18:40:05] free show 161151 [18:40:40] api servers -we can shutdown easily [18:40:53] I do not think it is a blocker [18:40:59] (1 of them) [18:41:12] Description: Correctable memory error rate exceeded for DIMM_A1. [18:41:34] right now is up, did you remove that DIMM? [18:41:42] it's now on B1 but it will mess with the boot process....if you reboot and get stuck on the console hit f1 [18:42:02] no, it's still there. I need to contact Dell and get a new one sent [18:42:12] we will need to power off again to replace [18:42:27] how dangerous do you consider it, better to not give service? 
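The lag check behind [18:12:36]-[18:13:27] (confirming db1070 caught up after its restart) is a one-liner per host; Seconds_Behind_Master is what the conversation relies on here, pending pt-heartbeat-based checks.

```bash
for host in db1070 db1071; do
    lag=$(mysql -BN -h "${host}.eqiad.wmnet" -e "SHOW SLAVE STATUS\G" \
          | awk '/Seconds_Behind_Master/ {print $2}')
    echo "${host}: ${lag:-no replication configured}"
done
```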
[18:42:42] ok this one is less critical than 1070/71, we can depool it easily [18:43:41] because if it is "we will need to reboot again", not an issue [18:43:47] not dangerous at all [18:43:51] ok [18:43:57] but if it get memory corruption is another thing [18:44:03] I prefer to have lots of servers up [18:44:18] because we had capacity issues on the first switchover [18:44:30] and eqiad is slower than codfw [18:45:41] ok, restarting mysql then [18:46:53] actually, I may not touch any other slave [18:47:21] to avoid issues? [18:47:23] because new hardware will arrive soon, and that will give us enough capacity to depool servers for maintennace [18:47:32] aside from that, of course [18:47:56] if masters are okish [18:48:01] jynus: did you start mysql on db1065? [18:48:06] no [18:48:11] it started by itself WTF [18:48:28] maybe it is auto-start? [18:48:34] which is a bad thing [18:48:45] but hopfully not fatal this time [18:49:18] logs are clear, replica too started [18:49:22] auto start and start slave? [18:49:28] pure fail [18:49:31] for us [18:49:54] epic [18:50:24] rc2.d/S19mysql -> ../init.d/mysql [18:50:28] well, I think things are looking good, so we should relax [18:50:28] yep, all runlevels [18:50:52] I may do some things later (like preparing the swithover and commits) [18:50:59] or checking things in general [18:51:46] I moved also db1047 and did not log it [18:52:11] even if I do not say it, volans, thanks for the effor these days [18:52:26] *I do not say it enough [18:52:44] thanks, but you don't have to say it :) it's the minimum I can do [18:53:59] so, let's start planning the migration to MySQL 5.7 [18:54:43] lol [18:54:49] mmmh strange thing on db1065 [18:54:52] puppet failed [18:55:04] /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install wmf-mariadb10' returned 100: Reading package lists [18:55:18] Error: /Stage[main]/Mariadb::Packages_wmf/Package[wmf-mariadb10]/ensure: change from held to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install wmf-mariadb10' returned 100: Reading package lists... [18:55:21] that sounds familiar [18:55:34] why is it trying to upgrade maria package? that is on hold [18:55:50] so it failed [18:55:51] E: There are problems and -y was used without --force-yes [18:56:19] it is the state [18:56:34] it doesnt understand that hold== ensure=> installed [18:56:51] I hope so... 
trying to remove the hold and force puppet noop [18:56:55] so, technically we should do ensure => hold [18:57:11] or something, but yes, just unhold it [18:58:22] yep, seems to fix it without doing stuff [18:58:55] short list TODO: prepare the lists of new masters [18:59:09] a script to restart the pt-heartbeat manually [18:59:13] and kill it [18:59:17] FYI: also the check of DPKG was failing [18:59:27] yes, with salt [19:00:06] and now we can use mysql/salt more reliably to set read only [19:00:19] with all masters in 10 [19:00:46] same procedure [19:00:49] after that, reimage the old masters to jessie [19:01:05] (I would like to keep the contents) [19:01:34] so that will get reed of most precises [19:02:07] and slave certs checks [19:02:46] we can reimage the slaves to jessie slowly- (only the faster, on warranty ones) [19:02:59] ok [19:03:05] due to decom [19:03:13] and we have to decom lots of servers [19:03:28] also, befer the swithover, check s3 eqiad config [19:03:40] maybe we should use only the 3 new servers [19:03:48] if the others start lagging [19:03:53] at least have the patch prepared [19:03:56] ok [19:04:51] it would be nice to have salt groups per shard and per role (master) [19:05:05] cmjohnson: will you update T132515 with things done and things to do? for the DIMM probably better a different task [19:05:06] T132515: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515 [19:05:19] jynus: yes yes yes, I always become crazy [19:05:22] we need them! [19:05:40] servers don't change shard [19:05:53] and if we put master in hiera is a single place [19:06:03] hiera now [19:06:08] + MW config of course [19:06:15] outside of puppet at some point [19:06:28] yes, but somewhere salt can read :) [19:06:33] yes [19:06:51] also, I am thinking [19:07:07] of changing puppet to run pt-heartbeat only if one is the primary datacenter [19:07:14] so we have masters on each datacenter [19:07:29] use the mw-primary (or somethign like that) [19:07:30] make sense [19:07:48] but it shoudl be out of puppet, as you said [19:07:57] all slaves equal [19:08:21] maybe we could also have pt-heartbeat running on the passive masters too, maybe just once a minute, to ensure replication in the other direction works [19:08:48] problem is replication breakage [19:09:04] the use different IDs in heartbeat.heartbeat [19:09:07] *they [19:09:14] we would need like a new column shard + datacenter [19:09:22] because now we check based on shard [19:09:39] I am not saying we shouldn't [19:09:40] not also master ID? ah no for third tier [19:09:41] true [19:09:49] I would like to [19:10:02] but the logic is not easy [19:10:08] and not that urgent too [19:10:22] but yes, it is something I would love too [19:11:20] I am happy with what we achieved [19:12:03] taking a break [19:12:36] me too [19:33:11] jynus: FYI for X1 binlog_format: db2009 has ROW, new master db1031 has MIXED and db1029 has ROW, probably better to set ROW to 1031 too [19:47:58] yes [19:48:53] you do it? [19:49:01] to avoid stepping on each other [19:49:08] yes [19:49:12] I'm preparing some stuff for tomorrow and updating the etherpad [19:49:21] I'll ask for your review when done [20:04:59] Needs moar x1 [20:07:26] more? [20:10:13] the rebase [20:10:24] pc are handled automatically [20:10:40] but I miss es, I am not sure in which state are they [20:10:47] I saw that [20:11:09] I was about to check [20:11:29] probably we are writing heartbeat in the othe direction? 
:) [20:11:40] or you fixed it manually [20:12:40] mmm [20:12:51] I would guess they are broken [20:13:35] pt-heartbeat is running on es1015 and es1019 [20:13:46] I was right! [20:13:51] and is NOT on es2015 and es2018 [20:14:03] * volans testing the script to kill/start pt-heartbeat [20:14:16] (just grepping of course) [20:14:17] wait, wasn't it YOUR change what we deployed? [20:14:27] totally your fault [20:14:36] :-D [20:14:58] surely my fault [20:15:09] to my defense the commit was saying s1-s7 :) [20:15:13] ha ha [20:21:23] if you wait a second to fix it we can try my script [20:23:15] pc* need pt-heartbeat? [20:23:20] yes [20:23:26] and the shard names are? [20:23:38] pc1 pc2 pc3 [20:23:44] very original everything [20:23:47] lol [20:23:54] but do not need master true [20:24:27] as there are only 1, it is handled by current primary site [20:25:13] ok, check /home/volans/eqiad-start-pt-heartbeat.sh on neodymium [20:25:33] $heartbeat_enabled = ($::mw_primary == $::site) [20:26:21] this is puppet? [20:26:43] yes [20:26:53] for parsercache [20:27:01] ok [20:27:07] needs more kill perl [20:27:28] is in the etherpad [20:27:39] I can merge them :) [20:27:54] was to be able to check it actually killed [20:27:55] I do not see it, where? [20:28:40] I keep them separated to be able to check, pgrep and pkill [20:29:28] all the blue stuff it's me [20:30:12] pgrep is tested, pkill and pt-heartbeat start not [20:30:36] same for SELECT (tested) and SET (not tested) [20:31:13] we should be able to use salt now as intended [20:31:36] like? [20:32:01] without the shards and masters not yet [20:32:18] run it in parallel, although I not sure I would trust salt [20:33:14] the pkill I can put the list there instead of the file [20:33:30] the start each one has a different shard, I can add & for each line [20:34:12] I wonder if there could be race condition [20:34:41] where/ [20:34:42] ? [20:34:44] the patch is merged, but puppet is already running with the old catalog, killed manually, but gets back to life [20:35:12] we can disable puppet 35 minutes before [20:35:16] on the masters [20:35:19] +1 [20:35:21] this can happen I totally agree [20:35:36] we had issues with more important things in the past [20:35:55] I learned not to trust puppet and salt [20:37:11] if you tested SELECTs it is ok, my only issue with that would be missing grants, if select works, SET will work [20:37:25] ok [20:38:11] seems ok [20:38:24] it also seems that you want to apply all those tomorrow [20:38:30] while I check the logs [20:39:30] lol :) [20:39:57] happy to do that, as you want, it's the same for me [20:40:03] there is a missing step [20:40:09] I added that to the wiki [20:40:30] Also set parsercaches read_only=off for the new datacenter [20:41:16] "new"? [20:41:39] so, set all as read-only, wait 1 second, then set only parsercache in eqiad as rw [20:41:56] actually, set all read only except parser cache [20:42:20] wait just before the traffic switch [20:42:57] set parsercache on codfw read-only, wait 1 second, set parsercache on eqiad read-write [20:43:44] neodymium:~$ ls /home/jynus/*-parsercaches.txt [20:43:47] ok, in which phase? [20:44:00] in paralel to varnish 2 [20:44:24] errors are going to happen, we can only minimize [20:44:31] phase 5 [20:44:44] right? 
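Rough shape of the pgrep/pkill/start flow for pt-heartbeat discussed from [20:13:35] and [20:27:07]-[20:30:36]. The real start commands live in /home/volans/eqiad-start-pt-heartbeat.sh on neodymium; the invocation, the OLD_MASTER/NEW_MASTER placeholders and the defaults-file path below are assumptions, not its contents.

```bash
# 1) where is pt-heartbeat currently running?
for h in es1015.eqiad.wmnet es1019.eqiad.wmnet; do
    echo "== $h"; ssh "$h" 'pgrep -f pt-heartbeat || echo not running'
done
# 2) kill it on a host that should no longer write heartbeats (placeholder name)
ssh OLD_MASTER.eqiad.wmnet 'pkill -f pt-heartbeat'
# 3) start it on the new master (assumed invocation, one per shard)
ssh NEW_MASTER.eqiad.wmnet \
    'pt-heartbeat --daemonize --update -D heartbeat --defaults-file=/root/.my.cnf'
```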
[20:45:03] 5-12 [20:45:28] ok adding there [20:46:21] pff, now we have duplicate numbers [20:48:33] I've renumbered the others :) [20:48:46] needs more eqiad RW [20:48:53] let me finish :) [20:49:34] ha [20:50:33] I will alwyas do the check command again after the change, not putting in the etherpad to avoid duplicates, too verbose [20:55:15] and needs more enable puppet [20:55:26] post read-only [20:55:41] I still need to disable them on top :) [20:55:55] it is already, I think [20:56:19] step 3. tehe first one [20:56:47] wwas only codfw now are both [21:02:20] if puppet is disable we can merge before the change on puppet [21:03:07] true [21:03:16] assuming the scripts works [21:03:27] and does not fail horribly [21:04:20] I am not happy with pt-heartbeat, in some cases works well, but being external to the database creates a lot of problems [21:04:29] plus it is a SPOF [21:04:40] who checks the cheker [21:04:46] or we can run it on the wrong node and not realize it [21:05:04] yes, what we want to do with es2/3 ones? [21:05:44] 2 options- testing the script now [21:05:49] or leave it as is [21:06:02] in fact, we could migrate all now [21:06:33] pt-heartbeat runs as root to overpass the read-only restriction [21:06:46] before the read only, after the puppet disable? [21:07:13] the order doesn't matter much assiming puppet does the right thing [21:07:24] that is: nothing! [21:07:38] as long as puppet does nothing we are ok :D [21:07:52] less stuff in the RO period the better [21:07:59] I'm moving it [21:11:12] if for any reason you move es2 and es3 [21:11:23] remember to put it on the CR [21:11:42] at this point I'm tempted to leave them as is [21:11:56] and removing the hosts from the list where I kill/start pt-heartbeat [21:24:25] I've done it --^^^ just commenting es1* in eqiad-start-pt-heartbeat.sh, leaving the pgrep and pkill, will not hurt [21:24:44] in case you don't agree we can change them tomorrow morning [21:25:27] it's ok [21:25:49] I would move puppet before the job/maintenance [21:25:58] but it is ok as is [21:26:07] no need to change the writing [21:26:37] in theory all stuff in the same phase can be done in parallel [21:26:42] according to the etherpad [21:26:53] the numbered points [21:27:31] ok [21:31:19] what about T133185 ? [21:31:19] T133185: Database error while saving a artice - https://phabricator.wikimedia.org/T133185 [21:31:40] seems application-related [21:31:42] get lock [21:35:14] I'm about to head off to bed [21:40:47] NOTE TO SELF: increase es1* thread limits tomorrow morning [21:45:54] yes [21:46:04] the answer is this: https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError [21:46:11] sorry [21:46:32] https://logstash.wikimedia.org/#dashboard/temp/AVQ1o1yDjK4nptUt61RQ [21:47:12] "due to cold caches those error happen more frequently during the failover, and for some hours afterwards" [21:47:53] we are constantly improving the performance of the queries, but we cannot assure some sporadic errors [21:49:35] sorry for the inconveniences caused. If an action fails, please retry it. If it fails repeatedly, please report it here again so we can have a look to see how we can avoid it in the future. [21:50:28] (add -rpc to the search to see the actual user impact) [21:51:10] 52 hits in the last 2 days with tendency to disappear [21:51:27] you have to be very unlucky to get one of those [21:59:29] great, thanks for the explanation
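Finally, the parsercache read-only flip described at [20:40:30]-[20:42:57] as a sketch: make the codfw parsercaches read-only, wait a second, then open the eqiad ones for writes, and verify both sides. The host lists stand in for the *-parsercaches.txt files mentioned at [20:43:44]; the actual names are not in this log.

```bash
CODFW_PC="pc2001.codfw.wmnet pc2002.codfw.wmnet pc2003.codfw.wmnet"   # assumed names
EQIAD_PC="pc1001.eqiad.wmnet pc1002.eqiad.wmnet pc1003.eqiad.wmnet"   # assumed names

for h in $CODFW_PC; do
    mysql -h "$h" -e "SET GLOBAL read_only = 1"
done
sleep 1
for h in $EQIAD_PC; do
    mysql -h "$h" -e "SET GLOBAL read_only = 0"
done
for h in $CODFW_PC $EQIAD_PC; do     # verify both sides
    echo -n "$h: "; mysql -BN -h "$h" -e "SELECT @@GLOBAL.read_only"
done
```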