[13:27:20] jynus: hey there
[13:27:30] hi
[13:29:02] Any reason you chose to not depool db1045?
[13:29:22] why depool it if there are no queries there?
[13:29:33] The api is read only
[13:29:36] because of that
[13:30:11] it was read only when both were lagged
[13:30:18] it should be ok now
[13:30:24] hm
[13:30:31] but we have no bot edits incoming
[13:30:33] oh right
[13:30:42] pywikibot sets maxlag
[13:30:52] which takes the MAX of all lags
[13:31:01] mmm
[13:31:47] in any case, the other server just got synced, let's wait some minutes to see if this one is stable
[13:31:59] Sure
[13:32:01] I am ok with depooling
[13:32:18] it's way more important that humans can edit than enabling bots
[13:32:26] just it was unneeded some minutes ago
[13:32:45] no, I care about API too, but if both servers were saturated, I could not do anything
[13:33:07] it just recovered, now we can see if we can do something
[13:33:13] Sure, just saying that it's more important to keep humans happy
[13:33:36] sure
[13:35:30] my fear is depooling db1045, then 49 gets saturated, I am just not rushing
[13:36:06] the other big user of API is gadgets, that is "closer to the human" :-)
[13:36:39] I am monitoring https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError closely
[13:36:47] Humans don't set maxlag
[13:36:52] yep
[13:37:01] but actually overloading the shard would be awful
[13:37:40] PK changes are difficult :-)
[13:38:36] to be fair, it is kind of expected, the best servers are reserved for "human" groups
[13:39:14] what I did not expect is the recentchanges table lacking a PK
[13:39:50] luckily, wikidatawiki and dewiki can take that load, if it was enwiki it would have been an outage
[13:40:38] indeed
[13:40:46] we had several close misses on Wikidata before
[13:41:30] is wb_terms the largest table?
[13:41:58] Largest as in?
[13:42:16] most rows or size
[13:42:25] No
[13:42:25] or the largest of non-core
[13:42:30] largest non core
[13:42:33] yes
[13:42:42] and also maybe largest w/ indexes
[13:42:46] but not pure data length
[13:42:53] I assumed that
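The maxlag back-off mentioned above works roughly like this: bots such as pywikibot send a maxlag parameter with every API request, and the API refuses the request whenever the most lagged replica in the shard is further behind than that, so the bot sleeps and retries instead of piling writes onto a lagged shard. A minimal sketch, not pywikibot's actual code; the 5-second value and the retry schedule are only illustrative.

    # Sketch of a maxlag-aware MediaWiki API client (illustrative, not pywikibot).
    import time
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def api_get(params, maxlag=5, retries=5):
        params = dict(params, maxlag=maxlag, format="json")
        for attempt in range(retries):
            reply = requests.get(API, params=params).json()
            error = reply.get("error", {})
            if error.get("code") != "maxlag":
                return reply
            # Replication lag exceeded maxlag: back off and try again.
            wait = 2 ** attempt  # stand-in for pywikibot's own retry schedule
            print("lagged (%s), sleeping %ss" % (error.get("info", ""), wait))
            time.sleep(wait)
        raise RuntimeError("replication did not catch up, giving up")

    # A read that still honours maxlag, using the dbrepllag query seen later in the log:
    print(api_get({"action": "query", "meta": "siteinfo", "siprop": "dbrepllag"}))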
[14:17:02] jynus: Do you think making db1070/1 api servers is worth a shot? They seem ok now, but gather replag here and there as well
[14:17:43] The api servers get a lot of stupid queries on the revision and logging tables
[14:17:51] no, I prefer to keep them API-free
[14:18:01] service is more important
[14:18:09] and even if they are better servers
[14:18:16] true
[14:18:23] might set load to 0 on db1045, though
[14:18:32] yes, that is ok
[14:18:36] Shall I?
[14:18:46] go on if you want
[14:18:51] if not, I will do it
[14:19:06] we will have a degraded API for a few hours
[14:19:25] there is only this much we can do with the available hardware
[14:22:58] just, if you are going to be around, monitor with me any incoming ticket/user report
[14:23:14] Not much longer, sadly :/
[14:23:25] I didn't see anything at dewiki, yet
[14:23:28] and they are usually fast
[14:23:34] :-)
[14:23:41] Last time I sent it read only they noticed within a minute
[14:23:50] probably you saw the warning I sent?
[14:24:00] On Wikidata?
Yes
[14:24:30] let's go through this as fast as possible, it is at 36%
[14:24:45] probably only 3 hours
[14:24:52] that'd be nice
[14:25:01] for next time, we will schedule it better
[14:25:23] also, having several changes at the same time
[14:26:26] Yeah, I was quite surprised to get paged :P
[14:26:33] But wasn't able to react then
[14:26:55] Still managed to see it was a schema change and not something else going mental
[14:27:25] the pages thing is a monitoring problem
[14:27:30] I will see to fix it
[14:27:41] The eventlogging ones? Yes
[14:27:45] Very annoying
[14:27:46] there are some non-critical alerts going through
[14:28:09] icinga is always changed here and there and we end up with bad configs
[14:28:32] I looked at it very briefly a few days ago
[14:28:37] that should alert but not page
[14:28:45] from puppet I can't see why it's paging
[14:28:55] indeed
[14:29:10] I think it is some combination of things, like dbas group always paging
[14:29:17] or something like that
[14:29:20] will look at it
[14:29:34] it also got combined with problems at m4 replication the week I was on vacations
[14:30:56] also, in the future, a database going down shouldn't page, only if the *service* goes down
[14:31:48] yeah, MediaWiki should handle on DB down fine
[14:31:52] just replag is awry
[14:31:57] and well, too many down is awry
[14:32:06] * one DB down
[14:33:22] for that probably we will have to move database handling outside of mediawiki
[14:33:35] which there are plans of
[14:34:45] Could probably also be done within MediaWiki
[14:34:47] but I do not care if a server is down, I care if API is down or in read-only, if recentchanges is not updated, if queries are slow
[14:34:51] its LoadBalancer is not that bad
[14:35:06] it is not the only reason
[14:35:16] deploying to change weights/depool?
[14:35:19] that is crazy
[14:35:24] True
[14:35:36] I need to automatize that
[14:35:37] but once we touch that, it will probably be down to ops only
[14:35:51] which is ok, I am an op :-)
[14:35:57] Well, you are
[14:36:21] the idea is to automatize everything so human intervention is mostly not needed
[14:36:36] we can still have scripts to call to depool
[14:37:02] that is the actual thing, I do not see a reason to spam mediawiki-config with depool-repool
[14:37:09] True
[14:37:19] see for things like this: https://phabricator.wikimedia.org/T120122
[14:37:36] it is crazy that I will need to do 300 commits to do that
[14:38:08] the current process is back from the days where domas was just editing the files on zwinger :P
[14:38:28] do not get me wrong, the current process is ok, but it does not scale
[14:38:52] I want (and I am building) a ./repair-db.py db1026
[14:39:13] that depools, copies from another slave and repools automatically
[14:39:27] That sounds nice
[14:39:30] All of that sounds nice
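A rough outline of what a ./repair-db.py along those lines could look like: depool the broken slave, clone it from a healthy one, wait for replication to catch up, repool. Every helper below is a placeholder (including the explicit donor argument, which the one-argument command above would presumably pick itself); the point is only that the whole cycle becomes one command instead of a series of mediawiki-config commits.

    # Hypothetical skeleton of the depool/copy/repool cycle described above.
    import sys
    import time

    def depool(host):
        # Placeholder: today this is a mediawiki-config commit plus a deploy,
        # which is exactly the manual step the tool is meant to replace.
        print("depooling", host)

    def clone_from(donor, target):
        # Placeholder for the actual copy, e.g. stopping replication on the
        # donor and streaming its datadir to the target.
        print("cloning", donor, "->", target)

    def replication_caught_up(host, max_lag=30):
        # Placeholder: would check the slave's reported lag on the target host.
        return True

    def repool(host, weight=100):
        print("repooling", host, "with weight", weight)

    def repair(target, donor):
        depool(target)
        clone_from(donor, target)
        while not replication_caught_up(target):
            time.sleep(60)
        repool(target)

    if __name__ == "__main__":
        repair(sys.argv[1], donor=sys.argv[2])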
[14:39:46] you do not see them feasible?
[14:39:47] I'm just a little unhappy that it'll again move stuff outside of what deployers can touch
[14:40:28] that is actually a fallacy, deployers currently cannot create schema changes
[14:41:20] They can... wikiadmin can alter tables
[14:41:28] with my arch, they could
[14:41:37] no, they cannot
[14:41:38] (but they most of the time shouldn't)
[14:41:55] RelEng has refused to take part on that
[14:42:15] hm... since when?
I used to do minor schema changes on my own
[14:42:15] as I am in charge of that, my solution is to automatize it as much as possible
[14:42:47] https://phabricator.wikimedia.org/T121857
[14:43:17] and https://phabricator.wikimedia.org/T119626 includes a load balancer/client side pool
[14:43:20] subscribed
[14:43:38] It'd be awesome to make that simpler
[14:43:47] I am ok with thinking it is ambitious
[14:43:51] there are so many schema changes I have in mind :P
[14:43:56] but not seeing there is a problem!
[14:44:21] and that we are large enough that we cannot continue doing things by hand
[14:44:46] Yeah, with two DCs and sever shards (not counting misc stuff)
[14:44:51] * seven
[14:44:59] 16
[14:45:02] :-)
[14:45:15] Well, counting es and pc?
[14:45:25] x1, m1-m5, es1-3 and pc
[14:45:39] Those are just misc for me :P
[14:46:23] for being misc, they annoyed you a bit (eventlogging) :-)
[14:47:21] I think schema changes should still be reviewed
[14:47:29] but self-server, or almost
[14:47:36] *ved
[14:48:14] Sure... bad schemas are very annoying
[14:48:31] I know what I'm talking about :P
[14:48:39] ha ha
[14:49:51] I got to go
[14:49:51] cu
[14:50:03] bye, and thanks for the help!
[17:44:42] jynus: I think db1026 is done
[17:47:56] yep
[17:48:06] 4 hours 35 min 8.00 sec
[17:49:37] it will lag until the main schema change is done
[17:49:48] but at least replication is fixed
[18:13:46] [eecbd253] Exception Caught: DB connection error: Too many connections (10.64.48.25)
[18:13:51] shit
[18:13:57] jynus: ^
[18:14:19] where?
[18:14:24] 1070
[18:14:31] and 1071 is close to exhaustion as well
[18:15:48] I do not see that
[18:15:52] 500 connections only
[18:16:00] yeah, it's good again
[18:16:06] I wonder what happened
[18:16:16] oh, it got depooled now
[18:16:35] https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=&format=xml
[18:16:57] $ mysql -h db1070
[18:16:57] ERROR 1040 (HY000): Too many connections
[18:16:58] but I see the errors
[18:17:13] is it the api server?
[18:17:18] no
[18:17:28] the one with generic load
[18:18:31] I see it now
[18:18:37] enter on port 3307
[18:19:12] ah
[18:19:17] it is the metadata lock
[18:19:29] it means the schema change is about to finish
[18:19:43] ah ok
[18:20:02] not expected
[18:20:14] why is only that box acting up?
[18:20:50] it depends on the ongoing queries
[18:21:03] oh, right...
[18:21:12] db1070/71 get all terms table queries
[18:21:14] metadata locking == ddl is locked until all selects finish == selects pile up
[18:22:45] there is a trick to avoid it, but I didn't use it because it didn't happen when I started it
[18:23:12] how long will that take on db1070?
[18:23:57] it should have gone away now, I will kill SELECTs now
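The connection pile-up above is the classic metadata-lock pattern: the ALTER has to wait for every running SELECT on the table to finish, and every new SELECT then queues up behind the waiting ALTER until max_connections is hit. Killing the long-running SELECTs, as done here, amounts to something like the following; db1070 and port 3307 are from the conversation, while the credentials and the 60-second threshold are made up.

    # Kill long-running SELECTs so a pending ALTER can take its metadata lock.
    import pymysql

    conn = pymysql.connect(host="db1070.eqiad.wmnet", port=3307,
                           user="root", password="...")  # placeholder credentials
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, time, info
              FROM information_schema.processlist
             WHERE command = 'Query'
               AND info LIKE 'SELECT%'
               AND time > 60
        """)
        for thread_id, seconds, query in cur.fetchall():
            print("killing %s (%ss): %.80s" % (thread_id, seconds, query))
            cur.execute("KILL QUERY %s" % int(thread_id))
    conn.close()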
[18:56:26] jynus / nuria: I heard some rumors about EL replication being only 3 hours.
Here's what I was looking at:
[18:56:30] https://www.irccloud.com/pastebin/DMv7gPmG/
[19:01:50] nope
[19:01:57] look at my tool:
[19:02:25] https://phabricator.wikimedia.org/P2487
[19:03:11] MobileAppUploadAttempts_5257716 is a false positive, as I explained on the ticket
[19:03:35] all schemas not shown there have less than 1 hour of lag
[19:04:08] this only measures lag between mysql servers, if there are events not inserted on m4-master, that's out of my control
[19:08:06] milimetric, here it is the verbose mode: https://phabricator.wikimedia.org/P2487#10300
[20:04:32] ok, so here's what I saw:
[20:04:42] m4-master had data up to the very last minute
[20:04:52] analytics-store was behind a few days this morning
[20:05:03] and now it's caught up all the way to be behind only an hour
[20:05:11] but an hour or so ago it was behind like three hours
[20:05:54] so I'm thinking that the replication script inserted everything, but maybe it wasn't done with the insertion transactions. Are we still inserting only 1000 rows at a time using the way ottomata modified the script?
[20:06:09] no, I think you are comparing different things
[20:06:21] I am comparing lag between mysql
[20:06:23] s
[20:06:37] you are checking non-inserted data
[20:07:08] what I cannot know, for example, is if there are pending logs to be inserted
[20:07:46] max(`timestamp`) is also 20160120194245 for me
[20:08:08] but both on the master and the slave. If that is correct or not, I do not know
[20:08:09] right, which is still a little bit behind m4, where it's: 20160120200657
[20:08:28] yes, but that is within the normal parameters
[20:08:32] 1 hour
[20:08:37] yea
[20:09:14] there are sometimes huge inserted batches, and mysql cannot temporarily keep up with them
[20:09:24] but then it catches up
[20:10:13] I mean, the m4-master db is always caught up
[20:10:25] so you mean the batches that happen as part of the replication script, right?
[20:11:40] no, insertions to the same table
[20:12:02] I thought eventlogging acted as a buffer
[20:12:08] to WikipediaPortal_14377354 on m4-master?
[20:12:26] that's always up to date as far as I can see:
[20:12:27] no, in general - there are large inserts to Edit
[20:12:29] https://www.irccloud.com/pastebin/JxAH6PYJ/
[20:12:51] yeah, there's a buffer but only for maximum 5 minutes
[20:13:04] oh, I misread the 20 as a 12
[20:13:15] me too, I did that earlier :)
[20:13:24] then I think we are discussing while everything is right, isn't it?
[20:13:45] everything's ok, but analytics-store was behind this morning and slowly caught up
[20:13:47] "right", there is room for a lot of improvement, but there is not 3 day lag at all
[20:14:00] i'm mainly wondering if we removed Andrew's limit of 1000 rows at a time for the replication scripts
[20:14:04] yep, I have been fixing issues from last week
[20:14:09] it was a hack he did and in my theory it explains a bit of what I'm seeing
[20:14:16] cool, thanks!!
[20:14:18] all is well then
[20:14:22] the recovery?
[20:14:35] less failures, but I do not think that could keep up
[20:14:53] the main thing is to move the replication process outside of the slaves
[20:15:02] k
[20:15:10] because they do not have enough IOPS to do it themselves
[20:15:24] I have not puppetized it yet
[20:15:47] but I have been running it manually, will document it properly when I find the time
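For reference, the check the two of them are comparing notes on boils down to this: take max(`timestamp`) of an EventLogging table on m4-master and on analytics-store and look at the difference, remembering that this only tells you about MySQL-to-MySQL lag, not about events that never reached m4-master. The table and host names are the ones mentioned above (with the usual .eqiad.wmnet suffix assumed); the credentials and the `log` database name are assumptions.

    # Compare an EventLogging table's newest timestamp on master and slave.
    from datetime import datetime
    import pymysql

    TABLE = "WikipediaPortal_14377354"

    def max_ts(host):
        conn = pymysql.connect(host=host, user="...", password="...", db="log")
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(`timestamp`) FROM `%s`" % TABLE)
            (ts,) = cur.fetchone()
        conn.close()
        if isinstance(ts, bytes):  # MediaWiki-style 14-digit timestamp
            ts = ts.decode()
        return datetime.strptime(str(ts), "%Y%m%d%H%M%S")

    master = max_ts("m4-master.eqiad.wmnet")
    slave = max_ts("analytics-store.eqiad.wmnet")
    print("master:", master, "| slave:", slave, "| slave is behind by:", master - slave)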
[20:26:44] :( I'm sorry you have to deal with this, you have my support :)
[20:27:30] deal?
[20:27:41] this is my job
[20:28:17] I would love to have more time to do it more carefully and properly, but other than that, everything is ok
[20:28:57] the replication script was well intended, but both Sean and I had very little time to improve it
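Not the actual replication script, but an illustration of the batching being discussed (the 1000-rows-at-a-time change): find the slave's high-water mark for a table and copy newer rows from m4-master in small chunks, so the slave never has to absorb one giant insert. Connection details, the `log` database, and INSERT IGNORE as the dedup strategy are all assumptions, and real code would also have to handle rows that share the boundary timestamp.

    # Batched one-way copy of new EventLogging rows, 1000 at a time.
    import pymysql

    BATCH = 1000
    TABLE = "WikipediaPortal_14377354"

    def connect(host):
        # Placeholder credentials; the real setup reads them from defaults files.
        return pymysql.connect(host=host, user="...", password="...", db="log")

    def copy_new_rows(master_host, slave_host, table):
        master, slave = connect(master_host), connect(slave_host)
        while True:
            with slave.cursor() as cur:
                cur.execute("SELECT COALESCE(MAX(`timestamp`), '') FROM `%s`" % table)
                (high_water,) = cur.fetchone()
            with master.cursor() as cur:
                cur.execute(
                    "SELECT * FROM `%s` WHERE `timestamp` > %%s "
                    "ORDER BY `timestamp` LIMIT %d" % (table, BATCH),
                    (high_water,))
                rows = cur.fetchall()
            if not rows:
                break
            placeholders = ", ".join(["%s"] * len(rows[0]))
            with slave.cursor() as cur:
                # INSERT IGNORE so a re-run does not choke on already-copied rows.
                cur.executemany(
                    "INSERT IGNORE INTO `%s` VALUES (%s)" % (table, placeholders),
                    rows)
            slave.commit()
        master.close()
        slave.close()

    copy_new_rows("m4-master.eqiad.wmnet", "analytics-store.eqiad.wmnet", TABLE)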