[08:07:38] I am rolling in the triggers + UPDATES from gerrit:223344
[09:18:06] jynus: what is your plan for db1022? just watch jessie for some time?
[09:19:58] yeah, it is a bit stalled because of other priorities
[09:20:26] but there is still some work to do for jessie in general
[09:20:33] want to upgrade db1030 in s6. can i try raising db1022 load while db1030 is depooled?
[09:20:58] yes, no problem at all from what I saw
[09:21:05] ok thanks
[09:21:08] consider it full production
[09:26:24] hrmm... need to finish s6-pager.sql first
[09:30:23] So I was thinking about high level stuff this week, but I lack some information
[09:31:32] I may send you some emails about what your plans were and about hardware purchases already done
[09:31:51] ok
[09:56:55] db2030: do you know what the original purpose of that host was?
[10:14:42] db2030 was set aside for T85150 and T85141
[10:14:49] i gather we can have it back now
[10:15:52] yes, I saw that; I can double check that it is no longer in use
[10:16:31] was it a misc slave, or maybe a new server?
[10:16:58] db2030 wasn't anything yet
[10:17:05] ok
[10:17:11] T95184 is context
[10:17:42] fairly sure mutant.e said a definite yes, we can have db2030 back, on irc
[10:17:47] but good plan to check
[10:18:15] yeah, I like to triple check, especially if it involves deleting data
[10:18:29] :)
[10:21:53] jynus: nice going with T104900
[10:21:58] * springle cheers from the sideline
[10:22:21] thank you
[10:22:42] I reduced the scope to "things to do now"
[10:22:52] (short term actionables)
[10:23:22] problem is roles are not transparent
[10:23:50] we may be stuck with wildcards for a while
[10:24:27] that would not surprise me, knowing the labs and toolserver legacy requirements that still remain
[10:24:39] but listing it all out is still a good step
[10:24:48] so I focused on sanitarium
[10:25:36] we may want to update labs to 10.1 "when it is stable" (which is not going to be anytime soon)
[10:25:58] to have transparent roles
[10:27:12] BTW, I am on the receiving end of the maps project
[10:27:22] that means postgresql
[10:28:10] what do you mean by "transparent" role? is that a new keyword option for SET ROLE in 10.1?
[10:28:17] yep
[10:28:32] otherwise, you have to set the role for each connection
[10:28:54] it is SET DEFAULT ROLE or something similar
[10:29:06] to be fair, I only learned about that the other day
[10:29:26] oh ok DEFAULT :) i was trying to google for TRANSPARENT and felt foolish
[10:29:35] yes, those will be nice
[10:30:01] I am not a big fan of a project that came out of a google summer of code
[10:30:14] but it works for what we wanted
[10:30:18] you're ok with handling postgres?
[10:30:32] as in, you are willing
[10:30:37] I am willing to relearn it
[10:30:42] osm guy here
[10:30:49] fair enough
[10:31:31] tech is not the issue
[10:32:04] the specs are not firm from the outside project
[10:32:14] ah
[10:32:28] but we negotiated not to call it production
[10:32:45] no availability guarantee, no paging
[10:34:37] so, back to MySQL, in general I have a handle on the current production/labs status
[10:35:14] the only things I may need are some history and future plans, hence the emails I mentioned
[10:35:58] (the ones I haven't sent you yet)
[10:41:29] I also recently rejected this ticket: T104953
[10:42:46] * springle reads
[10:46:07] agree
[10:47:17] what about "communication", could the poster be offended?
[10:47:49] I am trying to improve in that area :-)
[10:48:28] well, you gave a good reason. that should be fine :)
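
As an aside on the 10.1 roles exchange above: the "transparent" behaviour being discussed is what SET DEFAULT ROLE (new in MariaDB 10.1) provides, avoiding a SET ROLE on every connection. A minimal sketch, using made-up role, grant, and account names rather than anything from this log:

    -- MariaDB 10.0 roles: each session still has to activate the role explicitly
    CREATE ROLE labs_reader;
    GRANT SELECT ON enwiki_p.* TO labs_reader;
    GRANT labs_reader TO 'some_tool'@'%';

    -- MariaDB 10.1.1+: activate the role automatically at login
    SET DEFAULT ROLE labs_reader FOR 'some_tool'@'%';

Without the last statement, every connection would have to run SET ROLE labs_reader before the role's grants take effect, which is why the wildcard grants mentioned above remain necessary on 10.0.
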
[10:50:29] T105135 needs your brain sometime, but no hurry
[10:51:08] the high level arch questions have much to do with that
[10:51:26] we seem to be synced pretty well
[10:53:03] yeah. well, let's brainstorm a bit. think again about mysql/mariadb
[10:53:17] or webscale :D hehe
[10:53:32] I am not too worried about that, actually
[10:53:33] except i gather hhvm would implode
[10:54:00] but about eliminating SPOF as you mentioned
[10:54:05] to be able to do that
[10:54:48] we already use the semi-sync plugins (i know, it isn't real HA). just noting
[10:55:10] let me share with you what I wrote the other day
[10:55:14] we used MHA once for switchover between PMTPA and EQIAD
[10:55:25] but MHA isn't that nice imo
[10:55:29] but please, take it as brainstorming
[10:55:34] sure
[10:55:39] not as "I think we should do that"
[10:56:10] I am not putting it on the ticket because it is only lateral to that
[10:57:36] mind emailing? i have to go and chase the kids to bed
[10:57:52] just done
[10:57:56] oh i see it
[10:57:57] thanks
[10:58:01] see you later
[10:58:09] will talk at length about this
[10:58:09] gn
[12:56:15] dbstore1002 s3 had another tokudb dupe key. minwiki.revision is now innodb
[12:56:35] that issue is getting a bit annoying. /me sighs
[12:57:39] let's schedule an upgrade
[12:57:55] have to get that box upgraded. if analytics can't move soon, we should just schedule it
[12:57:58] yeah
[12:58:52] a db1062 glitch which led to a short downtime?
[13:06:30] was there?
[13:06:43] didn't see anything to do with db1062
[13:06:48] not necessarily "in" the server
[13:06:53] (db server)
[13:06:59] but the connections it receives
[13:07:18] did you recently pool/change something on s7?
[13:07:32] db1041 was repooled, yes
[13:08:14] https://tendril.wikimedia.org/host/view/db1062.eqiad.wmnet/3306
[13:08:48] ~8h ago.. 05:46 logmsgbot: springle Synchronized wmf-config/db-eqiad.php: repool db1041
[13:08:52] >2x the normal traffic
[13:09:23] (check aborted clients)
[13:37:18] raised s7 db1041 to normal load (was lowered for warm up after mass re-partitioning)
[13:37:36] but that doesn't explain the spike of aborted connects on db1062 only (and not other s7 slaves)
[13:37:46] so... dunno
[13:37:58] oh, I do not think that directly caused it
[13:38:21] just that something caused it, and that may have contributed to it
[13:38:31] nod
[13:38:37] more load, more things can go bad
[13:39:12] you sleep during the most loaded portion of the day :-)
[13:40:09] good news is that I discovered yesterday that a large server can reach around 50,000 of our RO queries
[13:40:18] per second
[14:00:56] hehe, this is why i stay up sometimes, to see the real load
[14:01:30] they've been higher than 50000, though i don't know how long it could be sustained
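
On the db1062 spike above ("check aborted clients", 13:09): those counters are ordinary status variables, so a quick look outside tendril on the suspect host would be something along these lines:

    -- connection-error and connection-usage counters on the suspect host
    SHOW GLOBAL STATUS LIKE 'Aborted_clients';
    SHOW GLOBAL STATUS LIKE 'Aborted_connects';
    SHOW GLOBAL STATUS LIKE 'Max_used_connections';

Aborted_clients counts connections dropped without a clean close and Aborted_connects counts failed connection attempts; a jump in either alongside roughly doubled traffic matches the pattern described above.
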
[14:02:20] any clue why s5 is misbehaving on dbstore1002?
[14:03:16] DELETE /* Wikibase\Repo\ChangePruner::pruneChanges */ FROM `wb_changes`
[14:03:48] this will become a familiar sight to you during wikidata jobs
[14:03:50] :)
[14:04:24] anything to do with wb_changes and wikiadmin is unsurprising to see causing lag
[14:04:34] I saw that
[14:04:42] though probably the research queries are not helping
[14:24:38] one thing we have to do, and I will create a ticket
[14:25:04] is get an alert when connections > 90% of max connections
[14:27:34] (s5-dbstore1002 alerting)
[14:27:54] * springle turns off his phone and sleeps
[14:27:55] :)
[14:28:04] ha ha
[17:14:44] springle, the lag on dbstore1002 is probably caused by https://tokutek.atlassian.net/browse/DB-311
[17:16:24] (shard s5), which again points to T100408#1323731. More reasons to schedule an upgrade!
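
On the alert proposed at 14:25 (connections above 90% of max_connections): the actual alert would presumably live in the monitoring stack, but the underlying check is a simple comparison of a status counter against the configured limit. A sketch, with the 90% threshold hard-coded for illustration:

    -- compare current connection count against 90% of the configured maximum
    SELECT VARIABLE_VALUE AS threads_connected,
           @@GLOBAL.max_connections AS max_connections,
           VARIABLE_VALUE > @@GLOBAL.max_connections * 0.9 AS over_threshold
      FROM information_schema.GLOBAL_STATUS
     WHERE VARIABLE_NAME = 'THREADS_CONNECTED';

over_threshold returning 1 would be the alert condition, scoped to s5/dbstore1002 in the first instance per the note above.
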