[01:10:03] no, i did nothing with es1002 yesterday
[02:55:40] wandering down the analytics rabbit hole now, for various reasons
[02:55:57] should only be touching dbstore* and m4 hosts
[02:56:12] (in an as-yet-undefined way :)
[03:02:14] jynus: db1045 will repool later today, in s5, as a normal unpartitioned slave. it could become the s5 "vslow, api" host if you want to tackle the db1049 errors sometime
[03:04:31] great work steaming through maintenance on the poor, unloved external storage hosts, btw!
[06:20:08] springle, regarding es1002, then something ignored the depool (dumps?) and that is scary
[06:27:05] jynus: that's unusual, but not unheard of
[06:28:07] my suspicion would fall on: wikiadmin traffic, wikidata-related jobs especially
[06:28:42] they can be slow to pick up depool changes. also the snapshot100x hosts can be slow, and do use external storage
[06:30:03] actually, you're probably right. dumps == snapshot100x. good call
[06:30:42] yeah, I saw dumps taking a long time to ack that
[06:30:52] (a depool)
[06:31:10] but only because they were actively in use
[06:31:36] checked the scripts, they use the same mediawiki config, so not worried
[06:31:45] anyway
[06:32:09] 1 question, did you do something to db1047?
[06:32:46] I always leave it with lag, and find it without
[06:33:01] no, haven't touched it
[06:33:14] research queries do appear on db1047 in waves each day
[06:33:31] saw the battery problem I ticketed?
[06:33:41] yep
[06:33:41] it is on writethrough
[06:33:56] that will suck, indeed
[06:34:12] can we move them to dbstore2, at least theoretically?
[06:34:20] part of the analytics stuff i'm doing will hopefully lead to db1047 decom
[06:34:23] not yet
[06:34:27] ok, I see
[06:34:48] generally, don't stress over db1047
[06:35:01] I didn't know if we were on the same page there, I see we are
[06:35:03] i mean, we can't ignore it, but it's not used by any of the production analytics tools
[06:36:01] gotcha
[06:37:11] we had some unlucky weeks of disks failing, I do not know if that is the normal rate or only because I am checking the RAIDs more now
[06:38:22] it seems normal to me. they seem to come in cycles, maybe due to auto-learn or something, but having several in a week does happen
[06:38:23] someone mentioned a "worse disk batch than usual" in the meeting for some of the bought disks
[06:38:29] oho
[06:38:32] that would do it
[06:39:29] how are things going in general for you?
[06:40:38] all good. i got the es100x page last night (or my app buzzed) but rolled over and ignored it, secure in the knowledge that you were around :)
[06:40:47] going ok for you?
[06:41:16] yes
[06:42:09] personally I am happy, and even more so now that I am getting to know everybody at the job
[06:42:23] do you have any major prod changes you want to do soon? maybe the P_S stuff, if you're happy with the load impact
[06:42:46] I am investigating it, db1018 I think?
[06:43:12] pt-table-checksum with and without P_S to analyze the impact
[06:43:23] and userstats could be deactivated
[06:43:33] I will report everything on the ticket
[06:43:48] userstats on prod could go, yeah. but it's useful on misc
[06:43:55] (just noting)
[06:44:15] yes, but P_S supersedes, I think, all its use cases
[06:44:43] ok (if we're sure :)
[06:44:46] do not worry
[06:44:56] * springle stops worrying
[06:45:00] if "IF" we did that
[06:45:06] and that is not sure
[06:45:19] we would document it
[06:45:29] well, it's all diagnostic stuff, so feel free to experiment and make it all awesome
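The comparison jynus describes above could look roughly like this: run the same checksum pass against db1018 twice, once with performance_schema=ON and once with it OFF in my.cnf (toggling it requires a restart, as noted later in the log), and compare wall-clock time and host load. This is a hedged sketch, not the actual WMF invocation: the database, user, thresholds and the ops.checksums table are illustrative.

```bash
# Illustrative sketch: run once per performance_schema setting and compare
# runtime and server load; pt-table-checksum creates the --replicate table
# if it does not exist yet.
time pt-table-checksum \
    --host=db1018.eqiad.wmnet \
    --user=wikiadmin --ask-pass \
    --databases=enwiki \
    --replicate=ops.checksums \
    --chunk-time=0.5 \
    --max-lag=5 \
    --no-check-replication-filters
```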
[06:45:54] also, I would code some extra graphs and reports for tendril
[06:46:09] for P_S?
[06:46:14] yep
[06:46:18] cool
[06:46:35] i noticed someone set up a tendril repo in gerrit, but didn't tell me
[06:46:48] I got some suggestions from people
[06:46:48] so will have to merge a few changes from the github repo
[06:47:20] minor tweaks, like a more informative "title=" on the tree
[06:48:04] or, for me, the ability to show 1 graph so it doesn't render all of them
[06:48:50] (sometimes while warming the buffer I only need 1)
[06:49:01] yeah, the host graphs page is pretty heavy on db1011 too
[06:49:22] it does cache for 300s or something iirc
[06:49:22] I think those changes are small enough to be feasible
[06:49:38] the intention was to pre-generate graphs
[06:49:41] async
[06:51:22] oh, there is one thing I am angry about!
[06:51:30] :)
[06:51:46] I was never a user of multisource replication
[06:51:58] obviously I am a user now
[06:52:24] but the syntax that mariadb chose to implement it is infuriating
[06:52:36] haha
[06:52:41] yeah it's odd
[06:53:08] and the default_master_connection thing is easy to get confused
[06:53:12] over
[06:53:46] well, that I understand- it keeps the other commands compatible
[06:54:21] is it STOP SLAVE, or SLAVES? does ALL go before, or after?
[06:54:42] and where does 's1' go?
[06:55:10] I may have run it a thousand times, and I still forget
[06:55:34] it is like when they invalidated "SHOW INNODB STATUS"
[06:55:53] ENGINE :)
[06:55:55] yeah
[06:55:58] muscle memory, too many times typed
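For reference, a minimal sketch of the MariaDB multi-source syntax being grumbled about above; the connection name 's1' matches the log, but the host, user and password in CHANGE MASTER are placeholders.

```sql
-- The connection name goes after SLAVE (quoted); ALL goes between STOP and SLAVES.
STOP SLAVE 's1';
STOP ALL SLAVES;
START SLAVE 's1';

-- Per-connection and aggregate status.
SHOW SLAVE 's1' STATUS;
SHOW ALL SLAVES STATUS;

-- Unqualified commands act on whatever @@default_master_connection points at.
SET @@default_master_connection = 's1';
SHOW SLAVE STATUS;

-- The name also goes right after CHANGE MASTER (host/user/password illustrative).
CHANGE MASTER 's1' TO
  MASTER_HOST = 'db1050.eqiad.wmnet',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_USE_GTID = slave_pos;

-- And yes, it is SHOW ENGINE INNODB STATUS these days, not SHOW INNODB STATUS.
```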
[06:57:16] i made it worse for me by hacking my local mysql client, so ps == SHOW PROCESSLIST, sts == SHOW TABLE STATUS, sis == SHOW INNODB STATUS, etc
[06:57:29] now i am almost useless on a normal mysql cli
[06:57:46] oh, not recommended
[06:57:54] and have to tunnel everything
[06:58:23] I did consulting and people have the strangest setups
[06:59:00] here, I wouldn't mind, but just for your sanity
[06:59:56] well, i never use the mysql cli on servers but tunnel from iron, so wfm
[07:00:54] and i never plan to go back to support or consulting :D
[07:00:57] do you?
[07:01:19] if WMF fires me, I may have to!
[07:01:23] haha
[07:01:28] not likely
[07:01:40] I would agree with support
[07:02:03] but it comes with the standard package of being "the mysql guy"
[07:03:26] true
[07:03:29] from what I can see, you protest a lot, in code form, against things you do not like
[07:04:03] monitoring, mysql client, wm?
[07:06:49] not protest. just getting my own shit done
[07:07:09] ha ha
[07:08:03] nobody else has to use my code. well, except you and tendril, until we have a better solution (that does not need agents or minions or perl)
[07:10:32] regarding other changes I would like to do in the cluster, it may be db1069
[07:11:06] we may have to change/fix the filtering
[07:12:04] and even if it works fine now, and I am lazy when it works, it may be easier to have 1 server at the same time
[07:13:21] please, feel free to redesign sanitarium. nobody particularly likes it
[07:13:28] springle: there are some databases scattered around there
[07:13:45] "percona", "ops", etc.
[07:14:21] I would like to come up with a "standard" for a non-data db, and maybe put it in the filters
[07:15:13] "percona" idk about, and it is unlikely to be in use. "ops" we do use in a few places
[07:15:43] the reason is that if we create, for example, checksums, they get replicated to the dbs with filters too
[07:15:53] or
[07:16:15] we say that checksums go to a table in a database that is filtered, for example
[07:16:30] (as I did)
[07:16:51] i have not problem with filtering "ops" by default, everywhere
[07:16:56] no* problem
[07:17:20] then we must be aware that we could have conflicts
[07:17:29] (not on checksums)
[07:18:31] nod
[07:18:52] i will check what would be necessary
[07:19:10] report back, it's not a priority change
[07:19:51] but as I told you before, I want to move towards more thorough checks
[07:21:01] the main user of "ops" is the event scheduler. most of the old uses are no more
[07:21:07] eg, https://git.wikimedia.org/blob/operations%2Fsoftware/507ec49389d2f3d95769acb163c0a1eb48de5640/dbtools%2Fevents_coredb_slave.sql
[07:21:36] filtering "ops" in replication would be fine, better even, for that use case
[07:22:55] wait, with filtering do you mean filtering in or out? because I was talking about including it
[07:23:39] oh, no i meant out. filter means exclude
[07:23:46] ok
[07:24:08] ops is in general filtered out
[07:25:23] I want a database that replicates in, for checksumming, maybe heartbeat, etc
[07:25:38] got it
[07:25:54] and in general, +1 sounds good
[07:26:01] i think you got the idea
[07:26:02] but heartbeat might upset me :)
[07:26:10] we used to use it
[07:26:28] you are using it on all masters right now :)
[07:26:41] yes, and i've been killing it
[07:26:52] context:
[07:27:03] heartbeat is technically fine. works great
[07:27:26] but at WMF, there are two important factors that used to make DBA life hell with heartbeat
[07:27:37] I don't need to talk about heartbeat
[07:27:44] now
[07:27:48] I understand
[07:27:57] I think I understand
[07:28:09] I am mostly thinking about checksums
[07:28:27] actually, i have to run for my daughter's birthday. will bitch about heartbeat later :)
[07:28:35] * springle late
[07:28:38] ok, bye
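A minimal sketch of what "filter ops out by default, everywhere" might look like, assuming classic my.cnf replication filters; the exact rules, and whichever database ends up being the one that does replicate checksums/heartbeat in, would still need to be agreed, so this is illustrative only.

```ini
[mysqld]
# Illustrative only: drop anything targeting the ops schema on replicas.
# With statement-based replication, replicate-ignore-db only matches on the
# statement's *default* database, so the wild-ignore rule is the safer net.
replicate-ignore-db         = ops
replicate-wild-ignore-table = ops.%
```

On sanitarium-style hosts (db1069) these rules would presumably sit alongside whatever do/ignore filtering they already carry for the private-data scrubbing.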
[10:32:25] aside from adding back es1004
[10:32:51] I failed over es1008 to make the 5.5 host es1009 the master again
[10:40:40] kept heartbeat on es1009 exactly as I find it :)
[10:41:00] *found
[12:07:17] * springle runs EXPLAIN for some queries from the dbstore1002 processlist... and cries a bit
[14:57:22] springle: jynus: We are concocting a plan to handle the leap second on Jun 30th. The idea is "try to avoid it everywhere". However we are thinking of keeping various low-importance boxes around as test subjects. Allow them to get the leap second and see how they behave. Last time, some mysql databases exhibited some high load, so we'd like to test those as well
[14:57:33] any boxes we can keep?
[14:58:40] akosiaris, that depends on the ratio between "closer to production" and "can be killed"
[14:58:59] we will btw be doing tests anyway by injecting leap seconds into labs VMs, but that can only go so far unfortunately
[14:59:17] masters we cannot live without
[14:59:19] jynus: well, your call. whatever you want guys
[14:59:55] but there are some production slaves that can be "crashed" without problem
[15:00:04] although to be fair
[15:00:25] I would start with labs- they have redundancy and it is not production-critical
[15:00:49] so in summary- first, one of labsdb100X
[15:04:17] then, maybe one slave on eqiad db2*
[15:04:26] ok. that's doable
[15:04:39] for labs, ask also the labs people
[15:04:43] db2* is codfw
[15:04:54] yes, sorry, I meant codfw
[15:05:04] ^the one you wrote
[15:05:05] we plan to leave all of codfw with ntp. It serves 0 traffic so we should not be having problems
[15:05:26] so, maybe db1047
[15:05:36] it has high load, but also hardware problems
[15:05:46] lol
[15:05:55] and no production reads
[15:06:00] ok, so high load seems to be something that contributes to the problem, based on experience
[15:06:09] sounds ok
[15:06:17] not cpu load, io load
[15:06:30] but yes, more cpu load than usual probably
[15:06:49] for labs, ask the labs people first
[15:06:50] well ok anyway
[15:06:56] it's already more than what I've asked for
[15:07:01] thanks!
[15:07:01] we own the mysql
[15:07:11] but they own the service
[15:08:00] please ping here when actually doing something, so we are ready
[15:08:07] jynus: yeah, we got plans for labs, no worries
[15:08:36] well, for production we will probably ping around the 29th when we will be disabling ntp
[15:08:55] but do expect a mail to ops/wmfall with a detailed plan
[15:08:59] oh, if you set up the ntp check in icinga
[15:09:08] it will be more than enough
[15:22:23] not wikitech-l akosiaris?
[15:23:47] Krenair: I suppose we can. no harm in letting more people know. I'll avoid discussions though ;-)
[15:24:03] especially on this channel
[15:25:45] jynus: I was referring specifically to wikitech-l
[16:22:50] akosiaris, avoiding discussions...?
[16:36:05] I discovered the culprit of the daily spikes ignoring the pool: "watchdog checking permissions"
[16:36:28] you will have to tell me more about that, sean
[16:37:51] I meant to copy "watchdog: SELECT `TRIGGER_CATALOG`, `TRIGGER_SCHEMA`, `TRIGGER_NAME`, `EVENT_MANIPULATION`, `EVENT_OBJECT_CATALOG`, `EVENT_OBJECT_SCHEMA`,"
[17:32:20] ok, tomorrow I will be able to close T101084
[17:32:29] I will also restart db1018 to check performance next week without P_S
[17:33:44] things look better than I imagined, 95% execution time of the most frequent query on production/ro: 384us
[21:13:31] someone at research has programmed a cron to basically do a DoS on a production database
[21:28:29] I commented it out, and left a warning note on the cron
[23:50:53] jynus: which cron, where? which prod db?
[23:51:11] best to !log that sort of thing, with detail
[23:51:37] (apologies if i've missed context)
[23:54:49] jynus: watchdog is the tendril user (i know, needs renaming). i guess this is on a host with many databases, like s3 or es? but what do you mean by "ignoring the pool"?
[23:57:24] we can probably turn that triggers check off. it's part of a full schema check tendril used to do, but was too heavy on table cache and permissions for s3. ganglia had a similar problem with slow I_S checking innodb stuff
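For context on that last point: the log truncates the actual watchdog statement, but a scan of information_schema.TRIGGERS of roughly the following shape is the kind of query involved. On a pre-data-dictionary server, answering it means opening and permission-checking metadata across every schema, which is what makes a "full schema check" so heavy on a host like s3 with thousands of wikis.

```sql
-- Sketch only; not the exact tendril/watchdog query.
SELECT TRIGGER_CATALOG, TRIGGER_SCHEMA, TRIGGER_NAME,
       EVENT_MANIPULATION, EVENT_OBJECT_CATALOG, EVENT_OBJECT_SCHEMA,
       EVENT_OBJECT_TABLE, ACTION_STATEMENT
FROM information_schema.TRIGGERS;
```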