[08:25:20] morning
[08:26:19] good morning
[08:30:05] for mediawiki commit messages the convention is "do something", instead of "doing something" or "done something"
[08:30:52] true, I remember the doc (now)
[08:31:12] (I do not think we are too strict for the puppet one
[08:31:19] or our SALs
[08:31:39] but mediawiki is owned by mediawiki devels, so I do as they command
[08:32:05] so let's talk a bit about priorities
[10:31:16] strange finding, on es2011 we don't have mysql-client, and mysql-common and libmysqlclient18 are at version 5.5.46-0+deb8u1, checking puppet
[10:31:35] we do not care about those
[10:32:01] libmysqlclient shouldn't have been installed, unless as a dependency of other packages
[10:32:13] all of that is included in wmf-mariadb10
[10:32:37] shouldn't /opt/wmf-mariadb10/bin/mysql be linked somewhere in the path?
[10:32:43] it is
[10:32:53] which mysql... empty
[10:32:54] see any other server
[10:33:29] for this, I'm checking why it did not work here
[10:34:00] probably it is done manually, /opt/wmf-mariadb10/install
[10:34:17] there are legacy reasons for that (installing 5.5 and 10 at the same time)
[10:34:33] probably no more reasons now, package needs to be updated
[10:34:48] ok, so I can just run it manually (the install)
[10:34:51] just run /opt/wmf-mariadb10/install with salt
[10:34:51] thx
[10:34:57] on all those servers
[10:35:04] sure
[10:35:12] it should take 0.5 seconds
[10:35:46] if you want, let's create a ticket with all improvements we want on new packages, like missing dependencies
[10:35:59] running install and all those things
[10:36:15] ok
[10:39:44] https://phabricator.wikimedia.org/T127811
[10:40:16] you were brought here to try to reduce open DBA tickets
[10:40:28] you are failing that :-D
[10:40:38] I'm opening more than closing :D
[11:47:21] some errors on db1067
[11:47:39] checking
[11:49:18] from where?
[11:49:20] most are rpc-related
[11:49:53] kibana?
[11:50:04] so not a big deal, but still 4 from api and regular traffic
[11:50:15] yes, that is the place for quick lookups
[11:51:26] you also have fluorine for greppable mediawiki errors and oxygen for http/varnish errors
[11:51:55] and that is why I want to send error.log there too :-)
[11:52:03] I was also wondering where you get the alarm from... :)
[11:52:12] no alarm
[11:52:32] just keeping an eye on it from time to time, like in a background tab
[11:52:44] generally it is not very actionable
[11:53:10] because, if a database gets errors, it will be depooled
[11:53:47] but good for trends "is it too loaded? are there mediawiki issues?"
[11:53:54] yep
[11:54:20] only cause of database problems is mediawiki
[11:54:29] which is not saying much, of course :-)
[11:55:06] eheh
[14:04:43] puppet last run on db2058 is CRITICAL: it is just a random apt-cache failure with exit status 100
[14:06:42] yep, it is good you checked that, those are relatively common
[14:07:40] aside from one-time errors, there are some race conditions when puppet master restarts, etc.
[14:07:57] ok
[14:08:41] not worrying about most of them, puppet failures are not page-worthy, they are not a service loss by themselves
[14:09:47] on my app, I have all database hosts, but I ban db1018, db1025 and puppet service
[14:10:27] which leads me to test the paging I just changed
[14:24:48] * volans monitoring load on es1 shard in eqiad
[14:25:17] we do not have a good average latency graph, that is on the TODO
[14:25:55] if we are on the wishlist... I really want innotop back :-)
[14:45:09] nobody is restricting you from installing it with puppet, if you make it work
[14:46:22] yeah I need to find the time to see what's wrong and how to fix it, and I'm definitely not a perl-guy ;)
[14:47:16] there was a version that worked on 5.6, I am not sure about mariadb10
[14:50:07] have I mentioned dbstore1001 and dbstore2001 being delayed?
[14:50:21] yes 24h right?
[14:51:00] yes
[14:51:18] with all DBs
[14:51:22] so they are constantly doing slave start/slave stop
[14:51:57] more or less, some databases have so little traffic that that doesn't work well
[14:51:57] ahhh I think that's the reason for the huge err log on the masters then
[14:52:11] yeah, it logs every time
[14:52:28] maybe there is an option to fix that
[14:52:57] I have doing that, it makes them difficult to manage
[14:53:18] if you want to do anything, and stop slave, it may unintentionally start it again
[14:54:24] true
[14:54:31] so it is a bit tricky. In fact, I disabled the event log to do some maintenance and left it off, until replication complained
[14:54:50] are we using pt-slave-delay?
[14:55:34] no, it is a custom slave
[14:55:44] pt-slave-delay fails miserably with mariadb
[14:55:56] great
[14:56:57] it is a custom event
[14:57:24] if this was 5.6, we would have MASTER_DELAY as a CHANGE MASTER parameter
[14:57:32] sadly, this is not mysql
[14:57:36] yeah
[15:58:51] jynus, hey
[15:59:02] so, ptwikimedia
[15:59:38] I don't think we really need to re-import it
[16:00:25] I would prefer it have some things removed per T126892, exported and stored in some archive. I doubt it's a particularly big DB
[16:00:43] and then we can make the new one in its place
[16:01:03] I am curious though - what would be dangerous about reimporting it?
[16:01:14] I know you need to deal with x1, external store, etc.
[16:01:26] I?
[16:01:53] I suggested using another name first
[16:02:36] I'd rather not break the convention
[16:03:01] will you be made responsible for the potential security issue?
[16:03:07] if yes, I agree
[16:03:30] let's do it now
[16:03:36] what potential security issue?
[16:04:04] providing access to potential leftovers?
[16:04:59] I do not know every single place extensions use for writes, do you?
[16:06:11] I'm not aware of anything writing outside of the main DB, centralauth, x1, external storage
[16:06:38] and do not say there is not a possibility, because you started mentioning a security issue in a publicly logged channel, something that is very ugly
[16:07:22] I don't see how that's even related
[16:08:02] I see a clear connection between that issue and the one you brought up about ptwikimedia
[16:08:27] You're being vague to the point of not really helping at all.
[16:09:17] you know my opinion, I think it is clear
[16:09:30] I made it clear on the ticket, what else do you want?
[16:10:25] Someone to either clear the way to using ptwikimedia or someone to decide that using a different name is OK
[16:10:46] my opinion is that we should use a different name
[16:10:49] it is safer
[16:11:06] Which different name fits our database naming conventions?
[16:11:39] anything that includes the words wikimedia and pt
[16:11:56] you mentioned pt2, I am ok with that
[16:13:02] I'll check with wikitech-l whether that's really OK
[16:13:06] if there is even a single row that is still somewhere, replication will break and all wikis will stop working otherwise
[16:13:15] I do not decide about the wikis
[16:13:57] I give you my best advice about databases, I have very little to no knowledge about mediawiki software, and 0 decision power on that
[16:15:16] I am not saying the domain should be pt2, I am saying that the database should
[16:15:37] do not blame me for mediawiki not being able to separate code from database
[16:31:39] I don't blame you for that
[16:31:46] most of mediawiki can cope with it
[16:32:05] the multiversion change to make it work is a little ugly but we'll live with it
[16:32:39] SPEAK FOR YOURSELF I HATE THAT CHANGE
[16:32:44] * ostriches goes back to lurking
[16:32:45] basically I care about breakage
[16:33:06] for example, there is a ptwikimedia database on external storage
[16:33:11] it is empty
[16:33:35] but I have been here long enough to assume that if something bad can happen, it will happen
[16:33:59] and it is not a theoretical threat, there are instances of that very same thing happening
[16:34:22] is it the developers' responsibility to create clean up scripts? probably
[16:34:47] but given that they do not even create proper setup scripts :-)
[16:35:09] something that you suffer every single time
[16:35:16] allow me to be pessimistic in this case
[16:35:47] it takes me 0 seconds to go to these known 3-5 places and delete everything there
[16:36:36] but I do not consider it enough, and the risks (privacy leakage/full outage) are not worth the rewards, in my own personal opinion
[16:36:58] I think for this kind of thing we need the opinion of the veterans
[16:37:43] our setup is no longer just mysql, we have restbase and other places I do not even know about
[16:38:19] disabling is easy, nuking is very difficult
[16:40:41] volans, I am not surprised by the results (I've been agreeing all the time), I was just criticizing the methods (one run is not a good performance benchmark)
[16:41:53] jynus: I agree with the principle, it was a 1h run though, we can do multiple, no problem, and even bigger in size (I've done just RAM*2)
[16:42:07] no real need
[16:42:24] let's do it, it was what was right from the beginning
[16:43:03] and if it was me, I would have done it anyway, but I wanted to protect you from having to do duplicate work
[16:43:32] and it wouldn't make too much sense to have only those ones different...
[16:43:45] repool es1*
[16:44:09] and check what is needed on the bios and on the puppet recipe (if you want)
[16:44:21] sure, CR already ready, I was waiting for a window in the deployments
[16:44:33] for the rebuild, should I/we do it or open a TT to codfw/
[16:44:34] ?
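Side note on the "one run is not a good performance benchmark" point at 16:40:41: a minimal Python sketch of how several runs of the same configuration could be summarized before comparing two setups. The function names and the "mean plus or minus 2 standard deviations" rule are illustrative assumptions, not something used in the channel, and the numbers in the usage lines are made up, not measured.

```python
import statistics


def summarize_runs(tps_per_run):
    """Summarize throughput figures (e.g. transactions/s) from repeated runs."""
    if len(tps_per_run) < 2:
        # A single run gives no idea of run-to-run noise.
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(tps_per_run)
    stdev = statistics.stdev(tps_per_run)  # sample standard deviation
    return {"runs": len(tps_per_run), "mean": mean, "stdev": stdev, "cv": stdev / mean}


def clearly_different(runs_a, runs_b):
    """Crude check (an assumption, not a proper statistical test):
    the intervals mean +/- 2*stdev of the two setups do not overlap."""
    a, b = summarize_runs(runs_a), summarize_runs(runs_b)
    low_a, high_a = a["mean"] - 2 * a["stdev"], a["mean"] + 2 * a["stdev"]
    low_b, high_b = b["mean"] - 2 * b["stdev"], b["mean"] + 2 * b["stdev"]
    return high_a < low_b or high_b < low_a


if __name__ == "__main__":
    # Made-up figures, only to show the call shape.
    print(summarize_runs([100.0, 110.0, 90.0]))
    print(clearly_different([100.0, 110.0, 90.0], [200.0, 210.0, 190.0]))
```

With only one run per configuration, `summarize_runs` refuses to answer, which is the point being made in the log.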
[16:44:40] we can test some xfs options, if you have the time
[16:45:26] well, we need at least papaul to be aware of it for future installs, so the tickets should be there anyway
[16:46:22] ok, I'll open one anyway; for XFS I can start testing and comparing es2011 with itself, if we have a gain it should apply to 256 KB stripes too
[16:46:35] yes
[16:47:03] I do not think there would be a huge difference because of LVM, as I mentioned
[16:47:18] but I am curious about noatime vs relatime
[16:48:01] writes are not a bottleneck for us, so in reality the gain would not be that great
[16:48:11] but we care a lot about read speed
[16:49:05] in fact, recently we had some stalls on new pc hosts, and I am starting to suspect they may be related to this issue
[16:49:28] relatime changes the access time only if last access time was before the modified time
[16:50:14] so if the file doesn't change the access time doesn't get updated, but for DBs I guess the innodb files are updated all the time, and so will be the access time
[16:51:58] which leads to not affecting much (?)
[16:52:14] what better opportunity to test?! :-)
[16:53:33] close to none for tables accessed a lot but not updated often, but I think some effect is noticeable on common r/w tables, I'll probably try a sysbench oltp for this
[16:54:20] jynus, oh, we don't need to worry about restbase in this particular case, pt.wikimedia.org was long gone by the time that restbase was even initially committed
[16:54:44] I hope you know what I mean
[16:55:29] complexity leads to bugs, and while I think core is ok, there are many small changes on extensions
[16:55:43] with different degrees of maintenance
[16:56:43] as I said, you are the first one to suffer those on creations, deletion is 10 times more difficult
[16:56:53] and a rename is a deletion + creation
[16:57:03] I am more faithful on creating a new wiki
[16:57:49] faithful may be the wrong word there. Confident.
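Side note on the noatime vs relatime exchange (16:47:18 to 16:53:33): a small Python sketch of the decision Linux makes under relatime, as far as I understand the kernel behaviour. Besides the "last access before last modification" case mentioned in the log, the access time is also refreshed when it is older than the inode change time or more than 24 hours old. The function name and the plain-seconds timestamps are illustrative assumptions.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_updates_atime(atime, mtime, ctime, now):
    """Return True if a read at time `now` would update atime under relatime.

    All arguments are timestamps in seconds. This mirrors the rule as I
    understand it; it is a model for reasoning, not kernel code.
    """
    if atime <= mtime:
        return True   # file was modified since (or at) the last recorded access
    if atime <= ctime:
        return True   # inode metadata changed since the last recorded access
    if now - atime >= DAY_SECONDS:
        return True   # recorded access time is at least a day old
    return False


def noatime_updates_atime(atime, mtime, ctime, now):
    """Under noatime, reads never update the access time."""
    return False


if __name__ == "__main__":
    # A table file written constantly: mtime keeps overtaking atime,
    # so reads keep triggering atime updates under relatime.
    print(relatime_updates_atime(atime=100, mtime=200, ctime=200, now=300))
    # A read-mostly file accessed recently: no update.
    print(relatime_updates_atime(atime=300, mtime=200, ctime=200, now=400))
```

This matches the intuition in the log: for files that are written all the time, relatime still updates the access time on reads, so any difference from noatime should show up mainly on busy read/write tables.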
[17:07:01] to be fair, we have testwiki and test2wiki
[17:07:09] I think it fits the pattern :-)
[17:07:29] also wikimania201[567]
[17:08:16] I'm not worried about putting numbers at the end of subdomains
[17:08:16] the important thing is the wik[i] part, as there are some checks depending on that
[17:08:16] :-)
[17:08:27] no
[17:08:39] yes, I know about the wik requirement
[17:08:44] subdomain should be the usual one
[17:08:57] what I'm not sure about is putting extra numbers after the subdomain
[17:08:58] my smile was for "it fits the pattern"
[17:09:50] volans, actually there is a good reason for those 2 to exist: to test inter-wiki features
[18:06:49] just for the record I repooled es1016 at 17:34, nothing strange there and it should be unrelated
[18:07:13] yes, it is the main dewiki servers, not es
[18:07:41] I do not blame you for every error, I just did that once yesterday because I woke up in a bad mood :-)
[18:08:33] lol :) I didn't mention it for that, just to have the full picture
[18:12:41] * volans running sysbench oltp on es2011 with current XFS options for reference
[18:45:02] benchmark completed, I have to go offline now, I'll run the other ones either later or tomorrow morning