[08:06:30] we just had a spike in connections to es1008, which caused some connections to be denied - probably related to lag
[08:07:02] hello
[08:07:27] moritzm: hmm, should we just do this in -operations? more people, easier to !log...
[08:08:32] most of them are jobs
[08:08:36] so no user notice
[08:08:43] I was thinking of using -databases since it's the more targeted audience, so as not to spam -operations too much (we can still log things in -operations)
[08:08:47] but I'm fine either way
[08:08:48] ok
[08:08:53] moritzm, YuviPanda do things on -operations
[08:08:56] ok!
[08:09:18] this is not a substitute for that
[08:09:27] all ops should be aware
[08:11:03] +1
[08:11:12] ok
[08:11:30] I promised that people would not need to subscribe here
[08:12:05] and they will complain otherwise (with good reason)
[08:12:18] before we proceed with enabling it, I would like to double-check whether we need to disable connection tracking for port 3306. what rate of connections are we seeing for the databases?
[08:12:58] the connection table size is 64k, and when it maxes out, no further connections are accepted
[08:12:59] new connections/s?
[08:13:22] active connections in general
[08:13:41] (let's switch to -operations :) )
[08:14:30] for some services (elastic, swift, mediawiki) we've disabled it
[13:10:46] jynus: ok to merge https://gerrit.wikimedia.org/r/#/c/233670/ now?
[13:12:14] moritzm, yes
[13:12:26] ok, thanks
[13:17:36] db1048 enabled, looks all fine
[13:19:46] yeah, that one is easy
[13:19:58] we could try the master, too
[13:20:46] sure, how about https://gerrit.wikimedia.org/r/#/c/233671/ ?
[13:21:00] or master only?
[13:22:54] yes, that is ok; just for that one, let's make sure we do it "live"
[13:22:57] in fact
[13:23:14] let's do it early tomorrow
[13:23:55] because they usually do not like Phabricator-related maintenance during California working hours, ok?
[13:24:08] does that work for you, moritzm?
[13:24:36] sure thing, let's do that tomorrow morning euro time!
[13:24:42] perfect, then
[13:25:43] most of codfw is probably safe, too
[13:26:39] yeah, I reckoned so wrt codfw
[14:08:26] AndyRussG, thank you, I will get back to you; do not worry, I won't forget
[14:08:40] ok? :-)
[14:16:25] jynus: sounds great! really appreciate it! :)
[14:18:09] jynus won't forget, but might be like Phabricator: having a lot of connections at once ;)
[14:19:27] that is why he should ping me and not the other way around
[14:39:35] ok, so back from the meeting, AndyRussG, we can discuss the issue now or later
[14:49:27] I see your comment now, I will ping you back at 16:00 UTC
[14:54:41] jynus: yeah! just running out the door now :)
[14:54:45] thx again!
[14:54:54] np
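[Note: a minimal sketch of the connection-tracking checks and the port-3306 exemption discussed above at 08:12-08:14. The 64k table limit comes from the log; the commands and rule placement are illustrative assumptions, not the configuration actually deployed.]

```bash
# Sketch only: checking conntrack pressure on a database host and
# exempting MySQL traffic (port 3306) from tracking, as discussed above.

# Current usage vs. the table limit (new connections are dropped once it fills):
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# How many of the tracked entries are MySQL connections:
conntrack -L -p tcp 2>/dev/null | grep -c 'dport=3306'

# Exempt port 3306 from tracking in both directions (the "disable it"
# approach mentioned for elastic/swift/mediawiki):
iptables -t raw -A PREROUTING -p tcp --dport 3306 -j NOTRACK
iptables -t raw -A OUTPUT     -p tcp --sport 3306 -j NOTRACK
```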
[15:22:03] jynus, I'm wondering about https://phabricator.wikimedia.org/T106847 - are you planning to make a new s* cluster and move some larger wikis into it, get new drives to add to the current s3 hosts and move some data onto those, or get larger replacements for the existing drives?
[15:22:36] no plans yet
[15:22:55] I see
[15:23:05] the philosophy is consolidation
[15:23:14] Are any of the above definitely not going to happen?
[15:23:24] not?
[15:23:57] new hosts and disks confirmed, SSDs wanted
[15:24:09] new cluster highly unlikely
[15:24:23] moving some of them around, maybe
[15:24:26] I see
[15:24:38] my only serious concern would be wikidata right now
[15:24:54] that shares with dewiki, right?
[15:24:58] we would make it dedicated if it becomes a large thing
[15:25:06] but I am speculating
[15:25:29] I do not intend to make large changes at all: "if it works, do not touch it"
[15:25:35] heh, ok :)
[15:25:42] Is it possible to move databases to different clusters like that without a large amount of downtime?
[15:25:58] yes, not easy, but yes
[15:26:12] I guess you could replicate it to the new cluster and eventually swap mw to use the new servers for it?
[15:26:22] yes, that is how it is done
[15:26:48] especially combined with new hardware, it is easier
[15:27:18] but the general idea would be having fewer nodes per shard
[15:27:41] because right now we have lots that are very weak
[15:28:13] with SSDs we have to economize on hosts, but we win in overall capacity
[15:28:43] again, all speculation, but that was Sean's plan, and I intend to stick to it
[15:29:33] when we start seriously working on that, we will have to perform a per-database load and disk-usage analysis
[15:30:14] also, s3 is starting to not scale with so many objects, but that may change with newer hardware or config changes
[15:48:11] why does es1 not show in tendril?
[15:48:53] I asked Sean the same thing :-)
[15:49:01] it is read-only
[15:49:17] the page is generated dynamically from the actual topology
[15:49:26] not the one in the git repo
[15:49:30] which is nice
[15:49:59] but as there is no replication between the read-only nodes, it would not know that they are part of the same shard
[15:50:11] you can also call it a bug
[15:50:47] it is difficult to generate that graph dynamically - multi-source replication, circular replication, etc.
[15:51:10] but Sean did a great job
[15:55:34] interesting, yeah. that probably is a bug, but the rest of the graph is pretty good
[15:56:03] Sean wrote tendril?
[15:56:13] yep, at least most of it
[15:56:24] so patches welcome
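[Note: a minimal sketch of the replicate-then-swap move described at 15:26. Hostnames and replication coordinates are hypothetical; this is the generic MySQL mechanism, not an actual Wikimedia runbook.]

```bash
# Sketch, not a runbook: moving a database to a new cluster with little
# downtime by replicating it there first and swapping MediaWiki at the end.
# db-new.example / db-old.example and the binlog coordinates are made up.

# 1. Seed the new cluster's master from a backup, then let it catch up:
mysql -h db-new.example -e "
  CHANGE MASTER TO
    MASTER_HOST='db-old.example',
    MASTER_LOG_FILE='db-old-bin.001234',
    MASTER_LOG_POS=98765;
  START SLAVE;"

# 2. Wait until lag reaches zero:
mysql -h db-new.example -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master

# 3. Set the old side read-only, let the last events replicate, then
#    repoint MediaWiki's db config at the new cluster. The effective
#    downtime (read-only window) is roughly the length of this step.
mysql -h db-old.example -e "SET GLOBAL read_only = 1;"
```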
[16:04:18] jynus: hi! I'm back at the keyboard, available to look at the schema change any time you like :)
[16:04:44] AndyRussG, great, 1 sec and I am 100% with you
[16:04:51] K thx :)
[16:05:51] so, first of all, apologies if my email answer sounded harsh
[16:06:32] I didn't intend it to
[16:06:44] I just wanted to know more of the context of the schema change before applying it
[16:07:12] in particular, I need the following information:
[16:07:26] the database or list of databases to apply it to
[16:07:43] whether it depends on any commit
[16:08:00] whether it has been tested on testwiki/beta
[16:08:13] and whether it is backwards compatible
[16:08:25] with current code
[16:08:31] jynus: K cool! np also BTW
[16:08:50] in reverse order... it is backwards compatible
[16:09:08] that backwards compatibility has been tested, in fact, since it was deployed on the beta cluster a while before we deployed the code that uses it
[16:09:25] ok, great: that helps me have more confidence in it :-)
[16:09:44] you would be surprised at the state patches sometimes arrive in :-)
[16:10:39] heh yeah I understand...
[16:11:01] We've been testing it on the beta cluster for a few days, though admittedly we haven't been focusing a lot on database operations; those were smoke-tested locally quite a lot, though
[16:11:12] could it be applied right now?
[16:11:16] and to what databases?
[16:11:17] yep!
[16:11:52] It would go to all databases for wikis with the CentralNotice extension, which I think is all wikis
[16:12:03] yes, all except labswiki
[16:12:17] cool!
[16:12:33] Also, most of the change is just to add new tables to support the new feature we're rolling out
[16:12:35] ok, so let me tell you what I will do now
[16:12:50] yes, actually, the new tables are not a problem
[16:12:55] The only change to existing data is for our CentralNotice log table
[16:13:02] you could do that yourself
[16:13:08] the main issue is the ALTER
[16:13:33] even on small tables, if there are lots of reads, it can create metadata locking
[16:13:53] I've written about that on previous ALTER TABLE requests
[16:14:16] ah hmm
[16:14:25] I will provide more info about that, just in case you are interested
[16:14:30] what is more important:
[16:14:45] I will now check the size and the traffic of those tables
[16:14:50] on all wikis
[16:15:00] Ah OK
[16:15:16] Actually this is a defect of CentralNotice: the log table is only actually used on meta
[16:15:38] (It's pretty old code that we're cleaning up bit by bit)
[16:15:40] if I do not see any issues, I can leave the ALTER running on the masters, and that's it
[16:15:57] but I need a bit of time to rule that out
[16:16:45] but we recently had issues with one, and we had to do it slowly, one server at a time
[16:17:18] so, my main point is that it is not guaranteed, even for changes that look trivial like yours, ok?
[16:17:33] it == the time it will take to be done
[16:17:48] does that make sense?
[16:17:53] yep! :)
[16:18:14] ok, so let me do the checks, and I will get back to you in a few minutes
[16:18:20] K thanks!
[16:23:17] CentralNotice is not on various wikis, actually: advisory, login, fiwikimedia, quality, ukwikimedia, vote, all fishbowl and all private wikis
[16:23:35] ok
[16:23:41] here is the news
[16:24:31] which happily makes your comment, Krenair, more reassuring
[16:24:46] there are no cn_* tables in production except on testwiki and metawiki
[16:24:59] that simplifies the process a lot
[16:26:03] So, the main issue I find with schema changes (which is why I want to do them carefully) is that I do not understand 100% of the code, and many people do not understand 100% of MySQL
[16:26:14] so we need to work together :-)
[16:27:25] jynus: ah OK
[16:27:48] Hmm ok good that the cn_* tables are only on those wikis, that does make sense
[16:27:55] so, when I see "apply to all databases", I think the table is on all databases
[16:28:30] jynus: I thought it was, because I'm pretty sure the automatic MediaWiki updater applies it indiscriminately (/me checks)
[16:28:52] let me check, maybe they are on the x1 server
[16:29:55] if you assure me that CentralNotice is on enwiki, I can assure you that there are no CentralNotice tables on enwiki
[16:30:17] jynus: CentralNotice has two modes of operation. One where it uses the tables, and one where it doesn't
[16:30:26] that is ok
[16:30:43] as I said, it simplifies the process a lot by them not being there
[16:30:46] AndyRussG: maintenance/update.php is not used on Wikimedia so it doesn't happen automatically
[16:31:02] The built-in MediaWiki DB updater adds the tables indiscriminately no matter which mode is configured, which is why I thought the tables were everywhere... Yeah my bad, glad it simplifies things...
[16:31:06] Not quite true, Glaisher
[16:31:13] It's not used in WMF production. Labs uses it
[16:31:34] so, is it a problem for MediaWiki that they are not there?
[16:31:42] because we can fix that later
[16:32:10] for me (maintenance-wise), I prefer it as it is now
[16:32:17] jynus: no, no problem. The only non-test wiki that uses the mode where it needs the tables ("infrastructure" mode) is meta
[16:32:26] perfect then
[16:32:54] then let me check the read traffic, and if it is ok, I will be able to do it in 5 minutes :-)
[16:32:58] CentralNotice is the system that displays banners, for Fundraising and community announcements and such. Wikis in "infrastructure" mode control the distribution of banners, and those in "subscribing" mode just display them
[16:33:05] cool!
[16:34:05] sorry for my lack of knowledge of MediaWiki - I have to spend so many hours with the databases that I have no time to look at the code much
[16:34:17] that is why you have to treat me like a 5-year-old
[16:35:41] also, by warming the table up in the cache, we will avoid lag issues
[16:36:25] jynus: don't worry, I think I'd need the same approach when hearing about databases ;)
[16:37:04] also, MediaWiki code and production differ, sometimes a lot :-). I need to double-check
[16:37:43] right!
[16:38:01] Here is the WMF config where we set the variable that determines which mode a wiki is in: https://git.wikimedia.org/blob/operations%2Fmediawiki-config/7e787b51d0db64e4928fb581709f6ebed0e7fdfc/wmf-config%2FCommonSettings.php#L1446
[16:40:12] So that's only metawiki and testwiki
[16:40:40] it is ok; I work with the actual tables, so even if that were wrong, I made sure to apply it only where the tables exist
[16:41:02] ah right, makes sense
[16:41:45] Glaisher: Reedy: thx btw :)
[16:43:16] ok, I see no high activity on that table, so that rules out most metadata locking issues
[16:44:34] cool! ¡descartado!
[16:52:57] ok, the ALTER TABLE is already done; I only got 4 seconds of lag on the weakest servers
[16:53:24] I will be doing the CREATE TABLEs, but those are trivial and can be done by anyone
[16:54:02] fantastic!
[16:55:51] what is the policy for labs for those tables? should they be replicated at all?
[16:56:58] jynus: they're already on the beta cluster. Yes they're used there, we have a beta copy of metawiki: meta.wikimedia.beta.wmflabs.org
[16:57:14] (not sure if I answered your question...)
[16:57:25] Nope
[16:57:26] sorry, I didn't mean the beta cluster on labs
[16:57:32] Do they want replicating to labs?
[16:57:35] but the replicas we offer publicly
[16:57:57] to almost anybody
[16:57:58] I suppose it is a no
[16:59:28] SHOW TABLES on metawiki_p doesn't show any cn_* table
[17:00:12] I am trying to involve a few more people in security, that is why I am asking
[17:00:16] oh I see... yeah we don't have replicas there (pretty sure we could, though)
[17:00:35] the main issue is whether they contain private data
[17:00:42] they should be filtered
[17:01:14] so on schema changes I usually ask if they are going to contain private data, to make sure it does not even reach the labs hosts
[17:01:56] ok, so that finishes it
[17:02:09] let me post the final structure on the ticket so it can be validated
[17:02:13] and we are done
[17:02:20] woohoo! \o/
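[Note: a sketch of the pre-ALTER size/traffic check jynus describes above (16:13-16:43). The exact queries he ran are not in the log; these are standard MySQL ways to get the same information, and the table name is illustrative.]

```bash
# Sketch: checking table size and read activity before deciding an ALTER
# can simply be left running on the masters.

# How big is the table? (row count is an estimate for InnoDB)
mysql -e "SELECT table_schema, table_rows,
                 ROUND((data_length + index_length)/1024/1024) AS size_mb
          FROM information_schema.tables
          WHERE table_name = 'cn_notice_log';"

# Is anything reading it right now? A long-running SELECT holds a shared
# metadata lock; an ALTER queued behind it then blocks every new query on
# the table ("Waiting for table metadata lock"), even if the table is tiny.
mysql -e "SHOW FULL PROCESSLIST;" | grep -i cn_notice
```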
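[Note: on the labs question at 16:55-17:01 - a sketch of verifying that the tables never reached the public replicas, plus the generic MySQL mechanism for filtering private data out of a replica. The hostname and filter pattern are assumptions, not the actual sanitization setup.]

```bash
# Sketch: confirming the cn_* tables are absent from the public labs
# replicas, as done above with "SHOW TABLES on metawiki_p".
# labsdb.example is a hypothetical host.
mysql -h labsdb.example metawiki_p -e "SHOW TABLES LIKE 'cn\_%';"

# The generic MySQL knob for keeping tables with private data out of a
# replica is a replication filter in the replica's my.cnf, e.g.:
#   replicate-wild-ignore-table = %.cn\_%
# (illustrative pattern only; the production filtering setup may differ)
```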
[17:02:47] is T90915 the right ticket?
T110963
[17:03:13] probably
[17:04:32] jynus: https://phabricator.wikimedia.org/T104508
[17:04:47] oh, there is another one :-)
[17:05:05] jynus: https://phabricator.wikimedia.org/T110963
[17:05:19] yeah whoops, I think you had the right one first
[17:05:50] can you blame me for asking you questions, with so many tickets ;-)
[17:06:00] heh certainly I cannot
[17:06:17] Yeah, the other one I sent was more general, I guess, covering labs and prod
[17:07:01] indeed it's a bit confusing
[17:07:49] https://phabricator.wikimedia.org/T110963#1593622
[17:07:58] So, please, please
[17:08:30] next time you need a schema change, please tell us as much as possible in advance
[17:08:39] even if the change is not yet firm
[17:09:32] we never know when there will be a spike in DBA work, especially these days when I am a "single point of failure", until I can get some help
[17:09:50] jynus: you bet! yeah I understand completely
[17:09:53] and again, apologies if I was too "NO" at first
[17:10:29] I will always be willing to help
[17:10:46] no problem of course, also apologies that we reached out only at the last minute 8p
[17:11:02] no problem, these things happen
[17:14:03] :)
[17:27:08] This applies to much more than DBA work. Setting up new sites, for example... *grumbles*
[17:29:38] It's the same for a lot of WMF requests
[17:29:44] Everything is last minute, needs doing now, then isn't used
[17:30:03] Krenair: fwiw, setting up new sites is something releng plans to have a quick turnaround for.
[17:31:28] ostriches, wasn't that new wikis?
[17:31:29] Yeah, I thought that was what you meant :)
[17:31:29] lol
[17:31:31] No, I was moaning about third-party services for misc things
[17:31:59] oh, ignore me
[17:32:08] * ostriches goes back to coffee
[17:32:52] ostriches, good luck with getting DNS+Puppet access btw
[17:33:12] Well, new language projects don't need much (anything?) in the way of puppet.
[17:33:16] It's just DNS, really
[17:33:33] Depends on the type of site
[17:34:00] New special sites can probably wait
[17:34:51] Didn't the Apache config use to be separate from puppet?
[17:35:51] yes it did
[17:36:32] ostriches, maybe you can get the ops duty person to perform these sorts of changes
[17:36:40] I think it's missing the point though
[17:38:05] It can take an unreasonably large amount of time to get ops to approve simple changes
[17:38:40] puppetswat at least ensures that stuff doesn't simply get lost, assuming you sign up for it
[17:38:56] sorry to say this - but might I ask you to continue discussing on -ops, or in any other channel? I promised people to keep this channel just for contacting us, and that people should not feel obligated to stay here except for DBA requests or db discussions. I apologize to all of you for asking this.
[17:38:58] But I remain unconvinced that anyone actually watches the operations/puppet repo for new patches
[17:39:04] right, sorry
[17:39:47] I do not care about the occasional informal chat, though
[17:40:06] otherwise, people will complain to me :-)
[17:40:58] also, selfishly, it helps keep the logs short for me to check when I am not logged in. Sorry again
[17:41:51] thank you!
[17:44:57] jynus: so... Percona Live :)
[17:45:10] (felt tempted to start an informal chat)
[17:45:15] ha
[17:46:22] 2 talks: one about query optimization that I intend to aim at MediaWiki developers
[17:46:39] and another about MySQL at wiki(p|m)edia
[17:47:18] https://www.percona.com/live/europe-amsterdam-2015/sessions/mysql-wikipedia-how-we-do-relational-data-wikimedia-foundation
[17:47:56] it will be basically 50 minutes saying how bad Cassandra is /s
[17:50:43] 50 minutes of that? damn, wish I had decided to go now :P
[18:02:35] Sean is speaking as well?
[18:02:36] probably not
[18:03:16] (family is more important)
[18:04:02] but I will be talking about 90% of the things he has done