[08:09:37] morning
[08:18:50] hi
[08:34:29] hi jynus: yesterday evening host db1026 was very laggy, and it is probably because of huge amounts of edits on wikidata
[08:35:29] could be related to the spikes of api errors I saw on db1045 yesterday morning?
[08:35:37] because of this the dewiki database is affected, too
[08:36:00] is it possible to separate the wikidata database and the dewiki database?
[08:37:06] I assume that both are hosted on db1026, aren't they?
[08:41:18] that has been considered several times
[08:41:36] but it is not easy
[08:42:27] I would suggest, doctaxon, file a ticket with the issue (wikidata affecting dewiki - but only write your personal experience, suggest that as a potential cause)
[08:42:53] so write "read only mode on dewiki", only give that as a potential solution
[08:43:03] make it blocked by the hardware upgrade
[08:43:10] and add wikidata devs
[08:43:24] and we will see what is the best course of action
[08:43:37] what should i do? sorry ... i don't understand
[08:43:37] maybe there is a simple software bug
[08:44:09] jynus: FYI I'm about to log and proceed with the T133780 deploy
[08:44:10] doctaxon, do you have an account on phabricator?
[08:44:19] yes
[08:44:24] but
[08:44:35] my english is not so good
[08:44:41] ok
[08:44:46] let me see if I can help
[08:45:00] just a moment
[08:45:52] yesterday from 17:50 UTC until after 20:30 UTC the dewiki database db1026 was very laggy
[08:46:16] according to the config, db1026 should not have any user traffic under normal conditions
[08:46:20] hoo suggested huge amounts of edits on wikidata
[08:46:34] yes, that is my hypothesis
[08:46:52] only special roles (watchlist, recentchanges, contributions, logpager; vslow, dump)
[08:47:08] so is it possible to separate the dewiki and wikidata databases on host db1026
[08:47:17] or not?
[08:47:40] doctaxon, is this what you noticed? https://phabricator.wikimedia.org/T123867
[08:48:07] doctaxon, it is possible, and it is one of the things that we have considered
[08:48:51] T123867 is something old, isn't it?
[08:49:03] but it is not immediate (we do not have the hardware right now, and even if we did, it could take a while to be done)
[08:49:59] we need to justify that this is an important issue and not something happening only occasionally
[08:50:12] jynus, there are time-based bots running on dewiki, and laggy databases are not good for them, so it is a little immediate
[08:50:38] doctaxon, is this what is happening? https://phabricator.wikimedia.org/T135100
[08:51:06] doctaxon, I do not disagree, you do not have to convince me
[08:51:14] I will reopen that
[08:51:18] this was four days ago, ya
[08:51:27] and please write why this is important for you
[08:51:32] that will help
[08:52:03] basically, tell (even in bad english) how it is affecting you
[08:52:18] almost no one here is a native speaker
[08:52:27] german?
[08:52:30] ok
[08:52:37] we can get it translated
[08:52:40] if that is easier
[08:52:54] did you reopen it just now?
[08:52:59] yes https://phabricator.wikimedia.org/T135100
[08:53:07] please give a summary of how that impacts you
[08:53:20] you are jcrespo?
[08:53:23] yes
[08:53:32] nice to meet you
[08:53:52] hi!
[08:54:01] then you know what's going on
[08:54:03] my hope is that the new servers
[08:54:11] will mitigate the issue
[08:54:22] that is why I blocked it on that
[08:54:59] however, dewiki and wikidata separation has been discussed at some point, but not planned
[08:55:17] i'll try to summarize on phab
[08:55:21] thanks
[08:55:38] that will help justify it
[08:56:05] it's still unbreak now!
[08:56:31] or not?
[08:58:34] priority is something that, if it is not set by ourselves, we do not pay much attention to; we value facts, like the ones you can provide
[08:59:28] resharding is something that will take many weeks, so probably it should be fixed immediately in a different way
[08:59:29] i didn't understand that
[08:59:57] you mean the priority settings?
[09:00:14] just please add (if you want, in german) your issues, and I will see what the follow-up is
[09:00:24] ok?
[09:01:07] forget about db1026 or dewiki - just explain your api issues
[09:04:12] volans, resharding is something that would not be possible now, but maybe a hack could be done to separate dewiki and wikicommons onto 2 different servers?
[09:04:48] in any case, it still requires more servers, aka T133398
[09:05:04] jynus: you mean at the application level?
[09:05:17] yes
[09:05:30] a bit hacky but surely could be done
[09:05:58] easier than setting up 3 servers that we do not currently have :-)
[09:06:30] maybe at the proxy level, with the things I have been researching last week
[09:06:34] :-P
[09:07:11] alternatively, throttle if someone is abusing the api
[09:08:08] :)
[09:10:19] jynus, i did the task edit
[09:14:32] thank you, doctaxon, I will check it soon
[10:12:38] when you finish that, let's talk about priorities
[10:12:50] (volans)
[10:12:57] ok
[13:19:26] volans, ping
[13:20:13] CP678: Hi
[13:22:46] volans, http://pastebin.com/ZLBsvqvi local table
[13:26:46] volans, global table http://pastebin.com/qjnaRXMU
[13:29:32] volans, paywall table
[13:29:37] http://pastebin.com/jqV60XbL
[13:31:25] CP678: you cannot add paywall_id on the externallinks_global table as NOT NULL, that doesn't work, and I would probably not add it at all; it depends on the logic of the application
[13:31:49] ?
[13:33:35] only a few urls will be behind a paywall
[13:33:57] True.
[13:34:29] But then there is the problem that the domain isn't yet flagged as a paywall
[13:35:03] So when the paywall_id isn't set the first time around, it won't be set if it later turns out that it needs to be set.
[13:36:50] So the paywall table will list the domains of all URLs and have a paywall flag set.
[13:37:04] 0 means no paywall and 1 means there is a paywall.
[13:37:08] volans, ^
[13:39:04] so it's a domain table, not a paywall one
[13:39:33] It is now, but it only serves to indicate paywall status
[13:43:16] ok
[13:44:46] I can't just list paywalled domains anymore, because if it turns out that a URL is a paywall after the fact, the paywall id is missing in future calls, which will always default to 0.
[13:45:32] So by having the domain listed for every url and assigning a paywall id to both the url and the domain, the two can always be connected in a single query.
[13:46:08] Using the rule of thumb you gave me for joins.
[13:55:19] volans, so do the tables need any other tweaks?
[13:56:51] it's surely a big improvement
[13:57:20] volans, will this also help reduce the stress on the DB when running selects, or doing inserts?
[13:59:56] surely reduced; whether it is enough to not harm the db is hard to say beforehand, it depends on the queries that the bot will do
[14:01:11] Alrighty then.
[14:03:15] Well the PC I am typing from needs a reboot. So I'm moving over to my laptop
[14:14:59] volans, so can I migrate to this new structure now?
[14:17:07] you can migrate your own bot to your new structure whenever you want, just be sure to have a good migration plan that doesn't harm the db too much :)
[14:18:49] Well I was thinking of using a PHP script to copy the rows over. The DB could use some cleanup in the process. The first step would be to populate the paywall table.
[14:19:48] Then to populate the global table, nesting a SELECT in an insert to set the paywall_id in the global table.
[14:20:20] Then to populate the local table, nesting a SELECT in an insert to set the url_id in the local table.
[14:20:44] The WHERE clause will use an indexed column.
[14:21:02] volans, Thoughts?
[14:21:38] Or is there perhaps some way to fill a row in all three tables with a single query?
[14:22:08] I feel like there is, but my experience is a bit too limited to know how.
[14:26:34] sorry, but I'm too busy to be able to evaluate which one is the best migration plan; it depends on too many factors
[14:32:36] ok
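For context, a rough SQL sketch of the migration steps CP678 describes above. The real table definitions only exist in the pastebin links and are not reproduced here, so apart from externallinks_global, paywall_id and url_id, every name below (the paywall/domain table, the old flat table externallinks_old, and the url, domain, wiki, pageid columns) is a made-up placeholder; this illustrates the SELECT-nested-in-INSERT idea, not the actual schema.

  -- Hypothetical layout: the paywall table lists one row per domain with a
  -- 0/1 flag; 0 = no paywall, 1 = paywall.
  CREATE TABLE externallinks_paywall (
    paywall_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    domain         VARCHAR(255) NOT NULL,
    paywall_status TINYINT UNSIGNED NOT NULL DEFAULT 0,
    UNIQUE KEY (domain)
  ) ENGINE=InnoDB;

  -- Step 1: populate the paywall table with every domain seen in the old data
  -- (externallinks_old and its url/domain columns are placeholders).
  INSERT IGNORE INTO externallinks_paywall (domain)
  SELECT DISTINCT domain FROM externallinks_old;

  -- Step 2: populate the global table, nesting a SELECT so each url row gets
  -- the paywall_id of its domain.
  INSERT INTO externallinks_global (url, paywall_id)
  SELECT o.url, p.paywall_id
  FROM externallinks_old AS o
  JOIN externallinks_paywall AS p ON p.domain = o.domain;

  -- Step 3: populate the local (per-wiki) table, nesting a SELECT to resolve
  -- url_id from the global table; the join/WHERE column should be indexed.
  INSERT INTO externallinks_local (wiki, pageid, url_id)
  SELECT o.wiki, o.pageid, g.url_id
  FROM externallinks_old AS o
  JOIN externallinks_global AS g ON g.url = o.url;

Chunking each INSERT ... SELECT (for example by primary-key ranges of the source table) would keep the copy from holding locks on large ranges for long, which matters for the "don't harm the db" concern raised above.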
[14:52:57] jynus: s4 api a bit overloaded
[14:54:17] I see a higher error rate with a lot of ApiQueryPageImages::execute with error "MySQL server has gone away"
[14:54:20] probably related to the ongoing video issues
[14:57:18] I see some spike at 14:06
[14:57:37] were the deployed changes around that time?
[14:59:07] the my.cnf ones?
[14:59:35] those ended earlier but depend on puppet running on the hosts, although I would tend to exclude them
[14:59:56] or the change for the traffic issue
[14:59:57] ?
[14:59:58] volans, I will be beginning the migration soon
[14:59:59] no, network
[15:00:26] around 13:12
[15:01:18] https://logstash.wikimedia.org/#/dashboard/temp/AVS6FVvr0z-7ykXOtBB2
[15:01:21] what I was referring to
[15:01:54] are you ok with https://gerrit.wikimedia.org/r/#/c/288420/ ?
[15:08:22] I've thought about $master, and I think we should have 2 processes running, one on each datacenter, in (session) STATEMENT format with half the frequency
[15:08:33] for pt-heartbeat
[15:08:52] that way we do not have to fail them over
[15:09:22] the only problem is we would be writing to a read-only host, and the relay log would not be empty
[15:09:58] I agree with a solution that doesn't require changes during failover; it will also check replication the other way around
[15:10:28] yes, we can implement a monitor to check that both datacenters can see each other
[15:10:42] and it avoids pt-heartbeat being a SPOF
[15:10:57] if the master fails, it still goes to read only
[15:11:05] but not if only heartbeat fails
[15:11:32] but on datacenter failover, the only thing we do (and only just in case) is to set read_only on and off
[15:11:34] yes
[15:12:03] I wanted that from the beginning, but I wanted to see any potential issues
[15:12:26] we have to check, however, that the lag check uses max() both on mediawiki and icinga
[15:13:13] and there may be issues with GTID, but we will see those later
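As a footnote to the pt-heartbeat discussion: a minimal sketch of what a max()-based lag check could look like on a replica, assuming the stock pt-heartbeat table (heartbeat.heartbeat, one row per writer keyed by server_id) and heartbeat processes started with --utc. It only illustrates why max() matters when two datacenters write heartbeats; it is not the actual MediaWiki or Icinga check.

  -- With one pt-heartbeat writer per datacenter, a replica's heartbeat table
  -- holds two rows (one per writer server_id). Taking MAX(ts) picks the most
  -- recently replicated heartbeat, so the check does not report the staler
  -- row from the remote datacenter as local lag.
  SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
  FROM heartbeat.heartbeat;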