[08:04:07] good morning
[08:05:57] morning
[08:08:55] for the p_s I've removed it, the only ones that will change are es1 in eqiad, which didn't have it on. Was it off for some specific reason?
[08:09:05] https://gerrit.wikimedia.org/r/#/c/288205/
[08:09:10] not really
[08:10:44] I am thinking of changing parsercache to STATEMENT-based replication (not temporarily, forever) to create active-write MASTER-MASTERS
[08:11:14] insert ignore/on duplicate key update?
[08:11:20] REPLACE
[08:11:33] which is what is used now
[08:11:54] instead of an INSERT that gets "converted" under ROW
[08:12:13] performance should not vary, but data drifts can happen
[08:12:49] which I think everybody will be ok with, as it is a cache
[08:13:09] (assuming that the rest of the database has the same contents)
[08:13:17] or eventually has the same contents
[08:13:22] :-)
[08:18:10] makes sense; adding both pc hosts per shard to the application config would add more HA to them too
[08:19:40] not really HA
[08:20:01] but being a cache, it is not a real db, meaning that reads generate writes
[08:20:25] so right now, even depooled, errors are generated due to them being in read-only
[08:21:16] yes, but the app would go to the codfw one in case the eqiad one is depooled in the conf
[08:22:16] or crashes, etc...
[08:23:44] ?
[08:24:05] you mean cross-datacenter queries?
[08:26:00] I don't know exactly how MW uses the 3 shards pc1/2/3, but if pc1004 in eqiad crashes it loses the pc1 cache, right?
[08:27:14] right now yes
[08:27:31] that is ok for a cache system (memcache-style)
[08:27:46] but the issue is that the keys should be static
[08:27:54] (the sharding, I mean)
[08:44:10] https://gerrit.wikimedia.org/r/#/c/287394 should be ready too, I'll apply it live to all s* masters (eqiad+codfw)
[09:27:55] ok, just log it
[09:29:07] sure
[09:40:31] actually, having most writes happen in less than 50ms is good news for galera
[09:43:16] are you referring to T131753?
[09:43:17] T131753: Semi-synchronous replication status in all MySQL production clusters - https://phabricator.wikimedia.org/T131753
[09:54:04] yes
[09:56:23] where did you get 50ms from?
[10:00:19] I think it is 100, I just assume "a normal distribution between 0 and 100", which doesn't have to be true
[10:01:06] my point is "smaller than I thought"
[10:01:59] I'm not sure it is the same thing, because if I understand it correctly Rpl_semi_sync_master_tx_avg_wait_time should be a good indicator, but it's below 1ms
[10:02:29] so I'm wondering whether that counts only the additional wait for the semi-sync, not the whole transaction time
[10:02:34] but it's just a guess
[10:03:08] yeah, actually semisync does not wait for it to be committed
[10:03:19] only that it is on the binlog
[10:03:32] so my assumptions are completely wrong
[10:03:46] nothing to do with write time + roundtrip
[10:04:11] it's write time to the relay log, not into mysql
[10:05:17] yep
[10:06:04] also in 5.6 it gets committed first and then written to the binlog, and not the other way round
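(Editor's note: a minimal sketch of how to inspect the semisync counters discussed above; it only shows the standard plugin status/variable names as an illustration, nothing specific to these hosts.)

    -- On a master, the semisync plugin exposes its counters as status variables.
    -- Rpl_semi_sync_master_tx_avg_wait_time is the average extra wait for a slave
    -- ACK (the event reaching a slave's relay log), reported in microseconds per
    -- the plugin docs -- it is not the whole transaction time.
    SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%';

    -- The enablement and timeout knobs are global variables:
    SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master%';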
[11:00:13] for https://gerrit.wikimedia.org/r/#/c/288361 I would like to agree with you on a deploy strategy... it affects any DB that uses mariadb::config (almost all)
[11:00:55] 1) disable puppet on all those hosts (assuming I can find them all), merge, force puppet on one slave in codfw, restart mysql there
[11:01:44] 2) assuming the risk of a DB restart in a short period is very low, same as (1) without disabling puppet
[11:06:04] yeah, maybe 1 is not needed, but I would do it like this
[11:06:23] because I would not want to be the guy that deletes all production databases
[11:06:39] wait
[11:06:42] restart?
[11:06:49] ah, to test the config
[11:06:57] ok with that, but secondary
[11:07:09] I can disable puppet on a host and do it manually (rm the link)
[11:07:11] before merging
[11:07:11] just enable it in batches
[11:07:24] see ^
[11:08:20] definitely safer to disable puppet on a single host, do the unlink manually, restart mysql, and if everything is ok then it can be merged
[11:08:37] oh, I am not worried about the file
[11:08:45] it is a config file
[11:09:10] it is unexpected side effects due to a puppet bug or whatever
[11:09:18] I know I am being paranoid
[11:09:27] on this I fully agree
[11:09:42] but how to enable it in batches? disable puppet everywhere and re-enable it in batches?
[11:09:47] I read about a solved bug that deleted the linked file instead of the link
[11:09:59] which was like oh my god!
[11:10:29] it is that, or horrible unpuppetized rubbish that somehow does something strange
[11:10:59] oh my....
[11:11:03] horrible
[11:11:12] so, because you are asking: disable puppet on all dbs, enable it on codfw, check a couple of hosts
[11:11:29] enable it on half of eqiad, then the other half
[11:11:40] ok, but first I need to find all the hosts :)
[11:11:58] db* is a good enough filter
[11:12:10] didn't you create a salt thingy?
[11:12:42] there are some old hosts still, but we do not care about those
[11:12:51] this affects mariadb::config, hence all classes from mariadb.pp (misc, phabricator, eventlogging, beta, tendril, dbstore, analytics, coredb, sanitarium, wikitech, labs, parsercache, client)
[11:13:11] + analytics_cluster::database::meta, labs::db::master, labs::db::slave, labs::dns (in various files)
[11:13:51] yes, but I would say only core and pc will cause outages
[11:14:14] seek +1 from labs (chase?) and analytics (otto?)
[11:14:19] if you want
[11:14:42] andrew maybe for silver
[11:15:12] volans, ping
[11:16:04] ok, let me make a list
[11:16:19] hi Cyberpower678
[11:17:26] volans, hi. So how's this for the global table? http://pastebin.com/yh37P59C
[11:18:06] Cyberpower678: I'm a bit busy with a delicate change right now, I'll take a look later if that's ok
[11:18:27] volans, ok. I set the paste to expire in a day though.
[11:19:06] no problem, I've opened it in a tab, hopefully my browser cache lasts longer :)
[11:22:52] volans, oh and assume last_deadCheck is a BIGINT.
[11:28:40] ack
[12:40:47] volans, here's the local table
[12:40:48] http://pastebin.com/N76SdMWN
[13:08:52] db1054 issues
[13:09:01] same as tonight?
[13:10:17] seems to be ApiQueryRecentChanges::run
[13:10:30] api, I would assume?
[13:11:28] "Complex regular subexpression recursion limit (32766) exceeded at /usr/bin/pt-kill line 4772."
[13:11:37] there you have the issue
[13:14:53] apart from a proper solution, as a workaround you can wrap it in a while true with a sleep 10
[13:18:27] yes
[13:18:40] I should puppetize it
[13:18:58] but again, I didn't want to because that is not a proper solution either
[13:19:41] agree, the famous temporary-permanently-stabilized workarounds ;)
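(Editor's note: a hedged sketch of what the pt-kill wrapper above is there for — spotting and killing long-running API queries on db1054; the time threshold and LIKE pattern are made up for illustration and are not taken from the actual pt-kill configuration.)

    -- Long-running queries show up in the processlist; MediaWiki embeds the calling
    -- method (e.g. ApiQueryRecentChanges::run) as a comment in the query text.
    SELECT id, user, time, LEFT(info, 100) AS query
    FROM information_schema.processlist
    WHERE command = 'Query'
      AND time > 300                            -- illustrative threshold
      AND info LIKE '%ApiQueryRecentChanges%';  -- illustrative pattern
    -- then, per offending connection:
    -- KILL QUERY <id>;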
[13:24:51] https://tendril.wikimedia.org/host/view/db2040.codfw.wmnet/3306 ?
[13:24:56] it's me
[13:25:00] see SAL
[13:25:00] ah
[13:25:01] ok
[13:25:03] np
[13:25:28] it is the counter bug that confused me
[13:26:03] yeah, it always fools us
[13:46:21] volans, I have a question for you.
[13:46:34] I want to introduce a separate paywall table.
[13:46:53] It will only contain the domains of sites behind paywalls.
[13:47:49] My question: is it better to determine if a URL is behind a paywall by creating a domain column in the global table and doing a JOIN, or to simply query for the presence of domains by parsing out the domain and running a separate query?
[13:49:49] why even save those URLs? if they are behind a paywall they're unarchivable, so why not just skip them?
[13:51:08] There's more to it
[13:52:46] volans, the bot scans URLs to check whether they are dead.
[13:53:04] If they are behind a paywall, the bot should know that.
[13:53:16] So it saves paywalled domains.
[13:53:34] yes, check if the domain is paywalled; if it is, skip the url, don't even save it
[13:53:48] How would that work?
[13:54:09] The only known reliable way is to check for a subscription tag on a source.
[13:54:18] And not all sources are tagged with it.
[13:54:43] depends on whether you get the URL from the page/api or from your cache that is not up to date
[13:55:30] The URL is retrieved via the page on Wikipedia
[13:55:46] There's no good way to GET the URL and see if it's a paywall
[13:56:02] So it's not as easy as simply skipping it.
[13:56:48] no, I'm saying: you get the URLs from wikipedia, then extract the domain; if it's in your paywalled list (db, application, cache in memory, wherever) you skip that url
[13:56:48] If one source is tagged as a paywall, then it should flag the other sources as paywalled too.
[13:57:21] We're talking past each other.
[13:57:35] I'm asking from a DB query standpoint
[13:57:51] Obviously what you described is what I'm going to do.
[14:00:41] The original question when retrieving URLs from the cache is: is it better to put the domain in the domain column of the respective URL and JOIN with the paywall table to flag it as a paywall, or to simply load the URLs from the cache with one query, extract the domains from all the URLs, and query for the presence of those domains in the paywall table in a separate query?
[14:02:48] volans, ^
[14:02:53] always use primary keys for joins, and use IDs for primary keys
[14:03:45] Why is that?
[14:04:06] Why can't indexed columns be used?
[14:06:28] db1026 lag issues are probably caused by vslow/dump and rcs
[14:08:46] ack
[14:11:50] Cyberpower678: I'm saying use an id in the paywall table, use that id in the url table, and do the join on that field, not on the varchar field of the domain name
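(Editor's note: a minimal sketch of the id-based join being suggested here; the table and column names are hypothetical, since the pastebin schemas linked above are not reproduced in the log.)

    -- Paywalled domains get their own row and surrogate id...
    CREATE TABLE paywall (
      id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      domain VARBINARY(255) NOT NULL,   -- varbinary to sidestep charset/index-length concerns
      UNIQUE KEY (domain)
    );

    -- ...and the url table references it by id instead of repeating the domain string.
    CREATE TABLE url (
      id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      pageid     INT UNSIGNED NOT NULL,
      url        VARBINARY(767) NOT NULL,
      paywall_id INT UNSIGNED NULL,     -- NULL = domain not known to be paywalled
      KEY (pageid),
      KEY (paywall_id)
    );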
[14:12:59] jynus: the db2040 test went well, the db restarted fine, show global variables are the same as on a sibling
[14:13:48] from my puppet-grep-fu I've extracted this list of hosts where to disable puppet to deploy it:
[14:13:51] db*, dbstore*, es*, pc*, labsdb*, labservices*, silver, holmium
[14:13:54] sounds reasonable?
[14:14:10] (I'll check that it's not already disabled for any reason)
[14:14:17] volans, thanks
[14:16:36] so you can do something like select xxx from url left join paywall on paywall.id = url.paywall_id where pageid = 12345
[14:21:39] Cyberpower678 ^^^
[14:23:17] ok
[14:30:42] I am going to enable GTID on x1 and crash those 2 servers I want to decom
[14:32:20] ack
[14:57:46] I'm delaying the deploy of the double my.cnf, better to do it on an EU morning, with fewer people deploying and merging stuff on all those servers
[14:58:58] sure, ok with that
[14:59:26] what I usually do is have like half a dozen pending changes
[14:59:42] (although I am not sure it is by choice :-P)
[14:59:53] and if I get blocked on one, I work on another
[15:00:15] I may have scared you too much now
[15:01:15] ahahah no, it's not you, I don't like the workflow of deploying those kinds of changes with puppet
[15:01:28] I wish there was an easier way to say "deploy only to these hosts"...
[15:01:55] it is true it is a common need
[15:02:26] "test this only on a few hosts"
[15:02:34] maybe using some sort of tagging for "canary" hosts
[15:02:51] the problem is canaries for ops
[15:03:05] it is not like mediawiki, where we actually have those
[15:03:36] yes, canary in the sense of one per shard for example, preferably in the passive datacenter
[15:03:37] maybe some kind of selective puppet-merge for only one host?
[15:03:42] also
[15:04:02] puppet-merge --only=db1026
[15:04:35] I am not sure how the puppet master could allow that without having 2 services
[15:04:54] but the need is real
[15:05:28] maybe bring it up to ops, probably someone else already thought of a way but didn't have the time to implement it
[15:05:47] worth a try
[15:06:47] I have good news
[15:06:51] for a change
[15:07:42] I can do STOP SLAVE; CHANGE MASTER TO MASTER_HOST='db1031.eqiad.wmnet'; START SLAVE; from a node to its grandchild and it just works now
[15:09:55] awesome!
[15:10:35] no more --child-to-sibling (it is faster and works at any level)
[15:11:01] now let's see about transactional replication
[15:12:25] also no more headaches with delayed slaves
[15:18:51] there is one thing I feared
[15:19:43] what?
[15:19:45] killall -9 mysqld
[15:19:49] This is cool: Slave I/O thread: connected to master 'repl@db2033.codfw.wmnet:3306', replication starts at GTID position '0-171970580-285632831'
[15:19:51] but
[15:20:02] [ERROR] mysqld: Table './heartbeat/heartbeat' is marked as crashed and should be repaired
[15:20:08] ;-)
[15:20:17] yes, myisam
[15:21:02] GTID doesn't solve everything :)
[15:21:15] oh, it will not solve the labs issues
[15:21:36] but this, combined with only-InnoDB tables
[15:21:43] will solve most of the issues
[15:21:48] sure!
[15:22:04] to the point that we might even not have to do skip-slave-start (not really)
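(Editor's note: a sketch of the obvious follow-up to the MyISAM error above — repair the heartbeat table and move it to InnoDB so a kill -9 no longer leaves it marked as crashed. It assumes the standard heartbeat schema and that the heartbeat tooling is fine with InnoDB; check that before changing the engine for real.)

    REPAIR TABLE heartbeat.heartbeat;
    ALTER TABLE heartbeat.heartbeat ENGINE=InnoDB;
    -- confirm the engine change
    SHOW TABLE STATUS FROM heartbeat LIKE 'heartbeat';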