[01:45:44] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3699673 (10Legoktm) @Marostegui how would you suggest to move forward then? I saw your comment on T109179#3773638, but not sure wher... [06:16:44] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3776864 (10Marostegui) @Legoktm we can move forward with this task normally. We will just have to do other shards until we move back... [06:47:46] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Patch-For-Review: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3776902 (10Marostegui) [06:50:16] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Patch-For-Review: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3776904 (10Marostegui) 05Open>03Resolved This is all done now [07:26:50] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3776923 (10Marostegui) In which wiki are those 243,627 global edits? [07:27:43] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3776924 (10Marostegui) 05Open>03Resolved [08:11:53] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3776960 (10Marostegui) 05Open>03Resolved These two servers have been fully pooled in s5. [08:15:25] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3776967 (10Marostegui) db1101.s5 is now replicating and catching up [08:15:32] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3776968 (10Marostegui) [08:39:14] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and deploy schema change on dropping oresc_rev_predicted_model index - https://phabricator.wikimedia.org/T180045#3777003 (10Marostegui) I have dropped the index on `enwiki`on db1089... [09:28:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3777092 (10Marostegui) >>! In T180927#3774972, @Papaul wrote: > The ILO is up to date. I need to update the Storage and BIOS on the system but the Service pack disk that i have is old, the... [10:28:40] 10DBA, 10MediaWiki-Database: MySQL field aliases in select() do not use any quoting - https://phabricator.wikimedia.org/T105728#3777268 (10Marostegui) 05Open>03Resolved a:03jcrespo Closing this as it is quite old, there has been no activity and I guess there is not much else to do after Jaime's reply and... [10:34:24] did db2068 alter worked well? [10:34:42] sorry [10:34:47] not db2068 [10:35:12] that is down, pending a firmware upgrade [10:35:32] db2085 [10:39:47] db2085 still on going [10:39:57] I saw the alert, but no errors were reported [10:39:59] no, it is still ongoing, aparently [10:40:09] well, I attended the alert :-) [10:40:18] deleted a bunch of old binary logs [10:40:23] oh [10:40:24] you did? 
[10:40:26] ah [10:40:30] your APP woke you up [10:40:31] :( [10:40:32] 145G [10:40:36] wow [10:40:40] I think it will grow to 250 GB [10:40:46] down from 500GB [10:40:57] it was at 5% [10:41:02] it didn't wake me up [10:41:20] I was awake depooling ores from frwiki and dewiki [10:41:59] what?? [10:43:20] https://gerrit.wikimedia.org/r/#/c/392535/ [10:44:30] :( [10:45:47] 10DBA, 10MediaWiki-Database: MySQL field aliases in select() do not use any quoting - https://phabricator.wikimedia.org/T105728#3777312 (10jcrespo) 05Resolved>03Open This is a real bug, and just it is not a DBA issue, it is a mediawiki-database issue. [10:47:08] I do know know why close tickets that are correct, just not something we are going to work on [10:48:01] 2 years without any activity I assume it is not going to be ever worked on [10:48:07] really? [10:48:13] Specially if he edits the task and says that dbk works fine [10:48:18] But ok, my bad [10:48:36] that is not the task, did you read the summary? [10:49:01] He edited the task to say: dbk works fine [10:49:39] https://phabricator.wikimedia.org/T6715 [10:49:50] from 2006 [10:50:00] you are now doing the schema change to apply it [10:50:38] ok fair [10:50:55] that task is about missing quuoting, which could be a security issue [10:51:10] the particular problem was fixed, but that was not what the bug was about [10:52:01] you are right in that we may not ever work on it- in that case you just remove the DBA tag [10:52:11] but not delete a correct ticket [10:52:21] will do [10:52:38] people get angry if not [10:52:56] aaron will not but $random reporter will [10:54:39] I do believe we have to do a massive clean up on tickets, because it is impossible to know what needs/has to be worked on with almost 300 tickets over there [10:54:42] But oh well [10:56:36] you mean on DBA? [10:56:41] yep [10:56:54] well, that ones has disappeared from it [10:57:37] we have so many non-actionable tickets that are just sitting there as reminders [10:57:48] for example? [10:59:37] https://phabricator.wikimedia.org/T85266 https://phabricator.wikimedia.org/T138562 https://phabricator.wikimedia.org/T119626 [10:59:42] those three for example [10:59:56] good things to keep in mind, but do we use tickets as reminders? [10:59:58] ok, meta epic [11:00:21] precisely go there because they are not immediately actionable [11:00:25] or https://phabricator.wikimedia.org/T146821 [11:00:43] but I do not see what is the problem with them [11:00:52] we are working on backups [11:01:07] and it serves to have a look at actual actionables happening [11:01:26] you do not think we should not work on T119626 ? 
[11:01:28] T119626: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626 [11:01:40] we should, of course I think we should [11:01:53] the fact that we do not have time doesn't mean the problem disappears :-) [11:02:23] But we have that in mind every single day, I just think a ticket isn't the best solution or isn't giving anything extra here [11:02:35] I guess I am having an: shit-we-have-so-much-tickets kind of days [11:02:37] ok, I actually mentioned this before [11:02:39] It will go away, no worries ;) [11:02:43] the tasks vs bugs [11:03:11] but the decision was to keep using phabricator so we didn't have 20 different systems per group [11:05:47] ok T146821 is nonsense [11:05:49] T146821: Spike: Look into transaction isolation level and other tricks for easing db contention - https://phabricator.wikimedia.org/T146821 [11:06:05] but you do not go and just close it [11:06:28] you comment "this makes no sense to us", and then delete the DBA tag [11:06:35] ok [11:07:03] probably with better wording [11:07:27] I think your problem [11:07:49] is that people randomly add us to tickets "to be aware/give advice" [11:08:00] could be, yes [11:08:03] that is what the not dba/external [11:08:05] column is [11:08:13] for [11:08:28] but keeps growing and growing and it is a bit of a pain [11:08:37] you either remove dba or move things there, then don't look at it :-) [11:08:51] it never shrinks and sometimes it is hard to cope with that fact that tasks keeps piling up [11:08:56] ha ha [11:09:16] you have the existential crisis I had when you went onboard [11:09:45] that is anxiety, and it is not good [11:10:13] I cannot help much with that, but I can try [11:10:18] hehe [11:10:23] It will go away, no worries :) [11:10:56] DBA tasks are going down since you onboarded: https://phabricator.wikimedia.org/maniphest/report/burn/?project=PHID-PROJ-hwibeuyzizzy4xzunfsk [11:12:28] yeah, a bit [11:12:46] but still it is overwhelming to see it this size [11:13:06] 2) your job is not closing tasks, is to make mysql being up [11:13:59] it could be worse: 2) your job is not closing tasks, is to make mysql being up [11:14:04] https://phabricator.wikimedia.org/maniphest/report/burn/?project=PHID-PROJ-5hj6ygnanfu23mmnlvmd [11:14:14] haha [11:14:15] yeah [11:16:59] Anyways, I have to go, check my calendar ;-) [11:17:03] will be back in a bit [11:17:05] wish me luck [11:47:37] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3777529 (10jcrespo) @Legoktm it just slowed down the deploy, which means due to the fundraising we are not sure it can be completed... [11:56:58] 10DBA, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3777551 (10jcrespo) Really interesting patch- MySQL 8.0 will introduce SKIP LOCKED and... [11:57:24] hehe # DBA tasks exactly the same as when manuel joined [12:03:29] jynus / marostegui (when you're back): so mediawiki now has support for dynamic configuration, right [12:03:34] not deployed yet I believe but it has been written [12:03:47] do you think we can use that to do slave monitoring and depooling? 
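As an illustration of the dynamic-configuration idea raised above, here is a minimal sketch of a consumer long-polling an etcd key that holds the list of pooled replicas, so that something like a proxy config generator or a monitoring job could react when a server is depooled. It assumes the etcd v2 HTTP API and a made-up key path; it is not how MediaWiki, conftool or pybal actually consume etcd.

```python
"""Minimal sketch (not WMF code): long-poll an etcd v2 key holding the list of
pooled replicas and yield the new list on every change. Endpoint and key layout
below are hypothetical."""
import json
import time
import urllib.request

ETCD = "http://etcd.example.org:2379"             # assumption: an etcd v2 endpoint
KEY = "/v2/keys/dbconfig/s5/pooled-replicas"      # hypothetical key layout

def fetch(url, timeout=70):
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def watch_pooled_replicas():
    """Yield the replica list once now, then again every time the key changes."""
    doc = fetch(ETCD + KEY)
    node = doc["node"]
    yield json.loads(node["value"])
    index = node["modifiedIndex"]
    while True:
        try:
            # etcd v2 long poll: returns once the key changes after waitIndex
            doc = fetch(f"{ETCD}{KEY}?wait=true&waitIndex={index + 1}")
        except OSError:
            time.sleep(1)                         # timed out or hiccup; poll again
            continue
        node = doc["node"]
        index = node["modifiedIndex"]
        yield json.loads(node["value"])

if __name__ == "__main__":
    for replicas in watch_pooled_replicas():
        print("pooled replicas are now:", replicas)
```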
[12:04:12] <_joe_> we need to do wome work on our side [12:04:26] <_joe_> in oder to be able to use that for db-$dc.php [12:04:53] mark: sure, that is one way to solve the issue, I think [12:05:03] <_joe_> well, we are technically able to do it, we're just not happy with the kind of validation we have right now [12:05:15] _joe_: But I thought mediawiki had problems with it, or those got solved? [12:05:27] mediawiki platform team worked on it last quarter [12:05:28] e.g. not respecting TTLs or timeouts? [12:05:46] <_joe_> sorry, bbl [12:06:03] mark: you know if those blockers got solved? [12:06:08] i am not sure [12:06:13] i will have a look at the task [12:06:18] but i don't know what the original blockers were [12:06:24] happened during my parental leave I think [12:06:30] joe and riccardo worked on it [12:06:49] it was going to be used for primary datacenter config, but they decided against [12:07:28] https://phabricator.wikimedia.org/T156924 [12:07:30] yeah [12:07:33] mark: https://phabricator.wikimedia.org/T156924#3269464 [12:07:33] and since it has been worked on [12:07:57] volans: are you aware of the state of issues you run into? [12:08:29] not up to date, no, I just know that some work was done in beta, I can check it [12:08:41] ok, we can ask [12:08:48] I see https://phabricator.wikimedia.org/T156924#3269464 [12:08:57] and that is in a way related to the mediawiki issues [12:09:01] *mysql-mediawiki [12:09:24] although it has some others with pileups that probably would be solved by centralicing monitoring there [12:10:16] mark: another reason why the issues are "new" is that there are now more or more powerful mediawiki application servers [12:11:08] but I think the checking model changed with pt-heartbeat which was needed for muti-dc [12:11:16] right [12:11:44] I mentioned the possiblity of handling connection externally- pybal, haproxy [12:11:50] not because I want it [12:12:06] but as an altenrative to reimplement old stuff on mediawiki itself [12:12:28] there's the (central?) monitoring component [12:12:33] and there's the connection routing component, I guess [12:12:48] how can mediawiki db config in etcd help the lag check? would it be moved outside of mediawiki and do a depool on etcd? [12:12:52] yeah, master-slave split and coordination can be stil be at mediawiki [12:12:58] volans: that's what I was thinking [12:13:26] bit the issue here is with slaves, which is in thory an easier task to acomplish [12:13:44] volans: mark: that is another question [12:13:59] should we mix dynamic config and dynamic state? [12:14:04] fair [12:14:22] in higher layers, etcd is for config but state is still controled by pyball, right? [12:14:27] mark: that requires to write some new stuff that does that in a reliable way, etc... [12:14:32] volans: yes ;) [12:14:46] i assume an already spawned mediawiki is not going to use updated etcd config [12:14:51] so this probably wouldn't solve everything [12:15:04] yes it would [12:15:16] does it get new config during a request? [12:15:19] i have no idea how it works [12:15:26] no, that is another issue with the model [12:15:37] once a request is ongoing, config is not reloaded [12:15:43] it doesn't require restart, not sure in the middle of a reques,t probably not [12:15:49] yeah ok [12:15:52] which was another reason to make the load balancing external [12:16:13] that one is a pain for failovers [12:16:14] but it doesn't create much availability issues [12:16:27] e.g. 
it would make read_only time smaller [12:16:35] what scenario are you talking about now? [12:16:49] sorry, I meant the config not being frequently reloaded [12:16:58] ok [12:17:15] we don't have any db proxies between mediawiki and dbs, right? [12:17:20] no [12:17:46] both Sean and I were thinking of using them at some point [12:18:06] and I mentioned some articles of how github does it [12:18:28] https://githubengineering.com/context-aware-mysql-pools-via-haproxy/ [12:19:10] obviously, we need discussion [12:19:35] yes [12:19:40] I mentioned to aaron and said that proxying (or some kind of monitoring/connection handling) outside would be needed at some point [12:19:45] a proxy solution would help control retries etc [12:19:51] and he sseemed open about it [12:20:31] the thing is, right now I do not know if we have "time" to do a large refactoring [12:20:57] as in, we could patch the largest issues, and then think about new parts [12:21:08] refactoring in mediawiki or outside of it? [12:21:23] it is more of a question- [12:21:44] if we can do a quick patch to solve most of the issues on my ticket [12:22:00] or if it is so difficult to solve all that we should go directly to a different model [12:22:07] it woud probably need a test setup and a lot of analysis [12:22:11] yep [12:22:27] mark: you have a lot of expertise on loadbalancing [12:22:37] pybal/lvs is rather different [12:22:42] I know [12:22:57] but I would like to know your input as the main issue are the non-state ones [12:23:21] but of course, there is statefull things to have into account- possibility of lagging, roles, etc. [12:23:27] yes [12:23:35] so, first step should probably be to gather data? [12:23:38] i don't know what currently exists [12:23:41] logstash errors [12:23:48] do we have grafana dashboards related to this? [12:23:52] lag per server etc [12:23:54] probably we do :) [12:23:57] yes [12:24:25] probably makes sense to figure out what the most common cases are going wrong [12:25:24] I think there are very similar to the ones I linked about (volans asked to test) [12:25:27] and then we have hhvm & php7 in the mix [12:25:28] fun [12:25:31] plus some extra ones [12:25:41] regarding lag checks stampedes [12:26:00] I might miss some historical background here, why was pybal discarded as an option for db load balancing? 
[12:26:00] and more ways in which a mysql can fail [12:26:10] volans: it wasn't discarded [12:26:15] it just is not enough [12:26:16] i'm not sure pybal existed [12:26:28] so mediawiki db load balancing support goes back a long way [12:26:33] ok [12:26:35] i'm sure we had multiple servers back in 2004, or earlier [12:26:38] and pybal started existing in 2006 [12:26:39] as mark said, there is the routing and the logic [12:27:18] the logic, actually, of the load balancer is ok [12:27:25] and yeah, pybal/lvs can't do things like retry on failure [12:27:42] ...or even observe failure on the connection itself [12:27:48] sure [12:27:51] which actually is not a bad thing [12:27:57] I think based on logs [12:28:08] that trying to observe failure there is worse [12:28:20] well [12:28:25] because I think the issue (I have to prove it) [12:28:26] there is monitoring the state of databases [12:28:31] and there's reacting to your queries failing [12:28:42] is if the load balancer fails to detect the failure [12:28:47] mediawiki will of course need to continue doing the latter [12:28:56] each individual thread tries to do something about it [12:29:18] the problem is it retries to failed hosts [12:29:35] because probably it hasn't updated its config [12:29:55] so probably that is the root of pileups/timeouts [12:30:21] then we have the jobqueue, which tries very aggressively to retry queries [12:30:25] that would be "easy" to fix with a proxy solution [12:31:01] yeah, I think based on empirical results [12:31:14] that the main issue is down or degraded servers are not fully depooled [12:31:27] which creates continuously failing queries [12:31:31] which leads to more problems [12:31:55] that would explain why problems go away when we depool failing slaves [12:32:03] yeah [12:32:08] so the risk of "fixing" this in mediawiki [12:32:13] if we're sure that this is the problem, we could have mediawiki reload the config on DB failure once it's on etcd [12:32:14] is that it may behave differently again with php7 [12:32:26] volans: that would be worth investigating I guess [12:32:41] yeah, I think, assuming that hypothesis [12:33:03] that is the question- do we try to patch mediawiki or do a larger architectural change [12:33:16] which, on the other hand, may be needed in the future anyway [12:33:25] for easier orchestration by ops [12:33:53] it is not that easy, though [12:34:15] there is the "monitoring check": is this server up and up to date [12:34:30] but there is another check which is "wait until this server catches up" [12:34:56] e.g. an edit will wait until the slave is up to date to serve data to the user [12:35:09] that makes things complex [12:35:35] is that mediawiki looping over lag queries or something? [12:35:46] while (lag > X) noop; ? [12:36:00] it literally connects to mysql and performs a lock query until the gtid position [12:36:05] ah [12:36:10] so the wait is inside mysql [12:36:19] that was one of the things that piled up last time [12:36:23] and that can give very long lived connections? [12:36:34] in theory they have a timeout [12:36:48] but my (mysql) query killer was killing 60 second waits [12:38:10] that part should be checked on why the timeout didn't work, or if it was very large, etc [12:38:36] this we could test in beta I assume [12:38:37] or why it was receiving new queries after it failed (jobqueue with outdated config)?
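To make the "lock query until gtid position" mentioned above concrete, here is a sketch of the kind of blocking wait a client can issue on a MariaDB replica using MASTER_GTID_WAIT() (MySQL has WAIT_FOR_EXECUTED_GTID_SET() for the same purpose). The host, credentials, GTID value and the use of pymysql are placeholders; this is not MediaWiki's actual LoadBalancer code.

```python
"""Illustration only: a blocking "wait until the replica reaches this GTID
position" call, with a timeout, as described in the conversation above."""
import pymysql

def wait_for_position(replica_host, gtid_pos, timeout=10):
    conn = pymysql.connect(host=replica_host, user="wait_user",
                           password="secret", connect_timeout=3)
    try:
        with conn.cursor() as cur:
            # Blocks inside MariaDB until the replica has applied gtid_pos,
            # or until `timeout` seconds pass (in which case it returns -1).
            cur.execute("SELECT MASTER_GTID_WAIT(%s, %s)", (gtid_pos, timeout))
            (result,) = cur.fetchone()
            return result == 0          # True = replica caught up in time
    finally:
        conn.close()

# e.g. wait_for_position("db1101.eqiad.wmnet", "0-171970580-12345", timeout=10)
```

A wait like this ties up one connection per waiting client for up to the timeout, which is consistent with the 60-second waits the query killer was reported to be killing.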
[12:39:05] beta only has a master and a repliica, but yes, I do not see why it cannot be tested [12:39:15] obviously most of the issues are related to load [12:39:24] anomies problem was [12:39:42] that maybe the problem was because there is a check per server [12:39:56] and those can pileup because so many mediawiki servers [12:40:18] even if there is already some queing on the server itself [12:40:19] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3777679 (10Steinsplitter) >>! In T180946#3776923, @Marostegui wrote: > In which wiki are those 243,627 global edits? https://commons.wikimedia.org/wiki/Special:Centr... [12:40:25] right [12:40:40] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3777680 (10Steinsplitter) [12:40:52] probably introducing some proxies [12:40:55] starting simple [12:40:57] would not be a bad idea [12:41:01] although it creates another spof [12:41:02] wikidata edits are quite high and after all, they dispatch to all wikis, so everthing is possible [12:41:06] :-) [12:41:15] which then needs to be fronted by LVS I guess ;) [12:41:25] mark: I wanted your expertise in general, but it doesn't have to be now, of course [12:42:13] i am going to do some more tech work again soon [12:42:17] but probably not enough to solve this ;-) [12:42:25] I thought about that, but the proxies need to be synced to avoid split brains- although not a big deal to load bvalance reads [12:42:26] but happy to discuss, yes [12:42:36] yes yes [12:42:38] <3 statelss [12:43:02] but strangely, reads is what are failing now [12:44:10] could start simple [12:44:15] with a proxy on each slave [12:44:25] alternatively, we can put a proxy on each mediawiki server [12:44:35] that actually redirects to a designated backup when there are problems [12:44:37] that would also allow for validation of partial deployment [12:44:59] so no need to failover the proxy [12:45:00] that means that each client trying to use that (failing) slave is consistent [12:45:13] but of course, if the entire server with proxy fails, that's back to mediawiki [12:46:33] ok, you you are not load balancing there, your idea is only to failover [12:46:38] yes [12:46:51] i haven't thought this through yet [12:46:53] i'm just brainstorming [12:46:57] I know [12:46:58] so do I [12:47:06] ok, not fully onboard [12:47:14] because the other issue we had with mediawiki [12:47:17] (yes other) [12:47:27] is not behaving well with network drops [12:47:36] yeah [12:47:42] it seems timeouts with hhvm have not been reliable [12:47:42] last time when there was network maintenance [12:47:44] is my impression? [12:47:51] yes, you got what I meant [12:48:10] althought that is something it has to be solved yes or yes [12:48:18] but yeah if we're migrating away... [12:48:19] there will be always something to connect to that can fail [12:48:32] be it a proxy or a server or a middleware [12:48:33] problem with proxies on each mediawiki host is also consistency [12:48:40] I know [12:48:52] *but, for a read only service, it could be nough [12:49:10] the problem would be to load balance the masters [12:49:20] <_joe_> I am just done reading the backlog, I have some comments. 
[12:49:42] <_joe_> so as far as etcd-for-mediawiki goes, it's for dynamic configuration [12:50:01] I do not think is the way to go, I was just proposing it as a way to validate on only some servers [12:50:12] (answering to mark ^) [12:50:15] <_joe_> values are always going to be changed after some human intervention, or from a script [12:50:22] <_joe_> not from the software itself [12:50:51] <_joe_> that requires way more complexity than we've implemented [12:51:03] _joe_: so proxy still required, right? [12:51:04] i should read up on the etcd stuff [12:51:16] <_joe_> jynus: what do you mean? [12:51:19] and dynamic config should write proxy config [12:51:21] <_joe_> or better [12:51:37] not software should write dynamic config, is that what you mean? [12:51:39] proxy config should read etcd ;) [12:51:41] <_joe_> what problem do you want to solve specifically? because I can see many [12:52:00] _joe_: that is the problem, there is not a single patchable problem [12:52:09] <_joe_> one is "pooling/depooling a db requires a sofware deploy" [12:52:09] i propose we start investigating what the most common problems are [12:52:11] I think the current loadbalancer model is bad [12:52:27] try to isolate them from eachother, and then see what (simple) solutions we can come up with [12:52:31] with not a ton of work hopefully [12:52:33] <_joe_> another is "if one slave fails, we have an immediate degradation of service" [12:52:39] for now, I only want to solve one problem- which is the inestability that happens whtn a single replica has problems [12:52:46] trying to get this right from the start is probably unrealistic atm [12:52:58] now, maybe it makes no sense to do a small patch [12:53:05] <_joe_> jynus: so the latter problem I stated? [12:53:17] and do other things like migrating to dynamic config at the same time [12:53:47] deploy to change config is bad, but not urgent to be fixed [12:53:59] it caused delays, extended downtimes [12:54:12] but it does not causes outages, right? [12:54:27] _joe_: yes [12:54:36] i would agree that we should first make mediawiki behaviour to failing slaves more resilient [12:54:39] however, they are not completely isolated problems [12:54:45] and then automate/optimize config changes (including pool/depool) [12:54:54] <_joe_> the way I see it, that problem is not impossible to solve. 
My vision for that would be something like: a load-balancer / proxy that can depool servers that have a high lag temporarily, with some protection against herd effects [12:54:55] the latter is in scope for the multi-dc program though [12:55:20] let's say we setup a proxy- the proxy has to be controlled by something (even if that hasn't be there from the first day) [12:56:03] <_joe_> jynus: the proxy *configuration* as in *the list of pooled servers* is going to etcd [12:56:17] _joe_: your vision is exactly what I sent on the link: [12:56:23] <_joe_> but the actual state of each backend, such as its current lag or ability to accept connections [12:56:29] https://githubengineering.com/context-aware-mysql-pools-via-haproxy/ [12:56:33] <_joe_> it should be managed by the proxy [12:56:43] if pool goes before < n [12:57:00] it goes to a different proxy pool, allowing degraded replicas [12:57:34] simiarly, but not exacly what LVS does [12:57:37] <_joe_> I'm not sure I understand what you just said, I'll read the article [12:58:04] i will read it too [12:58:07] * volans lunch, bbiab [12:58:14] anyway, we cannot just say that a magical proxy will solve all problems [12:58:23] <_joe_> of course not [12:58:24] i would be happy with solving some problems ;) [12:58:27] ha ha [12:58:33] <_joe_> it can solve some, but it will create some more [12:58:40] I would take one of the test mysql hosts [12:58:43] the funny bit [12:58:50] <_joe_> like, how do we guarantee waitforslave() has any meaning? [12:58:51] is that we would probably have more control over the proxy (haproxy or other) [12:58:52] as ops [12:58:57] <_joe_> in mw context [12:58:59] which is sad considering we write mediawiki in house [12:59:01] _joe_: that was my question before [12:59:34] load balancer does more things on top of failing over [12:59:42] does replication control and other stuff [13:00:04] on the other side, it is the connection handling and pileups and timeuts that struggles with [13:00:20] it doesn't help HHVM/PHP has "bugs" [13:00:23] just migrate to php7 already [13:00:29] :) [13:00:32] let's test is already [13:00:48] did you had feedback that it made some things better? [13:00:55] or arou you joking? [13:01:11] i did not have any feedback [13:01:13] ok [13:01:16] i am partly joking [13:01:25] but yeah, that behaviour can/will change with that migration [13:01:27] I would like to have some feedback from mediawiki [13:01:32] developers [13:01:38] all two of them [13:01:42] I think we ops have a partial vision [13:01:47] and so do they [13:02:19] to wrap up my part of the conversation [13:02:37] <_joe_> so [13:02:42] <_joe_> I jut read the article [13:02:43] this is bad because a single server out of 100 lags or goes slow or gets down [13:02:48] <_joe_> I think it's a very good idea [13:02:52] _joe_: don't take it literally [13:02:59] but it is an example of how to go [13:03:10] <_joe_> jynus: well actually most of it is really appliable to our situation [13:03:30] <_joe_> jynus: and the main idea they had, ofc is that of having the "good"/"lagged" pools [13:03:33] goal for next quarter? 
;-) [13:03:55] _joe_: I think you hated my idea of putting proxies on your mediawiki servers :-) [13:04:06] <_joe_> in that scenario, waitforslave() will DTRT in most cases, but I still have to consider the edge cases [13:04:45] other thing I saw [13:04:53] <_joe_> jynus: I still kinda do, but if we create a local http microservice that responds to the proxies and they don't all poll mysql continuously, it's better [13:04:53] now sure how was doing that [13:05:06] *who was doind that [13:05:12] <_joe_> the main problem is we will have to coordinate like 100s of haproxies [13:05:14] but some pileupes where solved [13:05:26] by creating a centralized wait system [13:05:50] so threads will announce the position they were waiting for and subscribed to a service [13:05:58] s/poolcounter/waiter/ [13:06:03] yes [13:06:06] but for databases [13:06:23] actually exactly that [13:06:33] but applied to replication contrl [13:06:39] yeah [13:07:05] and that is independent of failover/monitoring [13:07:20] can do both [13:07:32] <_joe_> btw, the way github implemented the whole thing is a bit raw, but the idea is not bad after all [13:07:35] _joe_: I talked to proxy sql developer [13:07:47] because of syncronizing setup [13:08:00] there is a libary for syncronized quorum [13:08:04] <_joe_> jynus: yeah I'd avoid that. [13:08:10] and he was going to implement it [13:08:15] ok [13:08:35] <_joe_> I would very much prefer to limit the shared state to a bare minimum [13:08:36] as I said, syncronicity is not a big deal for reads [13:08:42] <_joe_> yes [13:08:59] and we are not thinking for now to take over master issues [13:09:08] so that would be, in a way, a non-problem [13:09:52] if a single proxy goes bad and a single mediawiki /half the fleet is pointing to a wrong server because a glitch is not that bad [13:10:20] right now it takes a minute to deploy mysql changes [13:11:01] masters would probably a more centralized/redundant setup [13:11:09] when/if it happens [13:11:42] other people do hierarchies of proxies (not sure if we have such a large fleet to need that) [13:12:08] anyway, thanks for your input, going to lunch [13:12:16] <_joe_> one problem i see with haproxy is bandwidth. we'd need a pool of haproxies, hierarchies help with that too [13:12:32] bandwidth? really? 
[13:12:55] probably he means bandwidth not literally, asin conenction throughput [13:13:20] not bandwidth, mysql traffic and its checks would be really small [13:13:57] <_joe_> mysql traffic to the appservers is small indeed, I never really checked [13:14:21] yup [13:14:21] a proxy would in the future help with both TLS, connection pooling and other stuff on the bright side [13:14:28] <_joe_> I assumed it would be around 1 gb/s :) [13:14:32] nah [13:14:50] it only goes crazy with some particular kind of queries I will not mention publicly [13:14:56] <_joe_> yes jynus that's the kind of thing that makes me prefer a proxy to using pybal [13:15:15] <_joe_> be it haproxy or proxysql [13:15:41] i agree [13:15:55] pybal/lvs is great, but not for this [13:16:56] 14:14:27 <_joe_> I assumed it would be around 1 gb/s :) [13:17:00] what is the problem then :) [13:18:27] brb [13:29:02] * volans back and read the article [13:29:45] actually I now remember to have read it when it came out, it's part of my RSS feeds, and I din't like much the approach [13:30:11] in particular: [13:31:03] - no mention of an effective protection if the HTTP status server is not responding, seems that HA Proxy will effecively depool all of them (in this case I prefer Pybal approach) [13:31:57] - the status server checks mysql locally, good for lag and other stuff, might have false positive/negative for remote connections (routing, firewall, etc.) [13:32:44] <_joe_> volans: I think it was quite clear in the article that the "lagged" pool doesn't depool servers unless they respond 404 [13:33:13] <_joe_> I'm not sure what you mean with the second part [13:33:20] option httpchk GET /ignore-lag [13:33:26] they don't depool lagged servers [13:33:42] if the httpcheck fails (no reply, timeout) I think it will depool [13:34:27] I mean that if you deploy a firewall rule that blocks 3306 but not the port of the status server on the mysql host, it will still be pooled and traffic sent to it [13:34:33] while mysql is not reachable [13:34:58] <_joe_> so you suggest both ports should be checked [13:35:00] I don't recall how flexible is HAproxy with multiple checks in AND/OR [13:35:09] <_joe_> me neither [13:35:33] <_joe_> but I don't think we'd use their approach [13:35:57] <_joe_> btw, the reason why they use xinetd (which I hate) is exactly to solve the problem you were mentioning, I think [13:36:07] <_joe_> crashing xinetd is quite hard [13:37:29] sure, you can make the code that xinetd run fail though :D [14:02:39] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3777924 (10Marostegui) Thanks! When do you want to this? [14:08:10] is xinetd still a thing with systemd? [14:15:46] I didn't bring down db2068 because papaul asked me not to [14:16:11] as in, he would do it himself, not that it is a big deal [14:17:05] ah, well, i guess it is fine :) [14:24:19] marostegui: as you see above, we had a conversation and we have solved all wikimedia problems already [14:24:44] yeah, I was getting a tea and reading it XD [14:25:09] the bad news is that as there are no longer bugs or tasks to do, we have to fire you [14:30:40] marostegui: TL;DR: 42 (that's the answer) [14:32:54] I am going to install the proxies on mysql::clients [14:33:00] mariadb::clients [14:33:19] not only they will work for performance, we can also use them to test them ourselves [15:13:32] btw [15:13:35] the out of band on db1063 [15:13:37] has that been fixed? 
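As a purely illustrative take on the health endpoint discussed above (the GitHub article serves one via xinetd, and a local HTTP microservice was suggested as an alternative), the sketch below answers an HAProxy `option httpchk` probe: 200 when the local replica is reachable and within a lag threshold, 503 otherwise, plus an /ignore-lag path that only fails on hard errors. The port, threshold, credentials and the use of pymysql with SHOW SLAVE STATUS are assumptions, not the production design.

```python
"""Sketch of a local replica health endpoint for an HTTP-based proxy check.
Healthy = MySQL reachable and lag under MAX_LAG; /ignore-lag only goes
unhealthy when MySQL itself is unreachable or replication is broken."""
from http.server import BaseHTTPRequestHandler, HTTPServer
import pymysql

MAX_LAG = 30        # seconds; hypothetical threshold
CHECK_PORT = 9201   # hypothetical port the proxy polls

def replica_lag():
    """Return Seconds_Behind_Master, or None if MySQL/replication is broken."""
    try:
        conn = pymysql.connect(host="127.0.0.1", user="check",
                               password="secret", connect_timeout=2)
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
                return row and row.get("Seconds_Behind_Master")
        finally:
            conn.close()
    except pymysql.MySQLError:
        return None

class Check(BaseHTTPRequestHandler):
    def do_GET(self):
        lag = replica_lag()
        healthy = lag is not None and (self.path == "/ignore-lag" or lag <= MAX_LAG)
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(("lag=%s\n" % lag).encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", CHECK_PORT), Check).serve_forever()
```

Two HAProxy backends polling the plain path and the /ignore-lag path would roughly reproduce the article's "good" versus "lagged" pool split; it would not by itself address the remote-reachability concern raised above, since the check talks to MySQL over localhost.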
[15:13:41] any idea why it was broken? [15:13:55] that was fixed with the reboot I believe [15:14:43] We don't know why it got broken, I believe akosiaris had this nightmare about….what if all the ssh session were taken of the idrac monitoring? [15:14:51] But we don't really what happened with it [15:16:39] we have the "logical reason", as in, sda disappeared [15:17:05] not the real reason, but the most common cause is RAID failure in other hosts [15:17:11] ? [15:17:20] what does RAID have to do with .mgmt ? [15:17:29] oh, sory [15:17:39] ignore my comments [15:17:49] anyway, no I don't think we 've followed up on it [15:17:53] talking about other stuff [15:18:00] let's make that an actionable [15:18:17] in any case, we do not plan of pool it as master any time soon [15:19:25] I can connect now to it [15:19:34] yeah [15:19:39] I will check now db1070 and db1071 [15:19:41] but it's a bit worrying that ilo is not reliable [15:19:46] could happen on other hosts too [15:19:54] real and failover master [15:20:00] if monitoring makes it actually less reliable... hehe [15:20:13] in any case, the ILO slowed us down a bit in confusion [15:20:22] but it was not that critical [15:20:34] mark: i would be surprised if that was the case, I guess we should have seen it before, but who knows [15:20:37] ah, so you think monitoring could cause issues? [15:20:45] i wouldn't be suprised by that [15:20:52] those embedded systems often have small limits [15:20:55] there was a list of hosts with ILO problems [15:21:03] let me see if it was amont those [15:22:28] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3778177 (10jcrespo) @aaron the proxy is installed but unconfigured, - we still have... [15:23:13] I think I actually prefer having monitoring checking the virtual console and knowing there is an issue than the opposite [15:23:32] I think there is no coverage of that, as it is so unreliable [15:23:40] right? [15:23:59] yeah, there is only monitoring if the mgmt interface is up/down [15:24:16] this is what I wrote in the IR: • Maybe trying to monitor the idrac in a way that we can check if the log-in can actually happen [15:24:31] with "expect" or something, don't know, just an idea [15:25:32] bblack was proposing having remote power switches, too [15:26:31] these are the remote power switches :P [15:33:40] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3778201 (10Ottomata) [15:40:09] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3108353 (10Nettrom) The data behind [[ https://page-creation.wmflabs.org/#projects=nlwiki,eswiki,plwiki,itwiki,enwiki,jawiki,dewiki,svwiki,ruwik... [15:41:57] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3778229 (10elukey) >>! In T156844#3778225, @Nettrom wrote: > The data behind [[ https://page-creation.wmflabs.org/#projects=nlwiki,eswiki,plwiki... 
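Regarding the incident-review actionable above ("check if the log-in can actually happen ... with expect or something"), here is one possible sketch using pexpect to probe a management (iDRAC/iLO) console over SSH and verify that a login actually reaches a prompt. The credentials and prompt patterns are guesses for illustration; this is just one way such a check could look, not an agreed implementation.

```python
"""Sketch of an expect-style management-console login check: succeed only if
the SSH session gives a password prompt and then a recognisable console
prompt, so a wedged iDRAC/iLO that still answers pings shows up as broken."""
import sys
import pexpect

def mgmt_login_works(host, user="root", password="secret", timeout=15):
    child = pexpect.spawn(f"ssh -o StrictHostKeyChecking=no {user}@{host}",
                          timeout=timeout)
    try:
        i = child.expect(["[Pp]assword:", pexpect.EOF, pexpect.TIMEOUT])
        if i != 0:
            return False                      # no password prompt at all
        child.sendline(password)
        # Prompt patterns are assumptions about DRAC/iLO banners.
        i = child.expect([r"admin1->|racadm|hpiLO->", pexpect.EOF, pexpect.TIMEOUT])
        return i == 0
    finally:
        child.close(force=True)

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "db1063.mgmt.eqiad.wmnet"
    ok = mgmt_login_works(host)
    print("OK" if ok else "CRITICAL: mgmt login failed")
    sys.exit(0 if ok else 2)
```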
[15:47:51] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3778241 (10Nettrom) Nevermind, turns out @mforns has already updated that configuration, should've checked that first. Thanks again for taking c... [16:31:13] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3778434 (10Steinsplitter) 05Open>03Resolved a:03Steinsplitter Done. Thanks :) [16:49:49] disk space on db2085 seems more reasonable now [16:50:09] Yeah [16:50:14] it finished compressing s5 [16:50:17] now I am compressing s3 [16:50:44] I saw [16:51:04] are you compressing everthing, as there is no need for now of the server? [16:51:20] no, not everything just a selection of the biggest tables [16:51:25] ok [16:51:28] like templatelinks, text, revision etc [16:52:27] we should check other hosts and get rid of unused binlogs [16:52:41] yeah, I was planning on doing that too :) [16:52:48] not in a hurry, of course [16:52:59] but they end up adding up 100 or 200 GBs [16:53:43] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3778496 (10Papaul) a:05Papaul>03Marostegui Firmware update complete [16:55:14] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3778499 (10Marostegui) Thanks @Papaul - I will start mysql, let it run for the night and if all goes fine close this. If this breaks again, we can contact the vendor and see how to procee... [19:34:19] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3779106 (10jcrespo) Blocked on getting answers written at T175672#3778177.
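Related to the binlog cleanup mentioned above, a rough sketch of how that kind of purge can be scripted: report how much space a host's binary logs take, then purge the ones older than a retention window. The host, credentials and 7-day window are placeholders; in practice one would first confirm that no replica or backup process still needs the older logs.

```python
"""Sketch of a binary log cleanup: measure binlog usage on a host and run a
server-side PURGE BINARY LOGS for anything older than RETENTION_DAYS."""
import datetime
import pymysql

RETENTION_DAYS = 7   # assumption, not the actual retention policy

def purge_old_binlogs(host):
    conn = pymysql.connect(host=host, user="admin", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW BINARY LOGS")
            total = sum(int(row[1]) for row in cur.fetchall())   # File_size column
            print(f"{host}: {total / 1024**3:.1f} GiB of binary logs")

            cutoff = datetime.datetime.now() - datetime.timedelta(days=RETENTION_DAYS)
            # Server-side purge; the binlog currently in use is never removed.
            cur.execute("PURGE BINARY LOGS BEFORE %s", (cutoff,))
    finally:
        conn.close()

# e.g. purge_old_binlogs("db2085.codfw.wmnet")
```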