[01:45:44] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3699673 (10Legoktm) @Marostegui how would you suggest to move forward then? I saw your comment on T109179#3773638, but not sure wher... [06:16:44] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3776864 (10Marostegui) @Legoktm we can move forward with this task normally. We will just have to do other shards until we move back... [06:47:46] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Patch-For-Review: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3776902 (10Marostegui) [06:50:16] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Patch-For-Review: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3776904 (10Marostegui) 05Open>03Resolved This is all done now [07:26:50] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3776923 (10Marostegui) In which wiki are those 243,627 global edits? [07:27:43] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3776924 (10Marostegui) 05Open>03Resolved [08:11:53] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3776960 (10Marostegui) 05Open>03Resolved These two servers have been fully pooled in s5. [08:15:25] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3776967 (10Marostegui) db1101.s5 is now replicating and catching up [08:15:32] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3776968 (10Marostegui) [08:39:14] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and deploy schema change on dropping oresc_rev_predicted_model index - https://phabricator.wikimedia.org/T180045#3777003 (10Marostegui) I have dropped the index on `enwiki`on db1089... [09:28:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3777092 (10Marostegui) >>! In T180927#3774972, @Papaul wrote: > The ILO is up to date. I need to update the Storage and BIOS on the system but the Service pack disk that i have is old, the... [10:28:40] 10DBA, 10MediaWiki-Database: MySQL field aliases in select() do not use any quoting - https://phabricator.wikimedia.org/T105728#3777268 (10Marostegui) 05Open>03Resolved a:03jcrespo Closing this as it is quite old, there has been no activity and I guess there is not much else to do after Jaime's reply and... [10:34:24] did db2068 alter worked well? [10:34:42] sorry [10:34:47] not db2068 [10:35:12] that is down, pending a firmware upgrade [10:35:32] db2085 [10:39:47] db2085 still on going [10:39:57] I saw the alert, but no errors were reported [10:39:59] no, it is still ongoing, aparently [10:40:09] well, I attended the alert :-) [10:40:18] deleted a bunch of old binary logs [10:40:23] oh [10:40:24] you did? 
[10:40:26] ah [10:40:30] your APP woke you up [10:40:31] :( [10:40:32] 145G [10:40:36] wow [10:40:40] I think it will grow to 250 GB [10:40:46] down from 500GB [10:40:57] it was at 5% [10:41:02] it didn't wake me up [10:41:20] I was awake depooling ores from frwiki and dewiki [10:41:59] what?? [10:43:20] https://gerrit.wikimedia.org/r/#/c/392535/ [10:44:30] :( [10:45:47] 10DBA, 10MediaWiki-Database: MySQL field aliases in select() do not use any quoting - https://phabricator.wikimedia.org/T105728#3777312 (10jcrespo) 05Resolved>03Open This is a real bug, and just it is not a DBA issue, it is a mediawiki-database issue. [10:47:08] I do know know why close tickets that are correct, just not something we are going to work on [10:48:01] 2 years without any activity I assume it is not going to be ever worked on [10:48:07] really? [10:48:13] Specially if he edits the task and says that dbk works fine [10:48:18] But ok, my bad [10:48:36] that is not the task, did you read the summary? [10:49:01] He edited the task to say: dbk works fine [10:49:39] https://phabricator.wikimedia.org/T6715 [10:49:50] from 2006 [10:50:00] you are now doing the schema change to apply it [10:50:38] ok fair [10:50:55] that task is about missing quuoting, which could be a security issue [10:51:10] the particular problem was fixed, but that was not what the bug was about [10:52:01] you are right in that we may not ever work on it- in that case you just remove the DBA tag [10:52:11] but not delete a correct ticket [10:52:21] will do [10:52:38] people get angry if not [10:52:56] aaron will not but $random reporter will [10:54:39] I do believe we have to do a massive clean up on tickets, because it is impossible to know what needs/has to be worked on with almost 300 tickets over there [10:54:42] But oh well [10:56:36] you mean on DBA? [10:56:41] yep [10:56:54] well, that ones has disappeared from it [10:57:37] we have so many non-actionable tickets that are just sitting there as reminders [10:57:48] for example? [10:59:37] https://phabricator.wikimedia.org/T85266 https://phabricator.wikimedia.org/T138562 https://phabricator.wikimedia.org/T119626 [10:59:42] those three for example [10:59:56] good things to keep in mind, but do we use tickets as reminders? [10:59:58] ok, meta epic [11:00:21] precisely go there because they are not immediately actionable [11:00:25] or https://phabricator.wikimedia.org/T146821 [11:00:43] but I do not see what is the problem with them [11:00:52] we are working on backups [11:01:07] and it serves to have a look at actual actionables happening [11:01:26] you do not think we should not work on T119626 ? 
[11:01:28] T119626: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626 [11:01:40] we should, of course I think we should [11:01:53] the fact that we do not have time doesn't mean the problem disappears :-) [11:02:23] But we have that in mind every single day, I just think a ticket isn't the best solution or isn't giving anything extra here [11:02:35] I guess I am having an: shit-we-have-so-much-tickets kind of days [11:02:37] ok, I actually mentioned this before [11:02:39] It will go away, no worries ;) [11:02:43] the tasks vs bugs [11:03:11] but the decision was to keep using phabricator so we didn't have 20 different systems per group [11:05:47] ok T146821 is nonsense [11:05:49] T146821: Spike: Look into transaction isolation level and other tricks for easing db contention - https://phabricator.wikimedia.org/T146821 [11:06:05] but you do not go and just close it [11:06:28] you comment "this makes no sense to us", and then delete the DBA tag [11:06:35] ok [11:07:03] probably with better wording [11:07:27] I think your problem [11:07:49] is that people randomly add us to tickets "to be aware/give advice" [11:08:00] could be, yes [11:08:03] that is what the not dba/external [11:08:05] column is [11:08:13] for [11:08:28] but keeps growing and growing and it is a bit of a pain [11:08:37] you either remove dba or move things there, then don't look at it :-) [11:08:51] it never shrinks and sometimes it is hard to cope with that fact that tasks keeps piling up [11:08:56] ha ha [11:09:16] you have the existential crisis I had when you went onboard [11:09:45] that is anxiety, and it is not good [11:10:13] I cannot help much with that, but I can try [11:10:18] hehe [11:10:23] It will go away, no worries :) [11:10:56] DBA tasks are going down since you onboarded: https://phabricator.wikimedia.org/maniphest/report/burn/?project=PHID-PROJ-hwibeuyzizzy4xzunfsk [11:12:28] yeah, a bit [11:12:46] but still it is overwhelming to see it this size [11:13:06] 2) your job is not closing tasks, is to make mysql being up [11:13:59] it could be worse: 2) your job is not closing tasks, is to make mysql being up [11:14:04] https://phabricator.wikimedia.org/maniphest/report/burn/?project=PHID-PROJ-5hj6ygnanfu23mmnlvmd [11:14:14] haha [11:14:15] yeah [11:16:59] Anyways, I have to go, check my calendar ;-) [11:17:03] will be back in a bit [11:17:05] wish me luck [11:47:37] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3777529 (10jcrespo) @Legoktm it just slowed down the deploy, which means due to the fundraising we are not sure it can be completed... [11:56:58] 10DBA, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3777551 (10jcrespo) Really interesting patch- MySQL 8.0 will introduce SKIP LOCKED and... [11:57:24] hehe # DBA tasks exactly the same as when manuel joined [12:03:29] jynus / marostegui (when you're back): so mediawiki now has support for dynamic configuration, right [12:03:34] not deployed yet I believe but it has been written [12:03:47] do you think we can use that to do slave monitoring and depooling? 
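As an illustration of the dynamic-configuration idea raised above, here is a minimal sketch of a consumer long-polling an etcd key that holds the list of pooled replicas, so that something like a proxy config generator or a monitoring job could react when a server is depooled. It assumes the etcd v2 HTTP API and a made-up key path; it is not how MediaWiki, conftool or pybal actually consume etcd.

```python
"""Minimal sketch (not WMF code): long-poll an etcd v2 key holding the list of
pooled replicas and yield the new list on every change. Endpoint and key layout
below are hypothetical."""
import json
import time
import urllib.request

ETCD = "http://etcd.example.org:2379"             # assumption: an etcd v2 endpoint
KEY = "/v2/keys/dbconfig/s5/pooled-replicas"      # hypothetical key layout

def fetch(url, timeout=70):
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def watch_pooled_replicas():
    """Yield the replica list once now, then again every time the key changes."""
    doc = fetch(ETCD + KEY)
    node = doc["node"]
    yield json.loads(node["value"])
    index = node["modifiedIndex"]
    while True:
        try:
            # etcd v2 long poll: returns once the key changes after waitIndex
            doc = fetch(f"{ETCD}{KEY}?wait=true&waitIndex={index + 1}")
        except OSError:
            time.sleep(1)                         # timed out or hiccup; poll again
            continue
        node = doc["node"]
        index = node["modifiedIndex"]
        yield json.loads(node["value"])

if __name__ == "__main__":
    for replicas in watch_pooled_replicas():
        print("pooled replicas are now:", replicas)
```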
[12:04:12] <_joe_> we need to do wome work on our side [12:04:26] <_joe_> in oder to be able to use that for db-$dc.php [12:04:53] mark: sure, that is one way to solve the issue, I think [12:05:03] <_joe_> well, we are technically able to do it, we're just not happy with the kind of validation we have right now [12:05:15] _joe_: But I thought mediawiki had problems with it, or those got solved? [12:05:27] mediawiki platform team worked on it last quarter [12:05:28] e.g. not respecting TTLs or timeouts? [12:05:46] <_joe_> sorry, bbl [12:06:03] mark: you know if those blockers got solved? [12:06:08] i am not sure [12:06:13] i will have a look at the task [12:06:18] but i don't know what the original blockers were [12:06:24] happened during my parental leave I think [12:06:30] joe and riccardo worked on it [12:06:49] it was going to be used for primary datacenter config, but they decided against [12:07:28] https://phabricator.wikimedia.org/T156924 [12:07:30] yeah [12:07:33] mark: https://phabricator.wikimedia.org/T156924#3269464 [12:07:33] and since it has been worked on [12:07:57] volans: are you aware of the state of issues you run into? [12:08:29] not up to date, no, I just know that some work was done in beta, I can check it [12:08:41] ok, we can ask [12:08:48] I see https://phabricator.wikimedia.org/T156924#3269464 [12:08:57] and that is in a way related to the mediawiki issues [12:09:01] *mysql-mediawiki [12:09:24] although it has some others with pileups that probably would be solved by centralicing monitoring there [12:10:16] mark: another reason why the issues are "new" is that there are now more or more powerful mediawiki application servers [12:11:08] but I think the checking model changed with pt-heartbeat which was needed for muti-dc [12:11:16] right [12:11:44] I mentioned the possiblity of handling connection externally- pybal, haproxy [12:11:50] not because I want it [12:12:06] but as an altenrative to reimplement old stuff on mediawiki itself [12:12:28] there's the (central?) monitoring component [12:12:33] and there's the connection routing component, I guess [12:12:48] how can mediawiki db config in etcd help the lag check? would it be moved outside of mediawiki and do a depool on etcd? [12:12:52] yeah, master-slave split and coordination can be stil be at mediawiki [12:12:58] volans: that's what I was thinking [12:13:26] bit the issue here is with slaves, which is in thory an easier task to acomplish [12:13:44] volans: mark: that is another question [12:13:59] should we mix dynamic config and dynamic state? [12:14:04] fair [12:14:22] in higher layers, etcd is for config but state is still controled by pyball, right? [12:14:27] mark: that requires to write some new stuff that does that in a reliable way, etc... [12:14:32] volans: yes ;) [12:14:46] i assume an already spawned mediawiki is not going to use updated etcd config [12:14:51] so this probably wouldn't solve everything [12:15:04] yes it would [12:15:16] does it get new config during a request? [12:15:19] i have no idea how it works [12:15:26] no, that is another issue with the model [12:15:37] once a request is ongoing, config is not reloaded [12:15:43] it doesn't require restart, not sure in the middle of a reques,t probably not [12:15:49] yeah ok [12:15:52] which was another reason to make the load balancing external [12:16:13] that one is a pain for failovers [12:16:14] but it doesn't create much availability issues [12:16:27] e.g. 
it would make read_only time smaller [12:16:35] what scenario are you talking about now? [12:16:49] sorry, I meant the config not being frequently reloaded [12:16:58] ok [12:17:15] we don't have any db proxies between mediawiki and dbs, right? [12:17:20] no [12:17:46] both Sean and I were thinking of using them at some point [12:18:06] and I mentioned some articles of how github does it [12:18:28] https://githubengineering.com/context-aware-mysql-pools-via-haproxy/ [12:19:10] obviously, we need discussion [12:19:35] yes [12:19:40] I mentioned to aaron and said that proxying (or some kind of monitoring/connection handling) outside would be needed at some point [12:19:45] a proxy solution would help control retries etc [12:19:51] and he sseemed open about it [12:20:31] the thing is, right now I do not know if we have "time" to do a large refactoring [12:20:57] as in, we could patch the largest issues, and then think about new parts [12:21:08] refactoring in mediawiki or outside of it? [12:21:23] it is more of a question- [12:21:44] if we can do a quick patch to solve most of the issues on my ticket [12:22:00] or if it is so difficult to solve all that we should go directly to a different model [12:22:07] it woud probably need a test setup and a lot of analysis [12:22:11] yep [12:22:27] mark: you have a lot of expertise on loadbalancing [12:22:37] pybal/lvs is rather different [12:22:42] I know [12:22:57] but I would like to know your input as the main issue are the non-state ones [12:23:21] but of course, there is statefull things to have into account- possibility of lagging, roles, etc. [12:23:27] yes [12:23:35] so, first step should probably be to gather data? [12:23:38] i don't know what currently exists [12:23:41] logstash errors [12:23:48] do we have grafana dashboards related to this? [12:23:52] lag per server etc [12:23:54] probably we do :) [12:23:57] yes [12:24:25] probably makes sense to figure out what the most common cases are going wrong [12:25:24] I think there are very similar to the ones I linked about (volans asked to test) [12:25:27] and then we have hhvm & php7 in the mix [12:25:28] fun [12:25:31] plus some extra ones [12:25:41] regarding lag checks stampedes [12:26:00] I might miss some historical background here, why was pybal discarded as an option for db load balancing? 
[12:26:00] and more ways in which a mysql can fail [12:26:10] volans: it wasn't discarded [12:26:15] it just is not enough [12:26:16] i'm not sure pybal existed [12:26:28] so mediawiki db load balancing support goes back a long way [12:26:33] ok [12:26:35] i'm sure we had multiple servers back in 2004, or earlier [12:26:38] and pybal started existing in 2006 [12:26:39] as mark said, there is the routing and the logic [12:27:18] the logic, actually, of the load balancer is ok [12:27:25] and yeah, pybal/lvs can't do things like retry on failure [12:27:42] ...or even observe failure on the connection itself [12:27:48] sure [12:27:51] which actually is not a bad thing [12:27:57] I think based on logs [12:28:08] that trying to observe failure there is worse [12:28:20] well [12:28:25] because I think the issue (I have to prove it) [12:28:26] there is monitoring the state of databases [12:28:31] and there's reacting to your queries failing [12:28:42] is if the load balancer fails to detect the failure [12:28:47] mediawiki will of course need to continue doing the latter [12:28:56] each individual thread tries to do something about it [12:29:18] the problem is it retries to failed hosts [12:29:35] because probably it hasn't updated its config [12:29:55] so probably that is the root of pileups/timeouts [12:30:21] then we have the jobqueue, which tries very aggressively to retry queries [12:30:25] that would be "easy" to fix with a proxy solution [12:31:01] yeah, I think based on empirical results [12:31:14] that the main issue is down or degraded servers are not fully depooled [12:31:27] which creates continuously failing queries [12:31:31] which leads to more problems [12:31:55] that would explain why problems go away when we depool failing slaves [12:32:03] yeah [12:32:08] so the risk of "fixing" this in mediawiki [12:32:13] if we're sure that this is the problem, we could have mediawiki reload the config on DB failure once it's on etcd [12:32:14] is that it may behave differently again with php7 [12:32:26] volans: that would be worth investigating I guess [12:32:41] yeah, I think, assuming that hypothesis [12:33:03] that is the question- do we try to patch mediawiki or do a larger architectural change [12:33:16] which, on the other hand, may be needed in the future anyway [12:33:25] for easier orchestration by ops [12:33:53] it is not that easy, though [12:34:15] there is the "monitoring check": is this server up and up to date [12:34:30] but there is another check which is "wait until this server catches up" [12:34:56] e.g. an edit will wait until the slave is up to date to serve data to the user [12:35:09] that makes things complex [12:35:35] is that mediawiki looping over lag queries or something? [12:35:46] while (lag > X) noop; ? [12:36:00] it literally connects to mysql and performs a lock query until the gtid position [12:36:05] ah [12:36:10] so the wait is inside mysql [12:36:19] that was one of the things that piled up last time [12:36:23] and that can give very long lived connections? [12:36:34] in theory they have a timeout [12:36:48] but my (mysql) query killer was killing 60 second waits [12:38:10] that part should be checked on why the timeout didn't work, or if it was very large, etc [12:38:36] this we could test in beta I assume [12:38:37] or why it was receiving new queries after it failed (jobqueue with outdated config)?
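To make the "lock query until gtid position" mentioned above concrete, here is a sketch of the kind of blocking wait a client can issue on a MariaDB replica using MASTER_GTID_WAIT() (MySQL has WAIT_FOR_EXECUTED_GTID_SET() for the same purpose). The host, credentials, GTID value and the use of pymysql are placeholders; this is not MediaWiki's actual LoadBalancer code.

```python
"""Illustration only: a blocking "wait until the replica reaches this GTID
position" call, with a timeout, as described in the conversation above."""
import pymysql

def wait_for_position(replica_host, gtid_pos, timeout=10):
    conn = pymysql.connect(host=replica_host, user="wait_user",
                           password="secret", connect_timeout=3)
    try:
        with conn.cursor() as cur:
            # Blocks inside MariaDB until the replica has applied gtid_pos,
            # or until `timeout` seconds pass (in which case it returns -1).
            cur.execute("SELECT MASTER_GTID_WAIT(%s, %s)", (gtid_pos, timeout))
            (result,) = cur.fetchone()
            return result == 0          # True = replica caught up in time
    finally:
        conn.close()

# e.g. wait_for_position("db1101.eqiad.wmnet", "0-171970580-12345", timeout=10)
```

A wait like this ties up one connection per waiting client for up to the timeout, which is consistent with the 60-second waits the query killer was reported to be killing.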
[12:39:05] beta only has a master and a repliica, but yes, I do not see why it cannot be tested [12:39:15] obviously most of the issues are related to load [12:39:24] anomies problem was [12:39:42] that maybe the problem was because there is a check per server [12:39:56] and those can pileup because so many mediawiki servers [12:40:18] even if there is already some queing on the server itself [12:40:19] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3777679 (10Steinsplitter) >>! In T180946#3776923, @Marostegui wrote: > In which wiki are those 243,627 global edits? https://commons.wikimedia.org/wiki/Special:Centr... [12:40:25] right [12:40:40] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3777680 (10Steinsplitter) [12:40:52] probably introducing some proxies [12:40:55] starting simple [12:40:57] would not be a bad idea [12:41:01] although it creates another spof [12:41:02] wikidata edits are quite high and after all, they dispatch to all wikis, so everthing is possible [12:41:06] :-) [12:41:15] which then needs to be fronted by LVS I guess ;) [12:41:25] mark: I wanted your expertise in general, but it doesn't have to be now, of course [12:42:13] i am going to do some more tech work again soon [12:42:17] but probably not enough to solve this ;-) [12:42:25] I thought about that, but the proxies need to be synced to avoid split brains- although not a big deal to load bvalance reads [12:42:26] but happy to discuss, yes [12:42:36] yes yes [12:42:38] <3 statelss [12:43:02] but strangely, reads is what are failing now [12:44:10] could start simple [12:44:15] with a proxy on each slave [12:44:25] alternatively, we can put a proxy on each mediawiki server [12:44:35] that actually redirects to a designated backup when there are problems [12:44:37] that would also allow for validation of partial deployment [12:44:59] so no need to failover the proxy [12:45:00] that means that each client trying to use that (failing) slave is consistent [12:45:13] but of course, if the entire server with proxy fails, that's back to mediawiki [12:46:33] ok, you you are not load balancing there, your idea is only to failover [12:46:38] yes [12:46:51] i haven't thought this through yet [12:46:53] i'm just brainstorming [12:46:57] I know [12:46:58] so do I [12:47:06] ok, not fully onboard [12:47:14] because the other issue we had with mediawiki [12:47:17] (yes other) [12:47:27] is not behaving well with network drops [12:47:36] yeah [12:47:42] it seems timeouts with hhvm have not been reliable [12:47:42] last time when there was network maintenance [12:47:44] is my impression? [12:47:51] yes, you got what I meant [12:48:10] althought that is something it has to be solved yes or yes [12:48:18] but yeah if we're migrating away... [12:48:19] there will be always something to connect to that can fail [12:48:32] be it a proxy or a server or a middleware [12:48:33] problem with proxies on each mediawiki host is also consistency [12:48:40] I know [12:48:52] *but, for a read only service, it could be nough [12:49:10] the problem would be to load balance the masters [12:49:20] <_joe_> I am just done reading the backlog, I have some comments. 
[12:49:42] <_joe_> so as far as etcd-for-mediawiki goes, it's for dynamic configuration [12:50:01] I do not think is the way to go, I was just proposing it as a way to validate on only some servers [12:50:12] (answering to mark ^) [12:50:15] <_joe_> values are always going to be changed after some human intervention, or from a script [12:50:22] <_joe_> not from the software itself [12:50:51] <_joe_> that requires way more complexity than we've implemented [12:51:03] _joe_: so proxy still required, right? [12:51:04] i should read up on the etcd stuff [12:51:16] <_joe_> jynus: what do you mean? [12:51:19] and dynamic config should write proxy config [12:51:21] <_joe_> or better [12:51:37] not software should write dynamic config, is that what you mean? [12:51:39] proxy config should read etcd ;) [12:51:41] <_joe_> what problem do you want to solve specifically? because I can see many [12:52:00] _joe_: that is the problem, there is not a single patchable problem [12:52:09] <_joe_> one is "pooling/depooling a db requires a sofware deploy" [12:52:09] i propose we start investigating what the most common problems are [12:52:11] I think the current loadbalancer model is bad [12:52:27] try to isolate them from eachother, and then see what (simple) solutions we can come up with [12:52:31] with not a ton of work hopefully [12:52:33] <_joe_> another is "if one slave fails, we have an immediate degradation of service" [12:52:39] for now, I only want to solve one problem- which is the inestability that happens whtn a single replica has problems [12:52:46] trying to get this right from the start is probably unrealistic atm [12:52:58] now, maybe it makes no sense to do a small patch [12:53:05] <_joe_> jynus: so the latter problem I stated? [12:53:17] and do other things like migrating to dynamic config at the same time [12:53:47] deploy to change config is bad, but not urgent to be fixed [12:53:59] it caused delays, extended downtimes [12:54:12] but it does not causes outages, right? [12:54:27] _joe_: yes [12:54:36] i would agree that we should first make mediawiki behaviour to failing slaves more resilient [12:54:39] however, they are not completely isolated problems [12:54:45] and then automate/optimize config changes (including pool/depool) [12:54:54] <_joe_> the way I see it, that problem is not impossible to solve. 
My vision for that would be something like: a load-balancer / proxy that can depool servers that have a high lag temporarily, with some protection against herd effects [12:54:55] the latter is in scope for the multi-dc program though [12:55:20] let's say we setup a proxy- the proxy has to be controlled by something (even if that hasn't be there from the first day) [12:56:03] <_joe_> jynus: the proxy *configuration* as in *the list of pooled servers* is going to etcd [12:56:17] _joe_: your vision is exactly what I sent on the link: [12:56:23] <_joe_> but the actual state of each backend, such as its current lag or ability to accept connections [12:56:29] https://githubengineering.com/context-aware-mysql-pools-via-haproxy/ [12:56:33] <_joe_> it should be managed by the proxy [12:56:43] if pool goes before < n [12:57:00] it goes to a different proxy pool, allowing degraded replicas [12:57:34] simiarly, but not exacly what LVS does [12:57:37] <_joe_> I'm not sure I understand what you just said, I'll read the article [12:58:04] i will read it too [12:58:07] * volans lunch, bbiab [12:58:14] anyway, we cannot just say that a magical proxy will solve all problems [12:58:23] <_joe_> of course not [12:58:24] i would be happy with solving some problems ;) [12:58:27] ha ha [12:58:33] <_joe_> it can solve some, but it will create some more [12:58:40] I would take one of the test mysql hosts [12:58:43] the funny bit [12:58:50] <_joe_> like, how do we guarantee waitforslave() has any meaning? [12:58:51] is that we would probably have more control over the proxy (haproxy or other) [12:58:52] as ops [12:58:57] <_joe_> in mw context [12:58:59] which is sad considering we write mediawiki in house [12:59:01] _joe_: that was my question before [12:59:34] load balancer does more things on top of failing over [12:59:42] does replication control and other stuff [13:00:04] on the other side, it is the connection handling and pileups and timeuts that struggles with [13:00:20] it doesn't help HHVM/PHP has "bugs" [13:00:23] just migrate to php7 already [13:00:29] :) [13:00:32] let's test is already [13:00:48] did you had feedback that it made some things better? [13:00:55] or arou you joking? [13:01:11] i did not have any feedback [13:01:13] ok [13:01:16] i am partly joking [13:01:25] but yeah, that behaviour can/will change with that migration [13:01:27] I would like to have some feedback from mediawiki [13:01:32] developers [13:01:38] all two of them [13:01:42] I think we ops have a partial vision [13:01:47] and so do they [13:02:19] to wrap up my part of the conversation [13:02:37] <_joe_> so [13:02:42] <_joe_> I jut read the article [13:02:43] this is bad because a single server out of 100 lags or goes slow or gets down [13:02:48] <_joe_> I think it's a very good idea [13:02:52] _joe_: don't take it literally [13:02:59] but it is an example of how to go [13:03:10] <_joe_> jynus: well actually most of it is really appliable to our situation [13:03:30] <_joe_> jynus: and the main idea they had, ofc is that of having the "good"/"lagged" pools [13:03:33] goal for next quarter? 
;-) [13:03:55] _joe_: I think you hated my idea of putting proxies on your mediawiki servers :-) [13:04:06] <_joe_> in that scenario, waitforslave() will DTRT in most cases, but I still have to consider the edge cases [13:04:45] other thing I saw [13:04:53] <_joe_> jynus: I still kinda do, but if we create a local http microservice that responds to the proxies and they don't all poll mysql continuously, it's better [13:04:53] now sure how was doing that [13:05:06] *who was doind that [13:05:12] <_joe_> the main problem is we will have to coordinate like 100s of haproxies [13:05:14] but some pileupes where solved [13:05:26] by creating a centralized wait system [13:05:50] so threads will announce the position they were waiting for and subscribed to a service [13:05:58] s/poolcounter/waiter/ [13:06:03] yes [13:06:06] but for databases [13:06:23] actually exactly that [13:06:33] but applied to replication contrl [13:06:39] yeah [13:07:05] and that is independent of failover/monitoring [13:07:20] can do both [13:07:32] <_joe_> btw, the way github implemented the whole thing is a bit raw, but the idea is not bad after all [13:07:35] _joe_: I talked to proxy sql developer [13:07:47] because of syncronizing setup [13:08:00] there is a libary for syncronized quorum [13:08:04] <_joe_> jynus: yeah I'd avoid that. [13:08:10] and he was going to implement it [13:08:15] ok [13:08:35] <_joe_> I would very much prefer to limit the shared state to a bare minimum [13:08:36] as I said, syncronicity is not a big deal for reads [13:08:42] <_joe_> yes [13:08:59] and we are not thinking for now to take over master issues [13:09:08] so that would be, in a way, a non-problem [13:09:52] if a single proxy goes bad and a single mediawiki /half the fleet is pointing to a wrong server because a glitch is not that bad [13:10:20] right now it takes a minute to deploy mysql changes [13:11:01] masters would probably a more centralized/redundant setup [13:11:09] when/if it happens [13:11:42] other people do hierarchies of proxies (not sure if we have such a large fleet to need that) [13:12:08] anyway, thanks for your input, going to lunch [13:12:16] <_joe_> one problem i see with haproxy is bandwidth. we'd need a pool of haproxies, hierarchies help with that too [13:12:32] bandwidth? really? 
[13:12:55] probably he means bandwidth not literally, asin conenction throughput [13:13:20] not bandwidth, mysql traffic and its checks would be really small [13:13:57] <_joe_> mysql traffic to the appservers is small indeed, I never really checked [13:14:21] yup [13:14:21] a proxy would in the future help with both TLS, connection pooling and other stuff on the bright side [13:14:28] <_joe_> I assumed it would be around 1 gb/s :) [13:14:32] nah [13:14:50] it only goes crazy with some particular kind of queries I will not mention publicly [13:14:56] <_joe_> yes jynus that's the kind of thing that makes me prefer a proxy to using pybal [13:15:15] <_joe_> be it haproxy or proxysql [13:15:41] i agree [13:15:55] pybal/lvs is great, but not for this [13:16:56] 14:14:27 <_joe_> I assumed it would be around 1 gb/s :) [13:17:00] what is the problem then :) [13:18:27] brb [13:29:02] * volans back and read the article [13:29:45] actually I now remember to have read it when it came out, it's part of my RSS feeds, and I din't like much the approach [13:30:11] in particular: [13:31:03] - no mention of an effective protection if the HTTP status server is not responding, seems that HA Proxy will effecively depool all of them (in this case I prefer Pybal approach) [13:31:57] - the status server checks mysql locally, good for lag and other stuff, might have false positive/negative for remote connections (routing, firewall, etc.) [13:32:44] <_joe_> volans: I think it was quite clear in the article that the "lagged" pool doesn't depool servers unless they respond 404 [13:33:13] <_joe_> I'm not sure what you mean with the second part [13:33:20] option httpchk GET /ignore-lag [13:33:26] they don't depool lagged servers [13:33:42] if the httpcheck fails (no reply, timeout) I think it will depool [13:34:27] I mean that if you deploy a firewall rule that blocks 3306 but not the port of the status server on the mysql host, it will still be pooled and traffic sent to it [13:34:33] while mysql is not reachable [13:34:58] <_joe_> so you suggest both ports should be checked [13:35:00] I don't recall how flexible is HAproxy with multiple checks in AND/OR [13:35:09] <_joe_> me neither [13:35:33] <_joe_> but I don't think we'd use their approach [13:35:57] <_joe_> btw, the reason why they use xinetd (which I hate) is exactly to solve the problem you were mentioning, I think [13:36:07] <_joe_> crashing xinetd is quite hard [13:37:29] sure, you can make the code that xinetd run fail though :D [14:02:39] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3777924 (10Marostegui) Thanks! When do you want to this? [14:08:10] is xinetd still a thing with systemd? [14:15:46] I didn't bring down db2068 because papaul asked me not to [14:16:11] as in, he would do it himself, not that it is a big deal [14:17:05] ah, well, i guess it is fine :) [14:24:19] marostegui: as you see above, we had a conversation and we have solved all wikimedia problems already [14:24:44] yeah, I was getting a tea and reading it XD [14:25:09] the bad news is that as there are no longer bugs or tasks to do, we have to fire you [14:30:40] marostegui: TL;DR: 42 (that's the answer) [14:32:54] I am going to install the proxies on mysql::clients [14:33:00] mariadb::clients [14:33:19] not only they will work for performance, we can also use them to test them ourselves [15:13:32] btw [15:13:35] the out of band on db1063 [15:13:37] has that been fixed? 
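As a purely illustrative take on the health endpoint discussed above (the GitHub article serves one via xinetd, and a local HTTP microservice was suggested as an alternative), the sketch below answers an HAProxy `option httpchk` probe: 200 when the local replica is reachable and within a lag threshold, 503 otherwise, plus an /ignore-lag path that only fails on hard errors. The port, threshold, credentials and the use of pymysql with SHOW SLAVE STATUS are assumptions, not the production design.

```python
"""Sketch of a local replica health endpoint for an HTTP-based proxy check.
Healthy = MySQL reachable and lag under MAX_LAG; /ignore-lag only goes
unhealthy when MySQL itself is unreachable or replication is broken."""
from http.server import BaseHTTPRequestHandler, HTTPServer
import pymysql

MAX_LAG = 30        # seconds; hypothetical threshold
CHECK_PORT = 9201   # hypothetical port the proxy polls

def replica_lag():
    """Return Seconds_Behind_Master, or None if MySQL/replication is broken."""
    try:
        conn = pymysql.connect(host="127.0.0.1", user="check",
                               password="secret", connect_timeout=2)
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
                return row and row.get("Seconds_Behind_Master")
        finally:
            conn.close()
    except pymysql.MySQLError:
        return None

class Check(BaseHTTPRequestHandler):
    def do_GET(self):
        lag = replica_lag()
        healthy = lag is not None and (self.path == "/ignore-lag" or lag <= MAX_LAG)
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(("lag=%s\n" % lag).encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", CHECK_PORT), Check).serve_forever()
```

Two HAProxy backends polling the plain path and the /ignore-lag path would roughly reproduce the article's "good" versus "lagged" pool split; it would not by itself address the remote-reachability concern raised above, since the check talks to MySQL over localhost.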
[15:13:41] any idea why it was broken? [15:13:55] that was fixed with the reboot I believe [15:14:43] We don't know why it got broken, I believe akosiaris had this nightmare about….what if all the ssh session were taken of the idrac monitoring? [15:14:51] But we don't really what happened with it [15:16:39] we have the "logical reason", as in, sda disappeared [15:17:05] not the real reason, but the most common cause is RAID failure in other hosts [15:17:11] ? [15:17:20] what does RAID have to do with .mgmt ? [15:17:29] oh, sory [15:17:39] ignore my comments [15:17:49] anyway, no I don't think we 've followed up on it [15:17:53] talking about other stuff [15:18:00] let's make that an actionable [15:18:17] in any case, we do not plan of pool it as master any time soon [15:19:25] I can connect now to it [15:19:34] yeah [15:19:39] I will check now db1070 and db1071 [15:19:41] but it's a bit worrying that ilo is not reliable [15:19:46] could happen on other hosts too [15:19:54] real and failover master [15:20:00] if monitoring makes it actually less reliable... hehe [15:20:13] in any case, the ILO slowed us down a bit in confusion [15:20:22] but it was not that critical [15:20:34] mark: i would be surprised if that was the case, I guess we should have seen it before, but who knows [15:20:37] ah, so you think monitoring could cause issues? [15:20:45] i wouldn't be suprised by that [15:20:52] those embedded systems often have small limits [15:20:55] there was a list of hosts with ILO problems [15:21:03] let me see if it was amont those [15:22:28] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3778177 (10jcrespo) @aaron the proxy is installed but unconfigured, - we still have... [15:23:13] I think I actually prefer having monitoring checking the virtual console and knowing there is an issue than the opposite [15:23:32] I think there is no coverage of that, as it is so unreliable [15:23:40] right? [15:23:59] yeah, there is only monitoring if the mgmt interface is up/down [15:24:16] this is what I wrote in the IR: • Maybe trying to monitor the idrac in a way that we can check if the log-in can actually happen [15:24:31] with "expect" or something, don't know, just an idea [15:25:32] bblack was proposing having remote power switches, too [15:26:31] these are the remote power switches :P [15:33:40] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3778201 (10Ottomata) [15:40:09] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3108353 (10Nettrom) The data behind [[ https://page-creation.wmflabs.org/#projects=nlwiki,eswiki,plwiki,itwiki,enwiki,jawiki,dewiki,svwiki,ruwik... [15:41:57] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3778229 (10elukey) >>! In T156844#3778225, @Nettrom wrote: > The data behind [[ https://page-creation.wmflabs.org/#projects=nlwiki,eswiki,plwiki... 
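Regarding the incident-review actionable above ("check if the log-in can actually happen ... with expect or something"), here is one possible sketch using pexpect to probe a management (iDRAC/iLO) console over SSH and verify that a login actually reaches a prompt. The credentials and prompt patterns are guesses for illustration; this is just one way such a check could look, not an agreed implementation.

```python
"""Sketch of an expect-style management-console login check: succeed only if
the SSH session gives a password prompt and then a recognisable console
prompt, so a wedged iDRAC/iLO that still answers pings shows up as broken."""
import sys
import pexpect

def mgmt_login_works(host, user="root", password="secret", timeout=15):
    child = pexpect.spawn(f"ssh -o StrictHostKeyChecking=no {user}@{host}",
                          timeout=timeout)
    try:
        i = child.expect(["[Pp]assword:", pexpect.EOF, pexpect.TIMEOUT])
        if i != 0:
            return False                      # no password prompt at all
        child.sendline(password)
        # Prompt patterns are assumptions about DRAC/iLO banners.
        i = child.expect([r"admin1->|racadm|hpiLO->", pexpect.EOF, pexpect.TIMEOUT])
        return i == 0
    finally:
        child.close(force=True)

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "db1063.mgmt.eqiad.wmnet"
    ok = mgmt_login_works(host)
    print("OK" if ok else "CRITICAL: mgmt login failed")
    sys.exit(0 if ok else 2)
```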
[15:47:51] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3778241 (10Nettrom) Nevermind, turns out @mforns has already updated that configuration, should've checked that first. Thanks again for taking c... [16:31:13] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3778434 (10Steinsplitter) 05Open>03Resolved a:03Steinsplitter Done. Thanks :) [16:49:49] disk space on db2085 seems more reasonable now [16:50:09] Yeah [16:50:14] it finished compressing s5 [16:50:17] now I am compressing s3 [16:50:44] I saw [16:51:04] are you compressing everthing, as there is no need for now of the server? [16:51:20] no, not everything just a selection of the biggest tables [16:51:25] ok [16:51:28] like templatelinks, text, revision etc [16:52:27] we should check other hosts and get rid of unused binlogs [16:52:41] yeah, I was planning on doing that too :) [16:52:48] not in a hurry, of course [16:52:59] but they end up adding up 100 or 200 GBs [16:53:43] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3778496 (10Papaul) a:05Papaul>03Marostegui Firmware update complete [16:55:14] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3778499 (10Marostegui) Thanks @Papaul - I will start mysql, let it run for the night and if all goes fine close this. If this breaks again, we can contact the vendor and see how to procee... [19:34:19] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3779106 (10jcrespo) Blocked on getting answers written at T175672#3778177.
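Related to the binlog cleanup mentioned above, a rough sketch of how that kind of purge can be scripted: report how much space a host's binary logs take, then purge the ones older than a retention window. The host, credentials and 7-day window are placeholders; in practice one would first confirm that no replica or backup process still needs the older logs.

```python
"""Sketch of a binary log cleanup: measure binlog usage on a host and run a
server-side PURGE BINARY LOGS for anything older than RETENTION_DAYS."""
import datetime
import pymysql

RETENTION_DAYS = 7   # assumption, not the actual retention policy

def purge_old_binlogs(host):
    conn = pymysql.connect(host=host, user="admin", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW BINARY LOGS")
            total = sum(int(row[1]) for row in cur.fetchall())   # File_size column
            print(f"{host}: {total / 1024**3:.1f} GiB of binary logs")

            cutoff = datetime.datetime.now() - datetime.timedelta(days=RETENTION_DAYS)
            # Server-side purge; the binlog currently in use is never removed.
            cur.execute("PURGE BINARY LOGS BEFORE %s", (cutoff,))
    finally:
        conn.close()

# e.g. purge_old_binlogs("db2085.codfw.wmnet")
```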