[05:26:37] 10DBA: Differing database server ips and server_id numbers - https://phabricator.wikimedia.org/T195598#4237712 (10Marostegui) p:05Triage>03Normal db2045 has been restarted. Before: ``` +------------+ | @@hostname | +------------+ | db2045 | +------------+ +-------------+ | @@server_id | +-------------+... [05:27:24] 10DBA: Differing database server ips and server_id numbers - https://phabricator.wikimedia.org/T195598#4237714 (10Marostegui) [05:31:26] 10DBA, 10MediaWiki-Database: Create Mediawiki DB abstraction for individual query timeouts - https://phabricator.wikimedia.org/T195792#4237149 (10Marostegui) I would love to see this being solved from the code side, because, as I said, I think it is the correct place to get it solved. The query killer is a go... [05:32:58] 10DBA, 10Wikimedia-Site-requests: Global rename of Horcrux92 → Horcrux: supervision needed - https://phabricator.wikimedia.org/T195661#4237718 (10Marostegui) You can proceed with this tomorrow (Wednesday) after 8:00AM UTC. We have some important maintenance going on today (Tuesday) [05:33:19] 10DBA, 10Wikimedia-Site-requests: Global rename of Horcrux92 → Horcrux: supervision needed - https://phabricator.wikimedia.org/T195661#4237719 (10Marostegui) p:05Triage>03Normal [05:37:16] 10DBA, 10MediaWiki-Database, 10MediaWiki-Platform-Team: Create Mediawiki DB abstraction for individual query timeouts - https://phabricator.wikimedia.org/T195792#4237731 (10Legoktm) [05:43:33] 10DBA, 10MediaWiki-Database, 10MediaWiki-Platform-Team: Create Mediawiki DB abstraction for individual query timeouts - https://phabricator.wikimedia.org/T195792#4237149 (10Legoktm) I haven't looked at the `TransactionProfiler` class to see how it's implemented, but in MediaWiki core we currently have a defa...
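The per-query timeout discussed for T195792 maps onto MariaDB 10.1's real `SET STATEMENT max_statement_time=N FOR <query>` syntax. A minimal sketch of how an abstraction layer could apply it, assuming a hypothetical helper name and a plain version string (this is not MediaWiki's actual API):

```python
def with_query_timeout(sql, timeout_seconds, server_version="10.1.33-MariaDB"):
    """Prefix a read query with MariaDB's per-statement timeout when supported.

    MariaDB >= 10.1 understands `SET STATEMENT max_statement_time=N FOR <query>`;
    older servers would reject that syntax, so we fall back to the bare query.
    The version string format mirrors what the server reports (hypothetical
    example value used as the default).
    """
    major, minor = (int(x) for x in server_version.split(".")[:2])
    if (major, minor) >= (10, 1):
        return "SET STATEMENT max_statement_time=%d FOR %s" % (timeout_seconds, sql)
    return sql
```

This keeps the timeout per-statement instead of relying on a connection-wide query killer, which is the trade-off being discussed on the task.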
[05:50:50] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4237736 (10Marostegui) I am planning to get to this schema change in a few days @Reedy can you confirm that the alters to be run are the ones on: https://gerrit.wikim... [05:52:25] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata-Ministry-Of-Magic: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193#4219166 (10Marostegui) I will start with this task in the next few days [05:57:13] 10DBA, 10Multi-Content-Revisions, 10Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4237742 (10Marostegui) @Anomie no blockers from your side to get this task started, right? [05:57:40] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4237743 (10Marostegui) p:05Triage>03Normal @Anomie no blockers from your side to get this task started, right? [08:11:15] should I start moving the replicas around? [08:35:26] I have moved db1113:3316, no errors, ok to continue? 
[08:35:50] +1 [08:36:17] I will merge the patch to depool everything in 25 minutes [08:36:42] So we can see if there are any issues with all the hosts depooled for an hour, before starting the maintenance [08:36:45] yeah, that can be done later [08:36:55] I wanted to do this in advance [08:37:03] because there is always some issue [08:37:03] yeah yeah [08:37:07] agreed [08:43:45] I am going to disable semisync on s6 master [08:43:58] otherwise, when it ends up with 1 replica, it may create issues [08:44:00] ok [08:44:11] or [08:44:23] I can leave it on and not failover all servers [08:45:03] I don't think there is a big problem disabling semisync for a few hours [08:51:00] 10DBA, 10MediaWiki-Database, 10MediaWiki-Platform-Team: Create Mediawiki DB abstraction for individual query timeouts - https://phabricator.wikimedia.org/T195792#4238057 (10Addshore) >>! In T195792#4237259, @jcrespo wrote: > The migration to MariaDB 10.1 is not complete, it takes 6 months to a year to prepar... [08:51:48] How far through the upgrade to mariadb 10.1 is s8? [08:52:52] addshore: https://dbtree.wikimedia.org/ look for the s8 shards, you can see the versions there [08:53:03] thanks! [08:53:18] it is not far, only a few hosts pending, including the master though [08:53:45] okay, well, i guess the master doesn't actually matter for limiting selects execution time [08:54:16] I'll have to check if the sql abstraction can vary depending on mariadb version, if so I might push for someone writing this query timeout abstraction [08:54:29] addshore: correct (as long as long selects aren't sent to the master, which shouldn't be the case) [08:54:59] indeed, well, the abstraction would have to deal with different versions anyway [08:55:22] I'm 90% sure the sql server will send its version information to php while setting up the connection...
will have to check [09:07:50] all hosts in row C have been depooled, I am monitoring the errors [09:08:11] when things are stable I will take a short break [09:08:33] :) [09:08:46] I wouldn't be surprised if we have to do some adjustments [09:09:10] yeah, that is why I wanted to depool them with some time between the depool and the maintenance start [09:13:23] I am making some changes to the etherpad, you can give them a check [09:13:59] maybe doing it directly to the gdoc? [09:14:16] Or just make sure they are sync'ed :) [09:14:46] I will recopy it after we agree [09:15:01] great, thank you! [09:15:01] let's look at etherpad only unless it crashes [09:17:15] you can give it a look while I keep looking at the replicas [09:17:54] yeah, I am reviewing it [09:18:04] We can speed it up by disabling puppet in advance on db1061 and merging [09:18:08] but that's a minor thing [09:20:47] I don't care much, but I wouldn't do it because we don't know the decision between failover or not? [09:20:55] or do you mean something else? [09:22:38] no, exactly [09:22:42] that is why I didn't write it [09:22:50] because we don't know if we will be able to perform it [09:23:07] I think everything is looking good, you should go and take a break [09:23:50] I think they are planning more for a 1 am than 12 am time, probably anyway [09:25:40] yeah, for the s6 primary master that is what he said, around 11 UTC [10:14:29] innodb_buffer_pool_dump_at_shutdown takes less than 1 second [10:14:41] but it is ok, as long as you do a manual dump now [10:14:46] I already ran it [10:14:51] ok :-) [10:17:29] do you want to lead this time? [10:17:41] I can take care of checks and general health?
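As noted above, the server reports its version during the connection handshake, so a query-timeout abstraction can branch on it. A sketch of parsing that string (the `5.5.5-` compatibility prefix that MariaDB 10+ sends over the wire is real; the function name is hypothetical):

```python
import re

def mariadb_version(server_version):
    """Extract (major, minor, patch) from a handshake version string.

    MariaDB reports itself as e.g. '10.1.33-MariaDB'; when replication
    compatibility is in play, MariaDB 10+ prepends '5.5.5-' to the string
    it sends clients, so that prefix is stripped first.
    """
    v = server_version
    if v.startswith("5.5.5-"):
        v = v[len("5.5.5-"):]
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", v)
    if not m:
        raise ValueError("unrecognized version string: %r" % server_version)
    return tuple(int(g) for g in m.groups())
```

With this, "does the server support `max_statement_time`" becomes a simple tuple comparison against `(10, 1, 0)`.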
[10:19:44] sure [11:14:43] 10DBA, 10Operations, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4238457 (10Marostegui) This restart has been done, current values: ``` +------------+ | @@hostname | +------------+ | db1061 | +------------+... [11:15:52] 10DBA, 10Operations, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4238458 (10Marostegui) We can now revert: https://gerrit.wikimedia.org/r/#/c/435182/ [11:24:56] 10DBA: Differing database server ips and server_id numbers - https://phabricator.wikimedia.org/T195598#4238490 (10Marostegui) [11:25:01] 10DBA, 10Operations, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4238487 (10Marostegui) 05Open>03Resolved a:03Marostegui This is all done, including restarting db1125:s6 to pick up the new server_id. [11:25:40] 10DBA: Differing database server ips and server_id numbers - https://phabricator.wikimedia.org/T195598#4231846 (10Marostegui) 05Open>03Resolved a:03Marostegui The other host for this task, db1061 has been restarted. So the scope is done! [11:37:02] QPS on s6 are higher [11:37:20] is it the new version? human-driven? something else? [11:38:05] semi sync is back on btw [11:38:16] I actually didn't disable it [11:38:21] aah ok ok [11:40:12] We have more disk reads on the master, I guess the buffer pool being cold [11:43:48] I wonder when s4 codfw will be able to catch up... [11:44:21] when writes stop happening? [11:44:33] which given the size of that table may take a month [11:44:41] and wait until it reaches enwiki [11:45:14] marostegui: re load, yeah, I expect more disk reads/writes [11:45:28] but I wouldn't expect extended logical reads like now [11:47:09] there is also a huge increase on updates [11:47:10] jobqueue?
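The server_id mismatches tracked in T195598 are about ids not matching the host's address. One common convention (assumed here for illustration; not necessarily the exact WMF scheme) derives server_id from the host's IPv4 address, which keeps ids unique as long as the addresses are:

```python
import ipaddress

def server_id_from_ip(ipv4):
    """Derive a numeric server_id from a host's IPv4 address.

    Using the address's 32-bit integer value guarantees a unique,
    deterministic server_id per host, so a restarted server picks up
    the same value the config derives for it (illustrative convention).
    """
    return int(ipaddress.IPv4Address(ipv4))
```

A scheme like this is why a host with a stale live server_id needs a restart: the configured value is a pure function of the IP, so only the running process can be out of date.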
[11:50:39] probably [11:50:58] maybe it is the maintenance job? [11:51:05] not job? [11:51:07] could be [11:51:33] selects are a bit higher too [11:52:04] 450 -> 950/s [11:52:06] they are now decreasing [11:52:18] (the updates) [11:52:21] oh, I just saw that [11:54:06] it is now going back to normal values on everything pretty much [12:45:56] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238729 (10Reedy) >>! In T89737#4237736, @Marostegui wrote: > I am planning to get to this schema change in a few days > @Reedy can you confirm that the alters to be... [12:49:26] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238744 (10Marostegui) >>! In T89737#4238729, @Reedy wrote: >>>! In T89737#4237736, @Marostegui wrote: >> I am planning to get to this schema change in a few days >>... [12:59:04] 10DBA, 10Data-Services, 10Quarry: Cannot reliably get the EXPLAIN for a query on analytics wiki replica cluster - https://phabricator.wikimedia.org/T195836#4238772 (10zhuyifei1999) [13:04:03] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238793 (10Reedy) Interesting. The list of s3 wikis that it exists on would look like it's the wikis that have been created since the feature was added... Seemingly... [13:05:58] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238794 (10jcrespo) Other features -whether it is reasonable or not, it is a different question- store private data on a less common part of the infrastructure. Just...
[13:07:23] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238795 (10Reedy) Bot Passwords are disabled on private wikis et al ```lang=php 'wmgEnableBotPasswords' => [ 'default' => true, 'private' => false, 'fishbowl' =>... [13:08:09] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238796 (10jcrespo) > I completely missed that it's actually a global feature May I ask if it is part of core or if it depends on a particular extension, which one (... [13:08:42] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238797 (10Reedy) >>! In T89737#4238796, @jcrespo wrote: >> I completely missed that it's actually a global feature > > May I ask if it is part of core or if it depe... [13:10:43] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238804 (10jcrespo) > It's in core https://www.mediawiki.org/wiki/Manual:Bot_passwords So, this is my proposal, based on your feedback- create it empty on all wikis,... [13:24:30] <_joe_> jynus: what was the ticket about cross-dc connection pooling? [13:24:36] <_joe_> I can't find it anymore [13:24:49] <_joe_> phab search is suboptimal :) [13:27:49] there is https://phabricator.wikimedia.org/T134809 [13:28:03] and https://phabricator.wikimedia.org/T171071 [13:29:50] <_joe_> jynus: I don't see in any of those tickets a plan for connection pooling; or better, you mentioned at some point you won't be using proxysql [13:29:57] <_joe_> is that still the plan?
[13:30:27] I just mentioned it on the meeting [13:30:53] I was going to install proxysql 1.X on all masters [13:31:07] and tunnel it to the equivalent on the other DC [13:31:12] <_joe_> just one detail I'm not sure about. How we're going to manage this in mediawiki-config [13:31:20] <_joe_> and how that will affect the switchover [13:31:34] well, for once, I asked tim and he said he will take care [13:31:36] <_joe_> I guess we can weasel our way around it [13:31:54] from my point of view, it may not need any change [13:32:15] <_joe_> jynus: do you want to manage the switchover from proxysql? [13:32:24] just if the active dc is different from the current dc, connect to a separate service [13:32:31] on the same host [13:32:34] <_joe_> oh ok [13:32:37] <_joe_> so a different port [13:32:39] <_joe_> that's it [13:32:42] something like that [13:32:43] <_joe_> and with rls [13:32:47] <_joe_> *tls [13:32:52] <_joe_> ok, seems sensible to me [13:32:52] details said that platform will take care [13:32:57] I mount the platform [13:33:22] and unless I get negative feedback that is not ok, send it upwards :-) [13:33:43] because if not, we get stalled on the other people to suggest something :-) [13:33:54] I also said this is not ideal [13:33:59] for long term [13:34:07] but masters and configuration is not ideal now [13:34:10] <_joe_> yeah but it's good enough given the amount of writes we're talking about [13:34:15] exactly [13:34:19] and once etcd is there [13:34:20] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238957 (10Marostegui) >>! In T89737#4238744, @Marostegui wrote: >>>! In T89737#4238729, @Reedy wrote: >>>>! In T89737#4237736, @Marostegui wrote: >>> I am planning t...
[13:34:20] <_joe_> I'm more worried about the sessions store [13:34:40] and maybe a better failover system [13:34:42] we can iterate [13:34:49] <_joe_> jynus: if -as it appears more probable by the day - I won't be in prague, I'll work on dbs-on-etcd all of that week [13:34:58] :-( [13:35:19] can I ask a question? [13:35:42] because I am told all the time tunneling is bad, and I get the surface [13:35:47] reasons [13:35:52] but I don't know the details [13:35:58] 10Blocked-on-schema-change, 10DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4238964 (10Reedy) ^ for s3 Then obviously still metawiki And for the rest of the wikis... ```lang=sql CREATE TABLE IF NOT EXISTS bot_passwords ( bp_user int unsig... [13:35:59] do you know something about that? [13:36:12] eg. what are the exact pains on varnish? [13:36:19] or is it just the SPOF nature [13:36:25] <_joe_> what do you mean by tunneling? [13:36:28] or is it unreliable for other reasons? [13:36:29] <_joe_> stunnel and such? [13:36:47] I don't even know the details of how varnish communicates [13:36:58] <_joe_> you mean tls from varnish to the backend?
[13:37:12] <_joe_> yeah without a native implementation it's pretty hard to get the tunneling right and reliable [13:37:23] honestly, I don't know, I've just been told it is bad [13:37:29] and I want to understand why [13:37:48] so as the usage here is different, try to avoid issues as much as possible [13:37:51] <_joe_> yes, you make all of your connections to all of your backends depend on a small piece of software that works at l3 [13:37:55] <_joe_> not even l4 [13:38:04] <_joe_> you have an application proxy, at l7 [13:38:23] yes, the point is to make it do pooling and TLS at the same time [13:38:38] but 1.2 TLS support is not yet stable [13:38:51] so the long term approach is that, but it is not doable right now [13:39:07] but I want to know if it is just that [13:39:12] or something more practical [13:39:20] that is the part I knew [13:39:40] like "varnish X feature fails all the time" [13:39:49] we had issues when setting up Y [13:40:17] I am not negating those, I want to know to avoid them, rather than the general stuff which I know is not ideal [13:40:44] <_joe_> so practically if we're talking about the volumes of requests varnish does to the backend [13:40:52] <_joe_> I'm pretty sure stunnel is a performance killer [13:40:57] ok [13:41:10] <_joe_> would probably raise the latencies by several milliseconds [13:41:12] but maybe for the traffic we plan to support [13:41:18] <_joe_> for mysql? [13:41:20] it is not such a big deal? [13:41:28] GETs that do state changes? [13:41:31] <_joe_> I think it wouldn't be a huge deal myself, nope [13:41:39] and as I said [13:41:41] <_joe_> but it has no pooling of connections, ofc [13:41:47] exactly [13:41:53] that is the point of proxysql [13:41:53] <_joe_> so latencies [13:41:55] and in the future [13:42:06] getting rid of the non-L7 tunneling [13:42:14] <_joe_> proxysql 2.0 doesn't support tls 1.2?
[13:42:18] (hopefully months away) [13:42:21] it does [13:42:26] <_joe_> or you don't trust it for being still beta [13:42:32] yes [13:42:35] the second [13:42:36] <_joe_> ok, I see :) [13:42:48] <_joe_> so you prefer something "stable" like stunnel [13:42:58] but for a short period of time [13:43:04] <_joe_> well, if that's what we want to do, uhm [13:43:04] not sure about stunnel [13:43:10] <_joe_> stunnel/whatever [13:43:17] whatever you /traffic recommends [13:43:25] <_joe_> "a thing that does tunneling of TLS" [13:43:28] you as in, app people, if you get what you mean [13:43:29] <_joe_> ok [13:43:33] *I [13:43:33] <_joe_> yes [13:43:45] <_joe_> ok we need to discuss this with traffic too [13:43:50] also [13:43:52] in the future [13:43:57] when we have proxysql 2.0 [13:44:01] and tls support [13:44:09] <_joe_> we can switch there, yup [13:44:20] that is what I said that we could consider mysql active-active dc [13:44:33] consider doesn't mean it is easy and there are many things to fix [13:44:47] but I want to think we will get eventually there [13:44:58] <_joe_> +1 [13:45:01] and not setup anything that makes things worse [13:45:54] on those tickets, we did performance testing [13:46:15] and if we had connection pooling, which also is interesting for local-dc connections [13:46:46] latency is not great, but throughput of 1/2 of the traffic could be (maybe) doable [13:47:12] to the point that the biggest issue nobody is thinking about is replication lag [13:47:39] I mean, it is not that nobody is thinking, I mean that I believe that will be the worst problem to solbe [13:47:42] *solve [13:48:03] so, _joe_, in summary [13:48:45] an imperfect proxysql setup now may work, and may in the future open the possibility of improving multi-dc and load balancing [13:49:01] in small step improvements [13:49:38] because traffic is not my strength, I ask you for suggestions (app layer/traffic) for the tunneling part [13:50:33] especially to set up the same thing other people
are using [13:50:58] (for suggestions, I will be setting it up myself) [14:52:04] <_joe_> I don't think we have a predefined solution [14:52:15] <_joe_> but let's talk with vgutierrez as well tomorrow [14:52:58] <_joe_> jynus: replication lag across DCs is kinda protected by chronologyprotector [14:53:04] yeah [14:53:26] but just applying that cross-dc will reduce our throughput 10x [14:53:28] <_joe_> we need to switch it to use the GTID though [14:53:36] <_joe_> applying what? [14:53:39] it already uses GTID [14:53:48] <_joe_> chronologyProtector? [14:53:51] <_joe_> are you sure? [14:54:01] <_joe_> I think it still requires master_pos_wait [14:54:03] (and actually I want to stop doing that, but that is another story) [14:54:42] <_joe_> I mean it's configured in production to use the local master position, rather than the gtid [14:55:02] <_joe_> it supports all three methods - local binlog position, mariadb gtid and mysql gtid [14:55:05] _joe_: look at this: [14:55:37] https://dbtree.wikimedia.org/ search the tree of s6 [14:56:03] <_joe_> so say we have a laggy operation right now [14:56:15] <_joe_> that creates a 2 seconds lag at each hop [14:56:21] ? 
[14:56:32] no, I mean that we cannot be using master binlog positions [14:56:47] because all our s6 replicas do not know the master position [14:56:49] <_joe_> yeah we don't in mysql for replication [14:57:03] <_joe_> I'm not sure how we configured mediawiki [14:57:05] because its master is not the real master [14:57:11] <_joe_> I know [14:57:14] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239421 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [14:57:22] <_joe_> I'm saying we might have that completely misconfigured already [14:57:31] I want to stop using gtid, for reasons that are unrelated [14:57:42] but that is not a discussion for today [14:58:18] (that and the wait for gtid errors I see from time to time, which means gtid is being used) [14:58:51] the problem is that waiting for a local server, which btw has semisync configured [14:59:04] takes between 0 and maybe 1 ms [14:59:18] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239430 (10Marostegui) Thanks!
``` physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Rebuilding) ``` [14:59:29] replicating to the slowest db on a separate datacenter [14:59:37] may take 30-100 milliseconds [15:00:02] I haven't really tested, but check this [15:00:13] go again to https://dbtree.wikimedia.org/ [15:00:16] and check s4 [15:00:32] there is currently 177319 seconds of lag on codfw [15:01:03] because a maintenance script is being run that waits for eqiad but not for codfw [15:01:41] if it were a latency issue and did not influence the throughput, we would have at most 1 second of lag [15:02:12] also, as you said, immediate replicas are fast because they can be optimized to apply changes in parallel [15:02:49] but 3 or 4 tiers have problems keeping up, because a replica of a replica cannot execute things with the same level of parallelism [15:03:23] so there may be a need for a not-so-simple strategy to balance performance AND lag [15:04:00] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239454 (10Marostegui) a:05Marostegui>03Papaul Disk failed ``` physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Failed) ``` Can we get another one?
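The 0-1 ms local vs 30-100 ms cross-DC numbers above translate directly into a per-connection throughput ceiling for any client that waits for replication after each write. A back-of-envelope sketch using only the figures quoted in the conversation:

```python
def max_serial_write_rate(wait_ms):
    """Upper bound on writes/second for a single connection that blocks on a
    replication wait of `wait_ms` after every write before issuing the next.
    Pure arithmetic on the latency figures quoted above."""
    return 1000.0 / wait_ms

# ~1000 writes/s with a ~1 ms local semisync wait, but only ~10-33 writes/s
# if every write waited on a 30-100 ms cross-DC replica instead:
local_rate = max_serial_write_rate(1)
cross_dc_rate = max_serial_write_rate(50)
```

This is the sense in which blindly applying the wait cross-DC "reduces throughput 10x" or worse: latency alone would cap lag near one second, so the observed 177319-second backlog has to be a throughput problem.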
[15:04:15] extending the current is technically possible (there is nothing to do) but may not be wise [15:10:37] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239500 (10Papaul) a:05Papaul>03Marostegui Another disk in place [15:37:59] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4239681 (10Marostegui) Let's see how this one goes: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 25% complete) ``` [16:51:56] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4240202 (10Cmjohnson) Support ticket created with HPE Case ID: 5329764075 Case title: Failed Hard Drive Severity 3-Normal Product serial number: MXQ62005Z0 Product number: 767032-B21 Submitted: 5/29/201... [17:10:16] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4240271 (10Marostegui) 05Open>03Resolved All good - thanks! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldriv... [17:11:01] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4240277 (10Marostegui) Awesome! Thank you! [18:12:45] 10DBA, 10MediaWiki-Database, 10MediaWiki-Platform-Team: Create Mediawiki DB abstraction for individual query timeouts - https://phabricator.wikimedia.org/T195792#4237149 (10Anomie) As for implementing something like this in other DBs: * I don't see an equivalent for SQLite. The only timeout it seems to have... [18:21:56] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4240535 (10Anomie) This is good to go from my end.
[18:23:00] 10DBA, 10Multi-Content-Revisions, 10Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4240536 (10Anomie) This is good to go from my end. [20:19:55] 10DBA, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: request for misc DB allocation - https://phabricator.wikimedia.org/T192875#4241019 (10Volans) @jcrespo FYI I was deploying debmonitor today and the replication broke on `db1065` and `db1117` because of missing `debmonitor` database. I... [20:43:03] 10DBA, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: request for misc DB allocation - https://phabricator.wikimedia.org/T192875#4241121 (10jcrespo) @Volans I believe this to be https://bugs.launchpad.net/mydumper/+bug/1558164 of myloader, used to clone the database. This is a nasty bug,... [20:45:54] 10DBA, 10Operations-Software-Development, 10Patch-For-Review: Debmonitor: request for misc DB allocation - https://phabricator.wikimedia.org/T192875#4241137 (10Volans) @jynus got it, thanks for the info. FYI if you want to test your workaround solution, there is another DB missing: `frimpressions`. I didn't...