[05:31:26] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141772 (10Marostegui) So this is almost confirmed related to atop. I killed it yesterday at around 14:30 and it was remained stopped till 00:00 (where it started automatic... [05:35:32] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4141775 (10Marostegui) RX buffers reverted ``` root@db1114:~# ethtool -g eno1 Ring parameters for eno1: Pre-set maximums: RX: 2047 RX Mini: 0 RX Jumbo: 0 TX: 511 Current... [08:01:16] 10DBA: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462#4141903 (10Rduran) a:03Rduran [08:59:44] <_joe_> let me know when everyone's around, and we can start [09:00:04] * volans \o/ [09:00:50] I'm here [09:01:05] I don't think manuel will make it [09:01:16] <_joe_> ok [09:02:07] <_joe_> so, I started thinking about the "databases pooling/depooling on etcd" problem [09:02:29] <_joe_> and I wanted first and foremost confirmation from you that what we're interested in is what follows: [09:02:51] <_joe_> - generate 'sectionLoads' (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/db-eqiad.php#104) from etcd data [09:03:23] <_joe_> - generate 'groupLoadsBySection' (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/db-eqiad.php#259) from etcd [09:04:16] <_joe_> jynus: is that correct? I extracted those as the two parts of that file that you and marostegui modify most [09:04:26] so yes and no [09:04:37] <_joe_> please do tell :) [09:04:48] personally, what I would want is something else, speciall on interface [09:05:12] whether we have to translate into that, I guess will depend? [09:05:29] what we *need* [09:05:50] is a way to pool and depool fully individual servers [09:06:05] and to set which of the pooled servers is the master [09:06:15] <_joe_> set the master from etcd? [09:06:23] yes [09:06:33] <_joe_> ok, I thought you specifically didn't want that [09:06:38] technically what you mention already did that [09:06:45] jynus: let me clarify one thing, when joe means 'modify' it doesn't mean modify those arrays in conftool as they are now [09:06:46] sectionLoads sets the mediawiki master [09:07:02] as the one being defined the first one [09:07:27] whatever we have in etcd will be re-constructed into those structures becaue MW wants that, but we can have more or less what we want in the etcd side [09:07:28] <_joe_> jynus: when you say "depool" you mean "remove from the configuration", right? [09:07:32] no [09:07:43] there is 2 states- added into the config [09:07:47] <_joe_> you mean weight=0 ? [09:07:52] (lines at the end) [09:07:58] and pooled (with load) [09:08:08] actually weight 0 does not depool a server [09:08:08] <_joe_> how do you currently depool a server? [09:08:27] we commend all lines on sectionLoads and groupLoadsBySection [09:08:35] <_joe_> ok, as I suspected [09:08:50] <_joe_> are you interested in being able to change the weights as well? 
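For context on the two arrays linked above: in 'sectionLoads' the first entry of each section is what MediaWiki treats as the master (as jynus clarifies below), and 'groupLoadsBySection' holds the extra per-group weights. A minimal Python mirror of their shape, with invented host names and weights; this is an illustration only, not the real db-eqiad.php content.

```
# Hypothetical illustration of the shape of the two arrays discussed above.
sectionLoads = {
    's1': {
        'db1111': 0,      # first entry == the section master
        'db1112': 200,    # replicas and their main-traffic weights
        'db1113': 500,
    },
}

# groupLoadsBySection adds per-group weights ('vslow', 'dump', 'api',
# 'recentchanges', ...) on top of the main traffic for each section.
groupLoadsBySection = {
    's1': {
        'vslow': {'db1112': 1},
        'api':   {'db1112': 100, 'db1113': 300},
    },
}
```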
[09:08:53] setting weight to 0 means it is still pooled by the load balancer [09:08:55] <_joe_> via etcd I mean [09:09:02] yes [09:09:15] we can have weight -1 that removes it and weight >= 0 that sets the weight [09:09:16] let me put a priority to all of that [09:09:26] <_joe_> ok [09:09:30] if that is helpful [09:09:41] <_joe_> this is all doable, but yes please [09:10:09] pool/depool > weights per server > master definition > read-only [09:10:28] <_joe_> ok [09:10:30] the problem as it is now, is that pooling/depooling is overly complicated [09:10:37] (aside from static) [09:10:54] each server can be pooled on 6 different "services" [09:10:58] actually more [09:11:03] <_joe_> jynus: so this would all be solved (minus master definition) by the strawman I prepared [09:11:19] 09:10 < jynus> pool/depool > weights per server > master definition > read-only --- agreed! [09:11:20] <_joe_> but I see one problem - you would need to visualize the state of clusters [09:11:28] I don't care how mediawiki does it [09:11:42] but we need a ./depool in the end [09:11:47] for easy orchestration [09:11:58] (e.g. rolling restart/schema changes) [09:12:00] <_joe_> you need a way to see - immediately - what is configured for a specific section [09:12:06] yes [09:12:11] <_joe_> ok [09:12:35] one common error nowadays FYI [09:12:46] is to depool a server from a service [09:12:51] Ideally we should be able to say: ./depool dbXXXX vslow or something like that. But for now a general depool/pool would be cool for me [09:12:55] and probably a safety check like pybal to not depool if there are too few pooled [09:12:55] but forget other kinds of traffic [09:13:00] Will simplify things a lot already [09:13:17] volans: correct, there should always be 1 server for each type of traffic [09:13:26] with exceptions [09:13:39] <_joe_> please let's remain focused :) [09:13:51] <_joe_> the discussion on safety checks can come later [09:13:54] <_joe_> and will happen [09:13:57] _joe_: my point is we may need deep mediawiki changes too [09:14:05] <_joe_> jynus: I don't think so [09:14:07] not just moving variables to etcd [09:14:16] for what? [09:14:30] <_joe_> but let's first concentrate on the desired workflow for you people? [09:14:52] _joe_: I can tell you how we work for those patterns [09:15:02] as "common things we do" [09:15:09] is that helpful?
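One way to picture the distinction made above ("present in the config" vs "pooled with a load", with weight 0 explicitly not meaning depooled) is an explicit pooled flag next to per-group weights; the alternative floated in the chat is a weight of -1 meaning removed. A sketch with hypothetical field names, not the real conftool schema.

```
# Hypothetical per-instance record: 'pooled' is independent of the weights,
# so weight 0 keeps the host pooled (as noted above), and depooling removes
# it from every kind of traffic at once.
instance = {
    'name': 'db1111',
    'pooled': True,
    'weights': {          # one weight per type of traffic ("group")
        'main': 200,
        'api': 100,
        'vslow': 0,       # weight 0: still pooled, just unweighted
    },
}

def serves(inst, group):
    """True if the instance should receive traffic for the given group."""
    return inst['pooled'] and group in inst['weights']
```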
[09:15:15] <_joe_> jynus: yes [09:15:20] <_joe_> very useful [09:15:22] <_joe_> :) [09:15:27] so, roughly in order of frequency [09:15:54] server is in a bad state or will be for maintenance, depool server - no longer send any traffic to it [09:16:08] server is back and maintenance finished, re-pool with low weight [09:16:09] +1 [09:16:31] server is ok, pool with original traffic [09:16:51] (a server may take 1h to 12h to be in perfect condition because of cache) [09:17:24] Q: in those cases, let's say the server is also in the vslow section, and is the only one [09:17:26] a server is overloaded, spread traffic among other hosts (change weights for several types of traffic) [09:17:41] <_joe_> volans: wait please [09:18:01] I have other kinds of things, I am thinking which is more likely [09:18:11] but on a lesser immediate need: [09:18:28] - switchover/failover the master to another host for maintenance, upgrade or failure [09:18:46] - for the previous, read only has to be set for the section temporarily [09:19:15] things like adding new servers or removing them can be fully static (git) [09:19:43] <_joe_> yes, adding servers will happen via commits to (for now) both mediawiki-config and conftool-data [09:19:47] that is ok [09:19:55] also for decommissioning [09:20:13] our biggest pain is that a schema change or other maintenance (upgrade) [09:20:23] requires a rolling pool/depool [09:20:28] <_joe_> so, can I focus on the first set of actions? [09:20:32] and now that is very very painful [09:20:34] please do [09:20:39] <_joe_> depool / warmup pool / full pool [09:20:49] <_joe_> there is a major UX change at hand [09:21:15] Yeah, I agree depool/warm up pool/full pool will alleviate A LOT already [09:21:22] my thoughts were to change the interface from section-based [09:21:24] <_joe_> right now, to make sure you don't screw anything up, you are relying on static arrays you can inspect in your editor, and on code-review [09:21:27] to server-based [09:21:40] at least for our side [09:21:54] <_joe_> once data is in etcd, you don't have that luxury anymore [09:21:55] so instead of saying sectionX has server X, Y and Z [09:22:12] <_joe_> jynus: we'll get to implementation later, bear with me [09:22:16] ok [09:22:28] <_joe_> so what I envision as a workflow is: [09:23:34] <_joe_> - dbconfig get dbXXX[:PORT] gets you all the current configuration of a mysql instance (so either host:port or host only if it's the default port) [09:24:12] <_joe_> - dbconfig list s1 shows what is the current configuration for s1 [09:24:47] <_joe_> - dbconfig depool dbXXX:PORT removes the database from all configurations [09:25:12] (small correction, we should just use label, which is normally host:port, but could be arbitrary, labels are defined statically in code) [09:26:08] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/db-eqiad.php#516 [09:26:09] <_joe_> - dbconfig pool LABEL repools the server at the weights defined in its configuration [09:26:47] <_joe_> - dbconfig edit LABEL allows you to edit the details of the configuration, including weights [09:27:25] can the configuration be multivalued? [09:27:41] <_joe_> - dbconfig warmup LABEL 0.1 pools a depooled database with weights set to ceil(weights*0.1) [09:27:53] <_joe_> jynus: what do you mean "multivalued"? [09:28:10] a server has a weight for each type of traffic [09:28:14] <_joe_> yes [09:28:35] in other words, it has multiple weights, one for each of 'main', 'vslow', etc.
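The warmup step in the workflow above is the only one with any arithmetic: repool at ceil(weight * factor) for every group. A small sketch of just that calculation; dbconfig itself did not exist at this point, so the helper below is purely illustrative.

```
import math

def warmup_weights(configured, factor=0.1):
    """Weights to use while a freshly repooled instance warms its caches,
    per the 'dbconfig warmup LABEL 0.1' idea above: ceil(weight * factor)."""
    return {group: math.ceil(weight * factor) for group, weight in configured.items()}

# e.g. warmup_weights({'main': 200, 'api': 100, 'vslow': 1}, 0.1)
# -> {'main': 20, 'api': 10, 'vslow': 1}
```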
[09:28:40] <_joe_> so my original strawman for the etcd schema was https://gerrit.wikimedia.org/r/#/c/422373/ [09:29:01] now, given this status, I mentioned that we would lose the cluster view of things, which I think is important when moving things around [09:29:22] <_joe_> volans: yes, that's why I think we should decide safety checks [09:29:35] <_joe_> there is another thing in my proposal that I wanted confirmed [09:29:52] <_joe_> a single LABEL can refer to multiple sections, right? [09:30:19] let's call it instance, label is not really the official name [09:30:33] <_joe_> ok [09:30:41] <_joe_> can one instance serve multiple sections? [09:31:00] not normally, but it can happen [09:31:08] <_joe_> if so, I guess we need "depool/pool" to be able to act on all sections or just one [09:31:17] but it would be ok [09:31:24] to depool only 1 section in that case [09:31:31] that is a very special case [09:31:40] <_joe_> ok [09:31:45] like when moving a wiki from one to the other [09:31:56] if we go towards multi-instances a host:port should belong to only one section AFAIUI [09:32:03] or creating a new section [09:32:10] volans: yes, but there are cases where it can happen [09:32:22] when we created s8 by splitting s5 [09:32:33] hosts were at the same time on s8 and s5 [09:32:36] <_joe_> anyways, let me list the safety checks I would like to add, and tell me if more are needed: [09:32:38] sure [09:32:49] volans: it should not be the norm [09:33:13] <_joe_> Any action we take should guarantee that the following conditions still hold valid: [09:33:16] _joe_: how easy is it to edit the schema afterwards [09:33:21] ? [09:33:37] <_joe_> jynus: not very hard, just need care :) [09:33:42] e.g. new sections can be added [09:33:53] one extra clarification [09:33:57] before you go on [09:33:57] <_joe_> sections can be added easily [09:34:13] there are additional sections that are hidden [09:34:45] <_joe_> ?
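Putting the constraints above together (label is normally host:port, an instance usually serves one section but can temporarily serve two, as in the s5/s8 split), an etcd object per instance could look roughly like the sketch below. This is not the actual strawman from the Gerrit change linked above; all keys are assumptions.

```
# Hypothetical etcd value for one instance (label db1111:3306).
instance_record = {
    'host': 'db1111',
    'port': 3306,
    'sections': {
        's5': {'pooled': True, 'weights': {'main': 200, 'api': 100}},
        's8': {'pooled': True, 'weights': {'main': 200}},  # rare overlap, e.g. during a split
    },
}

# A depool could then act either on a single section
# ("dbconfig depool db1111:3306 s5") or on every section the instance serves.
```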
[09:34:47] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/db-eqiad.php#666 [09:35:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/db-eqiad.php#10 [09:35:24] those are sections like the others from our point of view, but are defined on other arrays [09:35:26] <_joe_> ok, those sections are separated from the main databases [09:35:35] <_joe_> for now [09:35:40] es1, es2, es3 and pc1,2,3 [09:35:47] <_joe_> we can convert them later I guess [09:35:51] just saying that those eventually will have to be handled [09:35:58] but will be more of the same [09:36:05] <_joe_> yeah, we'll bend those use-case to the general one I guess [09:36:24] <_joe_> so, back to safety checks, I would say that after an edit we have to ensure: [09:36:30] <_joe_> - a section has a master [09:36:39] <_joe_> - a section has at least N instances [09:36:51] <_joe_> - every group has at least 1 instance per section [09:37:04] <_joe_> group as in vslow, etc [09:37:20] let me thing about that [09:37:33] as that is trivially true for normal cases [09:37:42] I am thinking bad states now [09:37:42] <_joe_> if any of these conditions is not met (we can add/remove any) we refuse an edit [09:37:51] _joe_: I would say also groups should have N configurable minimum instances [09:37:56] <_joe_> unless the user passes the magic --bblack command-line swithc [09:38:02] and whether if we should enforce on schema [09:38:12] or just give a warning on edit [09:38:38] <_joe_> jynus: I would say we add a --force switch to the command line to let you shoot yourself in the foot in case of need [09:38:43] <_joe_> bypassing all safety checks [09:38:49] In particular, "- every group has at least 1 instance per section" is not true right now [09:38:55] and that is as a normal thing [09:38:55] <_joe_> also, you can use confctl directly, which bypasses all that [09:39:09] <_joe_> jynus: ok so we can remove that safety check [09:39:19] for example, s3 (small wikis) do not really need different groups because how small wikis are [09:39:25] so we consolidate on just 4 servers [09:39:28] <_joe_> ok [09:40:03] of course, that could be changed to have an arbitrary host, or all of them [09:40:07] <_joe_> I have enough material to make you a complete and concrete proposal, I think [09:40:10] for the sake of normalization [09:40:16] that's why I think groups should have a configurable N of hosts, maybe tomorrow for s1 we need at least 3 recentchanges, who knows [09:40:35] <_joe_> volans: yeah we can do that adding an object per section [09:40:54] _joe_: I think checks should be configurable outside of the intrinsic schema [09:41:08] so it is easy to add exceptions without touching it [09:41:53] _joe_: also outside of etcd, interfaces to see the state [09:41:54] <_joe_> jynus: yes, my idea was you have a "section s1" object where you define things like: which instance is master, minimum number of general instances, minimum number for each group [09:42:11] checks can be done on 2 levels, basic ones with json schema to enforce that the schema is formally correct, and then with code for more complex logic checks (joe correct me if I'm wrong) [09:42:13] <_joe_> jynus: yes, that's relatively easy I think [09:42:17] basically, https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php [09:42:28] <_joe_> volans: yes that's what I was talking about [09:42:38] (important, remember there is 2 configuration, one per datacenter at the time) [09:43:00] I think that is 
already in place for mediawiki [09:43:02] <_joe_> jynus: basically the output of "dbconfig show all" :) [09:43:07] so probably it is trivial to do [09:43:08] yeah [09:43:27] I would also seek the feedback from aarong (I can ask him) [09:43:30] <_joe_> so my first strawman for mediawiki was https://gerrit.wikimedia.org/r/#/c/422374/ [09:43:34] as he is the receiver of all of that [09:43:45] <_joe_> we will have a more refined version of that in some time [09:43:53] as in, his loadbalancer is the one that takes that and uses it [09:43:57] <_joe_> I'll write a ticket with all the info I gathered now [09:44:13] jynus: as of now the proposal is to have MW see exactly the same structure [09:44:23] just that the data comes from etcd instead of the static file [09:44:23] _joe_: regarding workflow [09:44:24] <_joe_> yes, mediawiki will not change at all [09:44:37] should we test on a single, small section first? [09:44:49] <_joe_> jynus: in beta, even [09:44:53] of course [09:44:56] <_joe_> and then on a small section, yes [09:44:56] I mean after that [09:45:01] on production testing [09:45:12] so normally we go for s6 and s2 [09:45:19] <_joe_> we can even switch everything on the debug servers only [09:45:27] that would be cool [09:45:41] as this is the kind of thing where we find issues after a long time [09:45:53] <_joe_> yeah [09:45:59] then go and test maintenance cycles with it [09:46:03] <_joe_> you want to play with a sandbox [09:46:05] automation, etc. [09:46:06] <_joe_> makes sense [09:46:27] do you have 5 minutes to review one thing [09:46:33] that is not technically related to it [09:46:38] but in a way it is [09:46:50] (unless you want to ask more questions) [09:46:55] and please ask for help [09:46:56] Q: I know you're designating some hosts to be delegated masters in the sense that they should be the first host to look at in case of a master failure and have STATEMENT binlogs, is that information that would be useful to have in the dbconfig output from etcd? [09:47:09] volans: I don't think so [09:47:16] these are comments for us [09:47:28] so we pre-select and prepare in case of an emergency [09:47:31] <_joe_> jynus: no I have all the info I needed [09:47:40] it would be nice to maybe have comments on etcd? [09:47:43] mark, jynus I have changed the meeting for tomorrow same time. [09:47:52] <_joe_> it looks like volans and I have some coding to do, but this is generally already in the direction we had [09:48:20] as we lose the "# broken, do not repool"? [09:48:29] maybe a comment on the host entry? [09:48:36] that make sense [09:48:39] *makes [09:48:40] or something on the text configuration version? [09:48:41] <_joe_> jynus: heh, fair enough [09:48:53] I assume that is trivial [09:49:09] and not be used by mediawiki [09:49:37] <_joe_> yes [09:49:45] <_joe_> mediawiki will just get the data structures [09:49:51] so I wanted to talk a bit at a high level [09:50:03] about how we do database configuration [09:50:15] etcd is part of it, but not the only part [09:50:19] Q: from the operational point of view is it ok to just have a way to see the cluster status and then edit a single db object and then re-check the cluster status or do you think you need some sort of cluster-diff before saving the change? [09:50:41] <_joe_> volans: MVP please [09:50:44] <_joe_> :P [09:50:59] _joe_: I didn't say right now, but to understand the mid-term plan ;) [09:51:03] <_joe_> yes that would be nice for sure, but maybe 2nd iteration?
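The safety checks agreed on a bit further up (a section has a master, a section has at least N instances, per-group minimums configurable per section, everything bypassable with --force or by using confctl directly) translate into a handful of assertions over the section object. A sketch under those assumptions, reusing the hypothetical shapes from the earlier sketches; not a real implementation.

```
def validate_section(section, instances):
    """Return the list of violated constraints; an empty list means the edit
    can be saved. 'section' is expected to carry 'master', 'min_instances'
    and 'min_per_group'; 'instances' are records with 'name', 'pooled' and
    per-group 'weights' (all hypothetical shapes)."""
    errors = []
    names = {inst['name'] for inst in instances}
    pooled = [inst for inst in instances if inst['pooled']]

    if section.get('master') not in names:
        errors.append('no master defined for this section')
    if len(pooled) < section.get('min_instances', 1):
        errors.append('too few pooled instances')
    for group, minimum in section.get('min_per_group', {}).items():
        serving = [inst for inst in pooled if inst['weights'].get(group, 0) > 0]
        if len(serving) < minimum:
            errors.append("group '%s' has %d instances, needs at least %d"
                          % (group, len(serving), minimum))
    return errors

# A --force flag (or editing with confctl directly) would skip these checks
# entirely, as discussed above; s3-style sections simply configure no
# per-group minimums.
```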
[09:51:24] <_joe_> ack [09:51:26] So I created P6953 [09:51:44] https://phabricator.wikimedia.org/P6953 [09:52:03] <_joe_> oh that's juicy [09:52:13] and want to know your opinion on it [09:52:22] the title is more alarming than it really is [09:52:27] so I can catch your attention [09:52:34] (which apparently, I did) [09:52:42] <_joe_> I generally agree with the thesis - puppet sucks for configuring databases [09:53:18] the main issue right now is that when we reimage and change, e.g., a section [09:53:29] I have to update 5 things [09:53:49] 2 on puppet, 1 on internal lists, 1 on mediawiki, and 1 on internal monitoring [09:54:31] we also have moved to a model where there is an arbitrary number of instances per server [09:54:47] and we do manage ganeti instances or kubernetes on puppet [09:54:57] I want a paradigm shift for dynamic things like that [09:55:03] *we don't [09:55:25] where things are managed, but not statically [09:55:30] <_joe_> yes, we are in the middle of a paradigm shift I tried to start 3 years ago :) [09:56:19] accounts are hell, everybody adding them without tracking where [09:56:29] <_joe_> so, from a quick skim of what you wrote, it seems like you want to automate procedures, right? [09:56:33] yes [09:56:45] and for that, I need to move (some) away from puppet [09:56:47] <_joe_> so for accounts, I'm not sure that's "dynamic configuration", to be honest [09:57:11] it may not be, but it is the same issue [09:57:18] <_joe_> the problem is that puppet (and other tools like it) suck at managing those [09:57:25] puppet sucks ... exactly [09:57:35] <_joe_> in theory, you'd like db grants and users to be declared in say a yaml file [09:57:41] I mentioned a related issue with tracking relationships between roles [09:57:50] X needs an account on service Y [09:58:03] (it may also be our way of using puppet) [09:58:04] <_joe_> and to have a tool to ensure they're like that [09:58:19] yes, that would work [09:58:26] but also track ips of origin servers [09:58:31] and hosts serving those services [09:58:37] both of which are constantly changing [09:58:52] e.g. mediawiki servers need access to core databases [09:58:59] <_joe_> origin servers == applications?
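The "grants and users declared in say a yaml file, plus a tool to ensure they're like that" idea above maps naturally onto a diff between a desired set of grants and the per-account output of SHOW GRANTS. A minimal sketch of only the diff part; the account data is invented and the collection of the live grants is left to whatever client is used.

```
# Desired state, e.g. loaded from a YAML file:
# {(user, host_pattern): set of GRANT statements}. Accounts below are made up.
desired = {
    ('wikiuser', '10.64.%'): {
        "GRANT SELECT, INSERT, UPDATE, DELETE ON `enwiki`.* TO 'wikiuser'@'10.64.%'",
    },
    ('dump', '10.64.%'): {
        "GRANT SELECT ON `enwiki`.* TO 'dump'@'10.64.%'",
    },
}

def grant_diff(desired, actual):
    """Compare desired grants with the live ones (per-account SHOW GRANTS
    output, normalized the same way). Returns (missing, extra)."""
    missing, extra = {}, {}
    for account, grants in desired.items():
        current = actual.get(account, set())
        if grants - current:
            missing[account] = grants - current
        if current - grants:
            extra[account] = current - grants
    for account in set(actual) - set(desired):
        extra[account] = set(actual[account])   # accounts nobody declared
    return missing, extra
```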
[09:59:19] but there is not a good way right now to track those sets of servers and create the relationship [09:59:26] this is only an example [09:59:27] * volans has to jump on another meeting but will read backlog later [09:59:41] and I do not have a firm proposal [09:59:46] except to tool it separately [09:59:52] or [09:59:57] <_joe_> well, actually puppet would help for that [10:00:04] add the functionality to puppet [10:00:11] for accounts it is feasible [10:00:19] for others, it isn't (topology changes) [10:00:20] <_joe_> but we'd need to invest significant time in building what is needed [10:00:40] <_joe_> the one thing you can't do via puppet is what needs coordination between nodes [10:00:40] you cannot track master-slave sensibly on puppet [10:00:47] <_joe_> puppet is not designed for that [10:00:47] I agree [10:00:58] that is why I added account handling to the issue [10:01:23] <_joe_> what you can do via puppet is gathering the list of ips of all nodes that run a specific service, for instance [10:01:24] as a DBA, my initial thought is to create a database + monitoring, but I am biased [10:01:55] <_joe_> but that's just a puppetdb query away from any tool, too [10:02:29] let me give you a concrete example [10:02:37] very quickly [10:02:59] I want to build an inventory app to track table schema status [10:03:11] and set up rolling schema changes based on that cached state [10:03:38] <_joe_> ok that definitely cannot be managed via puppet [10:03:40] so a database (which is the state of all production tables) + a monitoring (retrieving that regularly) [10:04:06] same for topology changes, that can only be seen by checking the current state of the instances [10:04:28] but doing that, means that we no longer track mariadb::master [10:04:46] but mariadb (with state master, sometimes replica) [10:04:51] <_joe_> good candidates for management outside of puppet are things that have the following features: [10:04:58] puppet keeps installing packages, config, etc. [10:05:02] <_joe_> - changes are internal to mysql [10:05:13] <_joe_> - they require cross-node coordination [10:05:18] but read only state and replication is handled and tracked outside of puppet [10:05:36] <_joe_> yeah replication topology checks both conditions [10:05:42] <_joe_> schema changes, too [10:05:44] same for provisioning [10:06:06] is the host empty? clone it from the backups server! [10:06:23] <_joe_> if we had the switchdc spinoff, most of those things would be almost trivial... [10:06:35] <_joe_> I have the same issues with etcd, by the way [10:07:04] <_joe_> I want to manage failovers, rolling restarts, full backup/recovery of the cluster from a disaster [10:07:10] <_joe_> switching replica direction [10:07:25] <_joe_> all things that I could mostly solve if I had that spinoff [10:07:26] basically make our bare metal install a bit more cloud-like [10:07:28] <_joe_> :) [10:08:13] "these servers are databases", but exactly the role they are doing is dynamic based on needs [10:08:29] <_joe_> yes, I get it.
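The "inventory app to track table schema status" idea above boils down to caching a fingerprint of every table definition per instance and driving the rolling schema change from that cache. A sketch of the two core helpers; the actual collection (running SHOW CREATE TABLE everywhere and refreshing the cache) is left out, and the data shapes are assumptions.

```
import hashlib

def schema_fingerprint(create_table_sql):
    """Stable fingerprint of one SHOW CREATE TABLE output."""
    return hashlib.sha256(create_table_sql.strip().encode('utf-8')).hexdigest()

def pending_instances(inventory, table, target_fingerprint):
    """inventory: {instance_label: {table_name: fingerprint}}, refreshed
    regularly by a collector and cached in a database. Returns the instances
    where the table does not yet match the target definition, i.e. the ones
    a rolling schema change still has to visit."""
    return sorted(label for label, tables in inventory.items()
                  if tables.get(table) != target_fingerprint)
```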
For the provision part, that can be solved by having a systemd timer set OnBootSec that runs a script that does that check :) [10:08:33] so in a nutshell that is my biggest issue with puppet roles [10:08:36] <_joe_> you mean the sections installed as well [10:08:45] not that the style is wrong [10:09:06] is that things that are roles right now, shouldn't be on puppet [10:09:11] <_joe_> your problem is you want to change and mix the way machines are configured on the fly [10:09:18] not fully [10:09:41] just content and a small subset of the config/state [10:09:54] <_joe_> you consider those servers as the PaaS for your mysql services, let's say [10:10:07] again, not the full way [10:10:14] but to some extent [10:10:30] e.g. not need full configurability [10:10:31] <_joe_> yeah, I got it, or I would've proposed to use kubernetes and statefulsets :D [10:10:52] just here is a database server, most of the config on start is the same [10:11:12] but how mediawiki uses it, and if it is a master or a replica, it is dynamic [10:11:50] <_joe_> so it's ok for the list of instances for a server to be static [10:12:09] that is the part I am not 100% sure [10:12:19] in most cases, yes [10:12:22] <_joe_> and the correspondance instance <=> section too, I guess? [10:12:44] there may be some cases where moving instances around, long term, may be needed [10:12:55] specially now that we are going multi-instance [10:13:07] an backups with autoprovisioning [10:13:17] but that is not an immediate need [10:13:28] for now, the number if instances is fixed, but different [10:13:40] <_joe_> so, I think I get your issue with how we define "roles" in puppet [10:13:42] so I put them on hiera, but I don't like that [10:13:54] and I am not following the style [10:14:18] (eg. dbstore2001 and dbstore2002 have an arbitrary number of instances) [10:14:42] the more immediate change [10:14:46] <_joe_> in this scenario, you basically want a generic profile::mariadb::multiinstance applied to your servers, say, and have hiera define which ones, per host [10:15:05] would be all hosts are "core" (for mediawiki) [10:15:19] and master/replica is handled somewhere else [10:15:27] <_joe_> I don't think it's bad per se, you are managing your infrastructure under different premises than the rest of it [10:15:47] as I said, like kubernetes or VMs state [10:15:56] with many buts [10:16:09] and simplifying things [10:16:18] <_joe_> and you could keep the role/profile structure (which is good imho), and just use hiera on a per-host basis, or via (ugh) regex.yaml [10:16:29] well, [10:16:50] <_joe_> this is quite reasonable overall, and we can remove things from puppet's preying hands if we need to [10:17:55] I don't think you will like this: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/dbstore2001.yaml [10:17:58] <_joe_> to be honest, I never got before what your vision was, hence the confusion on the topic [10:18:48] (forget about the actual structure, I mean having one of those per host) [10:18:50] <_joe_> jynus: no I think with the help of puppet4's magic we can find a way for you to define a single hash of properties, basically, to feed to a common-purpose profile [10:19:06] <_joe_> jynus: I think it's ok if an host is unique [10:19:19] that is another possibility, not sure if the best option either [10:19:21] <_joe_> and it's logically not tied to others [10:20:15] so at least I think I was able to trasmit my needs and make you understand my vision [10:20:24] no need to seach for a solution 
now [10:20:31] but now you have the background [10:20:39] <_joe_> yes, so I would ideally see a puppet structure for this: [10:21:06] and that is the part that puppet can do [10:21:14] <_joe_> - role::mariadb (includes standard, profile::firewall, ..., profile::mariadb::multiinstance) [10:21:20] I believe there are others that will have to be handled separately [10:21:39] (account management and tracking, topology) [10:21:55] e.g. etcd may be helpful not only for mediawiki [10:22:09] <_joe_> - profile::mariadb::multiinstance (includes firewalling, monitoring for all instances defined via hiera, and defines all the corresponding mysql::instance entries) [10:22:12] but also as the reference for many other things, even if they are triggered by puppet [10:22:29] <_joe_> s/mysql/mariadb) [10:22:31] _joe_: that is almost done [10:22:44] of course, there is a lot of pending migration [10:23:12] so many roles that have to be moved to profiles [10:23:24] duplication between core and core_multiinstance [10:23:28] <_joe_> and for provisioning, you can really solve it by adding a script that runs OnBootSec and runs after the corresponding mysql instance is up [10:23:30] artifacts of the migration [10:23:33] yes [10:23:40] I am ok with puppet triggering changes [10:23:42] <_joe_> and that checks if the db is empty and provisions it [10:24:00] but then they have to check on script logic beyond puppetdb [10:24:19] <_joe_> the script is local to the machine, puppet doesn't trigger it [10:24:22] (e.g. etcd - what should this instance contain?) [10:24:34] <_joe_> but it can configure it [10:24:35] the script can be provisioned by puppet [10:24:40] exactly [10:24:47] <_joe_> both the script and its configuration [10:24:50] I think you get my vision [10:25:06] <_joe_> yes, I think most of this is doable with a decent amount of effort [10:25:08] the exact details depend [10:25:19] there are things like provisioning that can wait for a puppet run [10:25:39] other things like monitoring, in most cases, should be in sync with the actual configuration [10:26:11] <_joe_> oh yeah, devil is in the details and all that [10:26:26] so with that I hope you better understand my complaints [10:26:41] which you seem to agree with once I express myself better [10:27:31] <_joe_> I just didn't understand the basis of that train of thought [10:27:45] <_joe_> because I was looking at what we have today [10:28:00] yeah, I also see that if you are doing a lot of apache work [10:28:01] <_joe_> and that could fit easily a different scheme [10:28:13] This has probably already been looked at and I missed it… but labsdb1011 is terribly lagged [10:28:15] (apache here means application servers) [10:28:22] hoo: yes [10:28:49] <_joe_> jynus: yes, I'm trying to standardize those machines further [10:28:58] you have roles, I guess, like api, imagescalers, etc.
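The boot-time provisioning idea described above (a script run by a systemd timer with OnBootSec, after the local mysqld instance is up: if the datadir is empty, recover it from the backups server) could be as small as the sketch below. The paths, the section value and the recovery command are placeholders, not existing tooling.

```
import os
import subprocess

SYSTEM_SCHEMAS = {'mysql', 'sys', 'performance_schema', 'information_schema'}

def is_unprovisioned(datadir):
    """Treat the instance as empty if no database directories exist besides
    the system schemas."""
    databases = {entry for entry in os.listdir(datadir)
                 if os.path.isdir(os.path.join(datadir, entry))}
    return not (databases - SYSTEM_SCHEMAS)

def provision_if_empty(datadir='/srv/sqldata', section='s1'):
    """What the OnBootSec script could do. The section (what this instance
    should contain) would come from configuration, possibly etcd as suggested
    above; the command below is a made-up placeholder for a restore from the
    backups/provisioning host."""
    if not is_unprovisioned(datadir):
        return False
    subprocess.check_call(['/usr/local/sbin/recover-section', section])
    return True
```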
[10:29:01] Ok :) [10:29:16] which are "roles" different than "s1 for vslow" [10:29:50] here I am talking about the general concept of role [10:29:55] not what we use for puppet [10:30:24] so we reuse the same word, and confusion starts [10:31:05] but yes, I am lately using MySQLAsAService really meaning it [10:31:41] so things like "misc servers" in the future are really hardware resources for many different services [10:32:46] <_joe_> which is basically what we do with kubernetes [10:33:02] but less hardcore [10:33:03] <_joe_> only difference is, you get no benefit from being able to spin up 100 instances in 1 minute [10:33:10] we do not need that [10:33:14] <_joe_> and you get a huge penalty for running inside a container [10:33:22] just be less static than we are now [10:33:28] just a bit [10:33:38] <_joe_> we don't need that also because how long does it take to load a dump into a db? [10:33:45] a lot [10:33:59] but the opposite extreme [10:34:11] is one hw server == instance, forever [10:34:17] <_joe_> yes [10:34:22] we need a bit more flexibility there [10:34:24] <_joe_> which is where we're coming from more or less [10:34:41] <_joe_> if we ordered hardware by the 100s every quarter [10:34:47] <_joe_> it could be a handy abstraction [10:35:35] so I just want the automation to do that, and we are actually building it right now [10:37:41] https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/ is starting to get interesting [10:39:13] let's call it a good meeting and we will keep in touch [10:39:20] <_joe_> yes! [11:04:06] 10DBA, 10Cloud-Services: Prepare storage layer for euwikisource - https://phabricator.wikimedia.org/T189466#4142395 (10Jayprakash12345) 05stalled>03Open [11:05:23] * volans back and read backlog [11:05:57] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4142403 (10Jayprakash12345) 05stalled>03Open [11:06:06] <_joe_> I'm taking a break now [11:06:21] re:spinoff, I've tried (failing) to get it into the last 2 or 3 quarterly goals, let's see if by any chance I'm able to get debmonitor out early enough this Q and get some time for it [11:06:43] what do you mean with spinoff? [11:07:23] switchdc spinoff to easily code orchestration tasks [11:07:30] that joe mentioned earlier in the backlog [11:07:41] ok, not a huge deal right now [11:07:49] until everything else can be automated [11:08:09] volans: quick question [11:08:28] it is in general though, many others need that ;) [11:08:29] sure [11:08:39] would it be feasible to create a cumin sql transport? and where should we look? [11:09:08] would it need a lot of changes, as it really is not a remote execution thing? [11:09:19] not that easy in the way you want it (having back sql client objects) [11:09:44] should we go in the direction of its own thing? [11:09:56] there are basically 2 ways of doing it, and in both cases we need to manage the parallelization ourselves [11:10:10] one is multithreads, easier but with more limitations [11:10:13] ah, because clusterssh does that for you? [11:10:30] the other is multiprocess, harder to get back python objects [11:10:56] well, compare.py already does multiple calls at the same time [11:11:02] yes it does the multiprocess stuff and async io but parses strings, not python objects from the client [11:11:18] why would I need python objects?
[11:11:25] a third way is if there is an async mysql client ofc [11:11:33] you told me that you don't want to parse mysql output [11:11:40] but have mysql client objects to play with [11:12:04] I don't understand what I meant, that probably was my fault [11:12:22] but I guess I meant I could need to maintain a connection [11:12:46] what about sharing code for the querying? [11:13:05] the puppet part to get the list of servers? [11:13:19] then do the rest on its own? [11:13:40] I guess that also doesn't make much sense as we want instances, not hosts [11:14:17] I think I would use cumin to do remote calls and that should be enough [11:14:19] I'd like to add a mysql transport to cumin, let's just design it properly, as I'd like to add one for conftool, that will give you instances for example [11:14:29] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142413 (10Marostegui) No more errors for the last 6 hours after killing atop. Also no drops or connections errors running the RX original buffers after reverting them as c... [11:14:57] so you would centralize the knowledge on etcd? [11:15:12] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4142428 (10Urbanecm) The wiki was created. [11:15:22] I'd like to be able to query for hosts with role MW that are pooled for example [11:15:23] not sure about that - if we should only do it for mediawiki databases [11:15:43] we use conftool for many things, and manage the live state of pooled/depooled there [11:15:55] being able to query for that will be useful in many cases I think [11:16:05] oh, query it, yes [11:16:14] yes as a backend [11:16:14] I don't disagree with that [11:16:26] nore sure if it should be the only backend, or the direct one [11:16:30] *not [11:16:45] e.g. something else that caches etcd [11:16:45] not the only, cumin can mix queries from multiple backends [11:17:12] I will have to think more about what I want to do [11:17:24] and then come back with a proposal [11:17:27] for your doubt from before, there are 2 ways [11:17:40] either we do ssh + mysql and get a string output [11:17:53] nah, I can do that already [11:17:54] or we do mysql client directly and manage python objects [11:17:57] and I don't like it [11:18:02] I understood you wanted the latter [11:18:03] and I do not like that either [11:18:07] lol [11:18:14] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4142430 (10Marostegui) a:03Marostegui This needs to be filtered. Assigning it to myself to indicate it is blocked on me before this can be handed over to #cloud-services-team [11:18:28] the thing is, the use case is a bit different [11:18:34] what's wrong in the latter? [11:18:39] it is not only for "remote execution" [11:18:46] I may also want monitoring [11:19:04] which I may do by caching other sources of truth [11:19:15] and automating on top of that [11:19:24] including puppetdb/cumin [11:19:48] e.g. I check what hosts there are, I discover the instances using cumin [11:19:54] but then I query those directly [11:20:13] ok, but if you think of a mysql transport for cumin, I guess you want cumin to open multiple mysql connections and allow you to perform stuff over them, is that correct?
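On the transport question above: whichever of the two ways is chosen (ssh + mysql with string output, or a real client returning Python objects), the parallelization has to be handled by the tool itself, since clustershell will not provide it for this case. Below is a thread-pool sketch of the "easier but more limited" option, using pymysql purely as a stand-in client and a local ~/.my.cnf for credentials; it is an illustration, not the proposed cumin transport.

```
import os
from concurrent.futures import ThreadPoolExecutor

import pymysql

def run_on_instance(label, sql):
    """Run one query on one instance; the label is host or host:port."""
    host, _, port = label.partition(':')
    conn = pymysql.connect(host=host, port=int(port or 3306),
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
            return label, cursor.fetchall()
    finally:
        conn.close()

def run_on_all(labels, sql, concurrency=20):
    """Returns {label: rows}. With fewer than ~200 instances plain threads are
    enough; an async client (e.g. aiomysql, mentioned below) would be the
    third option."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_on_instance, label, sql) for label in labels]
        return dict(future.result() for future in futures)

# e.g. run_on_all(['db1111:3306', 'db1111:3311'], 'SELECT @@read_only, @@hostname')
```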
so one tool on top of the other [11:20:37] yes, but you say it is not trivial [11:20:55] or an ugly hack (ssh + mysql) [11:21:15] I prefer to do the ugly hack and have, e.g., local tools [11:21:20] it's just that we need to handle the parallelization ourselves, we don't get it for free from clustershell [11:21:42] e.g. cumin runs check_health.py locally [11:21:46] have you tried aiomysql by any chance? [11:21:50] that's an option too [11:22:24] note that unlike remote execution [11:22:35] we don't need to do fancy things [11:23:19] and I don't have a problem with multiple threads, given we have less than 200 hosts [11:23:27] 200 instances, actually [11:23:32] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4142439 (10Marostegui) This has been filtered. I see a new created user (myself) that has all the critical data redacted on labs hosts. I am going to run a check private data to make comple... [11:23:50] the issue is the inventory [11:25:22] anyway, I need to think more about this [11:25:35] me too! [11:26:01] if the needs are too dissimilar, I think it is ok to be independent [11:26:24] and just share the "knowledge" (e.g. etcd information) [11:27:07] you can always import cumin and do a query with 2 lines of python to get the hosts, although right now you will not have the instances [11:27:34] also for things like etcd, I think it should be the place for mediawiki reference data [11:28:47] but it is common to have databases outside of mediawiki with their own handling (infrastructure information vs. application information) [11:29:25] e.g. mediawiki should now know about backup hosts [11:29:32] *not [11:29:37] etcd is our live-state key-value store, I don't see why those couldn't be there in different object types [11:29:43] independent from mw [11:29:50] but the infrastructure level should know where to search for those [11:30:05] sure, but those are not critical to be dynamic [11:30:27] but between puppet and etcd I guess you'll get all of them [11:30:46] I call puppet "infrastructure information" :-) [11:30:46] either they are on puppet or on etcd, I guess we'll not introduce a third way [11:30:51] :) [11:31:31] actually my conversation with joe was to have some of that somewhere else [11:31:38] where, I don't know [11:32:22] I see it a bit differently, I think that what you need is an orchestrator that applies those changes in a programmatic and safe way across a fleet that needs coordination [11:32:34] oh, I agree [11:32:37] the source of truth can still be provided by puppet or etcd [11:32:45] but that orchestrator needs a state [11:32:46] based on whether they are static or dynamic data [11:33:05] does it? or can it just enquire the fleet to get it? [11:33:05] I am not on board (yet) with using etcd for everything [11:33:18] e.g.
we have prometheus with state [11:33:29] we can query prometheus to see if a host is up [11:33:51] there are things beyond configuration [11:33:58] (mostly monitoring) [11:34:00] sure [11:34:15] btw that is another backend I'd like to add ;) [11:34:24] give me instances based on a prometheus query [11:34:27] I was talking before of "applying schema changes automatically" [11:34:54] that will unlikely be controlled by any of the 3 mentioned [11:35:17] or topology changes [11:35:28] yes, this adds quite some complexity that probably needs its own thing that uses the other tools [11:35:37] I guess topology could be on etcd [11:35:47] but you get the idea, the backend is the least problem [11:35:51] sure [11:35:54] how to store it is [11:36:13] as long as things are documented and interoperable [11:36:25] e.g. cumin can query it [11:37:01] sure [11:37:45] I think when possible we should have stateless things, you have a required configuration/topology/etc. and you check the current live state in the fleet [11:37:54] without keeping a state of the current status [11:38:04] yes, that is possible [11:38:12] for schema changes this is not possible of course and you'll need to keep the state somewhere during the process [11:38:19] but in some cases a cache is most likely required [11:38:25] exactly [11:38:31] and as long as it is understood it is cached [11:38:35] and not the real thing [11:38:36] yep [11:38:38] I think it is ok [11:38:47] it is just a "tool" cache [11:38:53] not a source of truth [11:39:11] I think we are in sync now [11:39:25] agree [11:39:30] same happens with etcd, it will be ok to cache it [11:39:37] for non-vital operations [11:40:07] cache the output of etcd? [11:40:28] yes, mediawiki does it, for example [11:40:39] for 10s [11:40:41] other tooling could do it [11:40:50] but I would advise against it [11:40:59] better to watch the keys [11:41:03] again, depends on the context [11:41:04] and be notified when they change [11:41:10] not everything requires real time state [11:41:23] and remember you were pushing to extend etcd usage to other things [11:41:34] you were the one doing that [11:41:57] sure, what I mean is that to have an up to date thing from etcd you don't need to query all the time, just query once and then watch [11:42:04] for modifications [11:42:04] I was the one suggesting to use e.g. prometheus, which has 1 minute granularity for e.g. alerts on high load [11:42:27] I think we have in mind different applications [11:43:43] "show a tree of db topology on a web" can be heavily cached [11:43:56] "perform a master failover" cannot :-) [11:44:11] ofc :D [11:47:42] I'm going to get something for lunch, to be continued, lot of interesting ideas and applications [12:40:26] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127027 (10BBlack) >>! In T191996#4139205, @Marostegui wrote: > For the record, the irq for eno1 is balanced across CPUs, so I don't think it is the bottleneck here: > ```
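The "tool cache, not a source of truth" distinction above (MediaWiki caches the etcd data for about 10 seconds; watching the keys is better for anything that must react quickly) can be captured with a tiny time-bounded wrapper around whatever fetches the live state. A sketch with the fetcher left as a placeholder; not existing tooling.

```
import time

class ToolCache:
    """Time-bounded cache for non-vital reads (e.g. drawing a db topology
    tree on a web page); never appropriate for something like a master
    failover, as noted above."""

    def __init__(self, fetch, ttl=10.0):
        self._fetch = fetch        # e.g. a function reading from etcd or prometheus
        self._ttl = ttl
        self._value = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value

# topology = ToolCache(lambda: fetch_topology_from_etcd(), ttl=60)  # fetcher is hypothetical
```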
[12:48:02] No matter what I do, deleteAutoPatrol logs in commonswiki gives me slow timer (on read, write is fast) [12:48:03] [Thu Apr 19 12:46:56 2018] [hphp] [23185:7f5766835200:0:000005] [] SlowTimer [21390ms] at runtime/ext_mysql: slow query: SELECT /* DeleteAutoPatrolLogs::getRows */ log_id FROM `logging` WHERE log_type = 'patrol' AND log_action = 'autopatrol' AND (log_id > '156606681') AND (log_timestamp < '20180223210426') LIMIT 1000 [12:48:14] Even turned batches to 100, still the same time [12:49:43] that is ok [12:50:41] oh okay, also it will hopefully gets reduced as the table gets smaller (fingers crossed) [13:33:53] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142669 (10Marostegui) >>! In T191996#4142547, @BBlack wrote: > > Not that it's probably the issue here, but this probably isn't ideal. If you look at `grep eno1 /proc/in... [13:40:32] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142677 (10Marostegui) 05Open>03Resolved a:03Marostegui So, as soon as I started atop, errors came back and packets dropped. So the culprit is clearly `atop`. I am go... [13:43:12] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4142681 (10Marostegui) [14:01:07] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129002 (10Marostegui) [15:09:05] 10DBA, 10Cloud-Services, 10cloud-services-team, 10User-Urbanecm: Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4142939 (10Marostegui) a:05Marostegui>03None Everything looks redacted on labs hosts, so all good! This is now ready for #cloud-services-team to create the views. [15:41:19] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team, 10Patch-For-Review: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4143067 (10Anomie) Someone needs to run https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+... [15:45:51] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team, 10Patch-For-Review: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#3883142 (10Marostegui) >>! In T184446#4143067, @Anomie wrote: > Someone needs to run https://gerrit.wik... [18:35:56] 10DBA, 10Collaboration-Team-Triage, 10StructuredDiscussions, 10Patch-For-Review, 10Schema-change: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936#4143759 (10Catrope) >>! In T149936#4132732, @Marostegui wrote: > @Catrope let me know about the one in s3 (it has no writes since 2015... [19:02:25] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team, 10Patch-For-Review: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4143845 (10Anomie) I can do it myself too, if @Bstorm doesn't want to. 
[20:28:10] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4144141 (10bd808) Ready for the steps described at https://wikitech.wikimedia.org/wiki/Add_a_wiki#Cloud_Services [23:38:10] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team, 10Patch-For-Review: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4144780 (10Bstorm) I think I'm supposed to hang back in cloud-land rather than pushing out production c...