[05:52:12] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4281781 (Marostegui) This is great! Thanks for getting on with this - these are my first thoughts! > Depool/pool/warmup >dbconfig [depo...
[05:54:22] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4281782 (Marostegui) p:Triage>Normal
[05:57:47] DBA, Gerrit, Operations, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4281783 (Marostegui) Open>Resolved a:mmodell This is now back to normal values: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var...
[06:55:24] DBA, MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), Patch-For-Review, Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4281840 (Marostegui)
[06:55:48] DBA, Multi-Content-Revisions, Patch-For-Review, Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4281841 (Marostegui)
[06:56:10] Blocked-on-schema-change, MediaWiki-Change-tagging, MediaWiki-Database, Patch-For-Review, Wikidata-Ministry-Of-Magic: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193#4281842 (Marostegui)
[07:29:40] I wrote my first sql in ages: select email_address, count(*) as occurences, group_concat(external_id order by account_id) from account_external_ids group by email_address having occurences > 1;
[07:29:49] or really, I love GROUP_CONCAT() :]
[07:49:07] I improved the slides 200x
[07:51:09] That is now a proper presentation!!!
[07:55:29] morning DBAs :)
[07:56:38] hello DBAs!
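Editorial sketch of the duplicate-email query quoted at 07:29:40. It is a minimal illustration, not the author's setup: it runs against an in-memory SQLite database through Python's standard `sqlite3` module, the sample rows and the `demo_connection` helper are hypothetical, and only the table and column names come from the quoted query. The `ORDER BY` inside `GROUP_CONCAT()` is left out because older SQLite versions do not accept it, so the order of the concatenated ids is not guaranteed here.

```python
import sqlite3


def demo_connection():
    """Build a throwaway in-memory table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account_external_ids ("
        "account_id INTEGER, email_address TEXT, external_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO account_external_ids VALUES (?, ?, ?)",
        [
            (1, "a@example.org", "username:alice"),
            (2, "a@example.org", "mailto:a@example.org"),
            (3, "b@example.org", "username:bob"),
        ],
    )
    return conn


def duplicate_emails(conn):
    """Return (email, count, comma-joined external ids) for shared emails."""
    query = """
        SELECT email_address,
               COUNT(*) AS occurrences,
               GROUP_CONCAT(external_id) AS external_ids
        FROM account_external_ids
        GROUP BY email_address
        HAVING COUNT(*) > 1
        ORDER BY email_address
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in duplicate_emails(demo_connection()):
        print(row)
```

Grouping by `email_address` and filtering with `HAVING` keeps only the addresses attached to more than one external id, which is the same idea as the original one-liner.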
[07:56:40] FYI we should start to roll out debmonitor client across the fleet today. It will be very slow and manually controlled, I'll keep an eye on the m2 cluster graphs too of course.
[07:57:07] but wanted to give you a heads up, also in case you see anything strange on m2 feel free to ping me
[07:58:37] cool! thanks for the heads up :)
[08:00:51] FYI i am about to switch dbtree.wikimedia.org from terbium to mwmaint1001. i got the new grants, confirmed they work, document root is very simple and has the PHP file and config with password. tested with apache-fast-test that i get 200 OK.. just flipping the switch in cache::misc and running puppet is left
[08:01:03] you dont have any special monitoring on dbtree or something right
[08:01:41] Monitoring as something that might page?
[08:02:18] yea, something that is looking at dbtree and isnt human users
[08:02:35] I have never seen a page for dbtree, so I would assume we don't
[08:03:24] ok, ill do it and switching back and forth is easy
[08:10:13] merged the cache::misc config, running puppet on misc cp*
[08:10:25] watching apache logs for dbtree
[08:14:12] it makes a connection to google.com each time ...
[08:14:57] it uses google.com/jsapi for visualization-orgchart
[08:16:14] https://phabricator.wikimedia.org/T96499
[08:17:05] ah :)
[08:19:07] mutante: is wasat going to change too?
[08:19:27] need to puppetize that libapache2-mod-php gets installed
[08:19:31] Hauskatze: yes
[08:19:39] * Hauskatze tries to understand where the 1001 comes from
[08:19:44] i wasnt 100% sure about the order though
[08:19:58] 1 means eqiad
[08:20:27] the 1 at the beginning. 1 eqiad, 2 codfw, 3 esams, 4 ulsfo, 5 eqsin
[08:20:35] and 001 is just where you start counting
[08:20:54] room for 999 maintenance servers
[08:20:54] got it!
[08:21:02] thanks
[08:21:24] so wasat is going to be... mwmaint2001?
[08:21:29] correct
[08:21:33] chachi
[08:22:36] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282045 (jcrespo) > I think that if NAME is just the hostname, it should show the config for all the configured HOST:PORT combinations i...
[08:23:09] awww.. fail: dbtree is using mysql_connect
[08:23:19] and that isnt a thing anymore on stretch with PHP7
[08:23:25] need to convert to PDO or mysqli
[08:23:37] PHP Fatal error: Uncaught Error: Call to undefined function mysql_connect()
[08:23:56] i will revert first
[08:26:14] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282069 (Joe) >>! In T197126#4280521, @Volans wrote: > Quick first feedback/questions on the proposal: > >> dbconfig get NAME gets you...
[08:26:18] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282070 (Joe)
[08:28:39] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282078 (Volans) @Joe ack to all your replies, thanks for integrating the suggestions!
[08:31:13] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282085 (Joe) >>! In T197126#4282045, @jcrespo wrote: > After thinking for a while, `pool|depool|warmup` (as an interface, not as the id...
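Editorial sketch of the hostname numbering explained between 08:19 and 08:21: the first digit selects the site and the remaining three digits are a counter starting at 001. The site table is taken from the log; the `parse_hostname` function and its regular expression are hypothetical and only cover names shaped like `mwmaint1001`.

```python
import re

# Site codes as listed in the log at 08:20:27.
SITES = {"1": "eqiad", "2": "codfw", "3": "esams", "4": "ulsfo", "5": "eqsin"}

# Assumed shape: a lowercase role name, one site digit, a three-digit counter.
HOSTNAME_RE = re.compile(r"^(?P<role>[a-z-]*[a-z])(?P<site>\d)(?P<index>\d{3})$")


def parse_hostname(name):
    """Split a name such as 'mwmaint1001' into (role, site, counter)."""
    match = HOSTNAME_RE.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the numbered naming scheme")
    site = SITES.get(match["site"])
    if site is None:
        raise ValueError(f"{name!r} uses an unknown site digit")
    return match["role"], site, int(match["index"])


if __name__ == "__main__":
    print(parse_hostname("mwmaint1001"))
    print(parse_hostname("mwmaint2001"))
```

Older, unnumbered names such as `terbium` do not match the pattern and raise `ValueError`, which mirrors why the new names are easier to map to a site.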
[08:32:18] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282100 (Joe)
[08:35:51] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282109 (Joe) >>! In T197126#4281781, @Marostegui wrote: > This is great! Thanks for getting on with this - these are my first thoughts!...
[08:36:56] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282110 (Joe)
[08:40:14] <_joe_> thanks for the input, jynus marostegui
[08:40:17] <_joe_> very useful
[08:40:33] no, thank you for getting on with this, it will be a massive win
[08:41:28] marostegui: as you recently informally tagged some hosts as 'candidate emergency master', would it be useful to have that info in etcd too?
[08:41:46] I know it's not mw-specific, but kinda related
[08:42:07] volans: indeed, that is a good point
[08:42:32] DBA, Operations, Traffic, WMF-Legal, and 3 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#4282120 (Bawolff)
[08:44:57] <_joe_> adding new fields to a schema is easy
[08:45:01] <_joe_> removing them is hard
[08:45:07] <_joe_> as usual :D
[08:49:43] _joe_ volans, please let's start with a much smaller scope
[08:49:44] it is ok to plan for the future
[08:50:15] <_joe_> what is a smaller scope here?
[08:50:25] <_joe_> I mean I'm happy to implement an MVP
[08:50:36] <_joe_> but I need you and marostegui to tell me what you consider that
[08:51:04] DBA, MediaWiki-Configuration, Operations: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4282151 (Marostegui) >>! In T197126#4282109, @Joe wrote: >> >> This is not likely to happen in a near future, but as we are starting fr...
[08:51:05] sorry, got disconnected
[08:51:10] checking the public logs
[08:51:12] _joe_: For me, a simply: pool/depool slave
[08:51:17] "simply"
[08:51:48] <_joe_> marostegui: well most of the things will need to be done anyways, esp in terms of the data model
[08:52:03] yeah, that is the hardest part I would say
[08:52:13] I was explaining myself: "but with sanity checks and other stuff, let's not overthink right now, because things may change"
[08:52:25] "I hope you understand my fears and my willingness to have very very small sprints- like, make read only by section work first"
[08:52:59] <_joe_> well the point of writing sanity checks now is to create the structure in the code to support them, mostly
[08:53:08] sure
[08:53:20] <_joe_> anyways, I'll see what I can come up with
[08:53:25] just that is the part that will most likely change
[08:53:36] because as we use it, we will make mistakes
[08:53:42] and know what to check :-)
[08:54:09] (or know which ones prevent us from doing things we want to do)
[09:20:12] I've added https://wikitech.wikimedia.org/wiki/MariaDB#Decommissioning_a_mariadb_host_server_checklist
[09:20:20] feel free to add more things to it
[09:20:53] ah nice
[09:21:03] I will add something
[09:21:09] I always forget something
[09:21:39] we could even substitute generic steps on decom checklist (any op part) with that
[09:22:11] I will add some details on some steps later
[10:24:54] jynus: marostegui: shall we discuss (and finalize) goals next week?
[10:25:05] mark: we are at the offsite :)
[10:25:17] ?
[10:25:32] mark: next week during the offsite you mean?
[10:25:44] yes :)
[10:25:50] then yes :)
[10:25:53] hmm need to find a slot before manuel leaves
[10:25:56] what time on wed do you leave?
[10:26:01] let me check
[10:26:16] mark we already kind of decided already
[10:26:37] flight leaves at 3:30pm
[10:26:42] we only need your ok on language and some details if we can have help for some things
[10:26:48] marostegui: so until lunch more or less
[10:26:49] ok
[10:26:54] kind of yes
[10:27:34] what is your proposal?
[10:28:14] mark: line #193
[10:28:24] but there are lots of ideas there, and we need to reduce the scope
[10:28:34] or get help
[10:28:56] from line 193 to 201
[10:29:28] that is what we would like to reach at some point, now, we need to decide what is a doable goal to start with
[10:29:35] as doing all that in a Q is impossible
[10:30:19] i think we need a concrete plan first of all
[10:30:27] is there one already? like a ticket?
[10:31:13] We kinda have one https://phabricator.wikimedia.org/T156461
[10:34:48] right
[10:34:55] that's a ticket, not much of a plan yet
[10:35:09] but at least it breaks it up into some subgoals
[10:35:12] yeah
[10:35:22] that is the thing, we kind of know where we want to go to
[10:35:23] which one do you guys feel is now missed the most?
[10:36:38] Probably line 185
[10:37:12] so that's probably https://phabricator.wikimedia.org/T196366
[10:38:04] yeah, that is it
[10:38:08] right
[10:38:37] so I suppose the most important thing is to get a description on this ticket of what repl.pl does that does not work when the master is unavailable
[10:38:41] and what should happen instead
[10:39:39] i really don't know the details, but I suspect the problem is that it can't check the state of the master e.g. binlog positions etc
[10:39:47] yeah, that is mostly it
[10:39:50] there are many issues
[10:39:55] how deep we want to go
[10:40:02] is what we have not decided yet
[10:40:06] so now you guys should figure out and describe here what it can do and should do instead for the most common cases
[10:40:19] because some things require, for example, setting up a source of truth
[10:40:19] If you check like #188, those are dependencies for line 185 and 186
[10:40:21] and either the goal needs to be or at least include that
[10:42:04] or depend on mediawiki/etc/cumin changes
[10:43:00] so this needs a lot of discussion still
[10:43:05] i propose we have a session on it at the offsite
[10:43:14] sounds good to me
[10:43:19] too bad joe won't make it
[10:43:20] that doesn't help
[10:43:21] +1
[10:43:25] but riccardo can join
[10:43:25] the idea is clear, the depth and dependencies and scope is not
[10:43:28] yeah
[10:43:34] and i'm not sure we can finish that in a week
[10:43:45] we'll see
[10:43:58] it's a very useful and worthwhile goal
[10:44:44] the alternative was backups, but we wanted to wait to have the hardware purchased on Q1 and set it up and have a goal on Q2
[10:44:45] ...and we should probably replace repl with a modern python script along the way :P
[10:44:55] yes, that is the plan
[10:45:02] the hardware has already been ordered to some extent
[10:45:07] ?
[10:45:17] partially
[10:45:23] db specific hardware
[10:45:27] yeah not that
[10:45:35] we may need to do two goals
[10:45:37] two small ones
[10:45:41] so a bit on backups, and some on this
[10:46:01] this, on its minimum expression is doable
[10:46:16] "writing 2 scripts"
[10:46:28] and yeah
[10:46:37] if we do this automation goal I want to make sure we have riccardo involved
[10:46:43] strongly involved
[10:46:49] that was our main question
[10:46:50] and we're hiring him help ;)
[10:46:54] will he have him?
[10:47:01] i have no idea what riccardo/faidon are planning for next quarter
[10:47:03] will find out! :)
[10:47:07] if not, we will reduce it a lot
[10:47:29] yeah, not decided yet, bunch of ideas as always
[10:47:42] same for etcd/app server
[10:47:52] yes
[10:47:58] if they don't have time and have other priorities (understandable)
[10:48:05] we'll see
[10:48:09] we will delay that
[10:48:17] and fallback to backups
[10:48:17] i do consider this pretty important and something we (you!) spend a lot of time on
[10:48:42] the problem with backups is that without hardware there is not much to progress :-/
[10:48:46] of course
[10:48:53] we could do software
[10:48:55] we can start working on ordering that immediately
[10:49:01] it -should- not take long
[10:49:02] yeah, we could work on monitoring
[10:49:04] binary backups
[10:49:04] (alerting)
[10:49:05] (but I said that last time too eh)
[10:49:06] monitoring
[10:49:06] stuff like that
[10:49:11] yes
[10:49:16] binlog backups
[10:49:17] etc.
[10:49:27] that has very little dependencies except our time
[10:49:45] and have it ready for when hardware arrives
[10:50:36] yes
[10:50:45] good thing is
[10:50:53] both automation and backups are part of our annual plans for the next fiscal year
[10:50:55] which starts now
[10:51:13] (even better where they combine ;)
[11:22:31] mark: Check line #182, we have developed a bit the backups goal proposal, instead of the automation one which might be for Q2
[11:22:55] i hope we can make some progress on both if possible
[11:23:51] that looks good yes
[11:23:59] Yeah, but both as goal might be too much, we can always work behind the scenes
[11:25:29] we can reduce the goals so they're not too much
[11:26:01] for example, for the db replica change automation it sounds we have some more discussion and planning to do
[11:26:05] that by itself could be a goal perhaps
[11:26:23] like a plan?
[11:26:32] or some implementation details?
[11:26:40] yeah, to have a concrete plan so it can easily be implemented in Q2
[11:27:03] with /some/ implementation details, as I described above
[11:27:15] so right now the ticket says that repl.pl doesn't work but it doesn't explain why
[11:27:30] so we need to figure out why it doesn't work and how we want the replacement to work instead
[11:27:44] that's largely DBA specific knowledge, too
[11:30:12] yeah, I get it (I think)
[11:30:16] but hey
[11:30:19] So let's discuss that part in the offsite
[11:30:22] it's entirely possible we all sit in a room next week
[11:30:26] and come out with a proposal that works
[11:30:29] before the goal even starts
[11:30:31] who knows!
[11:30:37] so the backups one looks good?
[11:30:40] and then riccardo implements it next week ;p
[11:30:50] backups proposal mostly looks good
[11:30:54] language needs some tweaks still
[11:30:56] yeah
[11:30:56] but i can help with that ;)
[11:31:04] so, let's discuss the other one next week
[11:31:10] * volans hides
[11:33:42] made a few tweaks already, what do you think?
[11:34:09] looks good!
[11:41:07] so how would we detect 'incorrect generation'?
[11:41:49] that is part of the goal, generate stats and detect anomaly states
[11:42:11] which ones are to be seen, but from the top of my mind: 0 size files
[11:42:22] tables with less than 80% of the previous size
[11:42:31] tables with over 20% of the previous size
[11:42:38] objects disappearing
[11:42:55] (new databases, new tables, no databases, no tables)
[11:43:16] based on lists we have of things that are expected and previous week stats
[11:43:34] we were discussing if to do stats on prometheus or on tendril
[11:44:15] that is aside from the obvious thing- backups script returns non-0 state and parsing of log files :-)
[11:44:54] the advance monitoring would be outside and be part of backup testing
[11:45:01] *not part of the scope
[11:45:18] -recovering to the test hosts and let it replicate/run compare.py on them
[11:45:36] right
[11:45:38] that is backup testing and it is a separate goal on itself
[11:45:50] yes agreed
[12:58:00] DBA, Data-Services, cloud-services-team: Maintain-views and maintain_meta-p scripts shouldn't run if mysql-upgrade is running - https://phabricator.wikimedia.org/T184540#4282705 (Marostegui) Lately the way @Bstorm and myself have been working with this is basically depooling the hosts. She'd send th...
[13:00:19] DBA, Data-Services, cloud-services-team: Maintain-views and maintain_meta-p scripts shouldn't run if mysql-upgrade is running - https://phabricator.wikimedia.org/T184540#4282720 (jcrespo) > There is probably not much stuff we can do about this rather than just coordinating and have good communication...
[13:19:05] DBA, Data-Services, cloud-services-team: Maintain-views and maintain_meta-p scripts shouldn't run if mysql-upgrade is running - https://phabricator.wikimedia.org/T184540#4282769 (Marostegui) Open>Resolved
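Editorial sketch of the backup sanity checks listed between 11:42 and 11:43: zero-size files, tables that shrank below 80% of the previous size, tables that grew, and objects appearing or disappearing compared with last week's stats. Everything here is hypothetical illustration rather than the team's implementation: the `find_anomalies` function, the snapshot format (object name mapped to size in bytes) and the table names are invented, and "over 20% of the previous size" is read as growth of more than 20%, which is an assumption.

```python
SHRINK_RATIO = 0.80  # flag objects below 80% of the previous size
GROWTH_RATIO = 1.20  # assumption: "over 20%" read as more than 20% growth


def find_anomalies(previous, current):
    """Compare two {object_name: size_in_bytes} snapshots of a backup.

    Returns a list of (object_name, reason) tuples.
    """
    anomalies = []
    # Objects present last time but absent now.
    for name in sorted(previous.keys() - current.keys()):
        anomalies.append((name, "disappeared"))
    for name in sorted(current):
        size = current[name]
        if size == 0:
            anomalies.append((name, "zero size"))
        elif name not in previous:
            anomalies.append((name, "new object"))
        elif previous[name] > 0 and size < previous[name] * SHRINK_RATIO:
            anomalies.append((name, "shrank"))
        elif previous[name] > 0 and size > previous[name] * GROWTH_RATIO:
            anomalies.append((name, "grew"))
    return anomalies


if __name__ == "__main__":
    last_week = {"enwiki.revision": 1000, "enwiki.page": 500,
                 "enwiki.archive": 200, "enwiki.user": 100}
    this_week = {"enwiki.revision": 700, "enwiki.page": 650,
                 "enwiki.archive": 0, "enwiki.text": 50}
    for name, reason in find_anomalies(last_week, this_week):
        print(f"{name}: {reason}")
```

A check like this only covers the size and object-list heuristics; the script exit status, log parsing and restore testing mentioned in the log are separate concerns.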