[00:41:57] why you banned me?
[06:31:26] DBA, Operations, cloud-services-team: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4030976 (Marostegui) @Andrew which day/time would work for you to get this done?
[08:16:51] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4031106 (Marostegui)
[09:03:32] some time this week I'd like to reboot sarin, it currently runs two screens which seem to be db-related: "s1_check" and "T174569", do you have an ETA for those?
[09:03:33] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[09:03:48] moritzm: those can be deleted, let me do that now
[09:04:09] done
[09:04:27] that was quick, thanks :-)
[09:23:43] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4031295 (Marostegui)
[09:26:43] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4031302 (Marostegui)
[09:48:04] as the backup of s4 is done, I am going to deploy a schema change on s4 on codfw after lunch probably.
let me know if you prefer me to go with eqiad first and leave codfw alone for now :)
[10:10:38] jynus: ^
[10:11:39] cool
[10:11:52] go anywhere
[10:12:00] cool - will do codfw then
[10:37:25] Blocked-on-schema-change, DBA, MediaWiki-Database, Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4031443 (Marostegui)
[10:37:35] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4031444 (Marostegui)
[10:44:58] Blocked-on-schema-change, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, and 2 others: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4031476 (Ladsgroup)
[11:11:03] Blocked-on-schema-change, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, and 2 others: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4031447 (Marostegui) This is not an easy change as it needs to be done direct...
[11:15:47] DBA, Operations: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (Marostegui)
[11:16:17] DBA, Operations, Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031569 (Marostegui) p:Triage>Normal
[11:17:19] Blocked-on-schema-change, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, and 2 others: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4031576 (Marostegui)
[11:17:22] DBA: Rebuild user_newtalk on db1052 - https://phabricator.wikimedia.org/T186503#4031577 (Marostegui)
[11:17:24] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188#4031578 (Marostegui)
[11:17:26] DBA, Operations, Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (Marostegui)
[11:17:54] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4031580 (Marostegui)
[11:17:56] DBA, Operations, Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (Marostegui)
[11:36:34] DBA, Wikidata, Technical-Debt: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#4031612 (Lucas_Werkmeister_WMDE)
[12:03:48] Blocked-on-schema-change, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, and 2 others: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4031684 (Lucas_Werkmeister_WMDE)
[12:08:57] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4031710 (Urbanecm)
[12:12:08] Blocked-on-schema-change,
Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, and 2 others: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4031733 (Lucas_Werkmeister_WMDE) > failover might be happening on Q4, so not...
[12:14:50] Blocked-on-schema-change, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, and 2 others: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4031738 (Ladsgroup) WMF fiscal and annual year starts from July, so it's Apri...
[12:16:20] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4031740 (Marostegui) p:Triage>Normal Let us know when the wiki is created to filter it on labs and apply (or check if we need to apply): T187089 T185128 T153182
[12:17:35] Blocked-on-schema-change, Wikibase-Quality, Wikibase-Quality-Constraints, Wikidata, and 2 others: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4031747 (Lucas_Werkmeister_WMDE) Okay, thanks :) waiting for that seems total...
[12:34:31] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4031760 (Marostegui) In order to replace db1020 (m2 master) and following: https://gerrit.wikimedia.org/r/#/c/399792/3/wmf-config/db-eqiad.php I woul...
[13:26:05] akosiaris: do you have time for a relatively simple question?
[13:26:12] jynus: yup
[13:26:34] I cannot find how to schedule or change the default schedule of a job on bacula
[13:26:49] not a simple question then
[13:26:54] ok :-)
[13:26:55] it's actually a bit of a mess
[13:26:58] it is ok
[13:26:59] since you know puppet
[13:27:04] the actual question is
[13:27:09] and the old language pre 4.x
[13:27:17] what would you like me to do ?
[13:27:19] my change is apparently going to run 1 wed month
[13:27:25] I want to run it weekly
[13:27:37] (it is the database backups, I hope you are ok with it)
[13:27:39] we are talking about https://gerrit.wikimedia.org/r/#/c/416353/ ?
[13:27:46] I haven't reviewed yet
[13:27:50] https://puppet-compiler.wmflabs.org/compiler02/10313/es2001.codfw.wmnet/
[13:27:55] I am only asking about
[13:28:00] Bacula::Client::Job[mysql-srv-backups-latest-Monthly-1st-Wed-production]
[13:28:02] for now
[13:28:50] can I run that weekly, and is it easy?
[13:29:14] it is weekly already
[13:29:18] mm
[13:29:19] badly named on my side
[13:29:20] so
[13:29:26] the actual resource is
[13:29:31] Schedule {
[13:29:31] Name = Monthly-1st-Wed
[13:29:31] Run = Level=Full 1st Wed at 02:05
[13:29:31] Run = Level=Differential 3rd Wed at 03:05
[13:29:32] Run = Level=Incremental at 04:05
[13:29:33] }
[13:29:40] oh!
[13:29:44] :-D
[13:29:52] akosiaris@helium:/etc/bacula/conf.d$ sudo cat schedule-Monthly-1st-Wed.conf
[13:29:54] sorry, the name was misleading
[13:29:58] ofc it is
[13:30:03] sorry to disturb you for that
[13:30:04] you are not the only one to complain about it
[13:30:07] and very correctly
[13:30:26] it ended up growing organically and ofc it was badly set
[13:30:31] it is ok
[13:30:34] that was all
[13:30:56] there are more changes on that changeset, but that was the only thing I was worried about
[13:31:02] I have a question btw
[13:31:06] ok
[13:31:23] so the annual plan has an item about strengthening our backups infrastructure as a whole
[13:31:36] the new annual plan that is
[13:31:40] yes, this is just the minimum thing to make things work
[13:31:53] someone should take over this
[13:31:54] no, it's actually more detailed than that
[13:32:08] we've even set some targets and measurements and all that
[13:32:17] but my question was about your timelines
[13:32:18] yes, of course
[13:32:59] sorry, not sure if the question has finished?
[13:33:00] it's quite generic and it's about my scheduling of time. So when do you think DBA team is going to be ready ?
[13:33:15] like, what's your plan ?
[13:33:19] ready to take over?
[13:33:28] good point... I asked the question wrong
[13:33:33] take #2
[13:33:34] well, DBAs are not going to take over, persistence is
[13:33:58] persistence person should take bacula from you, when they are hired
[13:34:06] so, timewise, what's the rough timeline for database backups. What would you like to have done and by when ?
[13:34:12] timeline?
if it was me, it would have been 3 months ago
[13:34:43] like the new system-- continuously over the next 12 months
[13:34:54] but it shouldn't be your problem
[13:35:05] I mean, we have to train the new person
[13:35:09] and all that
[13:35:18] ok so the new backups hire, me and DBA team are probably gonna have interactions about this through the next 12 months
[13:35:36] yeah we don't know when the new person will show up
[13:35:37] yes, although the less the better?
[13:35:53] as in, handover, and not be your problem anymore, that was my thought
[13:36:03] I am thinking about next Q's goals, and this is why I am asking
[13:36:12] I would not make it a goal
[13:36:25] because we cannot trust on when someone will be hired
[13:36:30] DBA or op
[13:36:55] still, something should be set as a goal at some point in time in the next 12 months
[13:37:02] new person should hopefully show up within the next quarter
[13:37:06] I have a list of TODOs backup wise
[13:37:12] but obviously we cannot make a quarterly goal for next quarter based on that
[13:37:15] and I would like to schedule them to happen at some point
[13:37:22] yeah, but we know things get delayed...
[13:37:26] I may not be the implementer, but still those TODOs need to be done
[13:37:34] ^what mark says
[13:37:50] agreed on that
[13:37:58] maybe next-next quarter then
[13:38:03] akosiaris: I am all for that, precisely hiring a person is to make sure they are done
[13:38:25] it's also explicitly part of the annual plan which starts next-next-quarter
[13:38:26] so that aligns ;)
[13:38:28] I would like also to know your personal involvement? you like backups but don't have the time?
[13:38:36] or you hate them?
[13:38:39] both
[13:38:40] :P
[13:38:50] it's a love-hate relationship
[13:38:52] you know what I am asking
[13:39:05] you want to get rid of them as fast as possible
[13:39:13] no, that's not it
[13:39:14] or could you support the person for longer?
[13:39:21] even if you don't own them
[13:39:25] i think the latter is not optional :P
[13:39:32] ok :-)
[13:39:41] no, the thing I don't want to support is postgres
[13:39:45] ah!
[13:39:54] that one I don't want to have anything to do with tbh
[13:40:12] mark: I agree that not everybody does only things they like
[13:40:21] but at least it should be known
[13:40:49] because goals are coming soon, alex
[13:41:03] they should have already been here
[13:41:14] let's talk soon about that- we cannot do much without the hardware
[13:41:26] and that is scheduled for next fiscal, most of it
[13:41:30] ok
[13:41:35] why are you anxiously waiting for goals alex?
[13:41:45] he wants to own one, it seems
[13:42:37] I have yet to talk to manuel, but the first thing I would do
[13:42:44] from the top of my mind
[13:43:09] is setting up what we did on codfw back on eqiad for redundancy (but only to bacula on 1 site)
[13:43:56] mark: I am thinking of the various things that need to be done and unfortunately our scheduling unit is the goal
[13:44:21] would it help you if I created an etherpad and you start listing stuff ahead?
[13:44:24] akosiaris: that thing, which is not yet even discussed, will not touch bacula at all
[13:45:05] but the scheduling indeed would need a fix to work with the new system
[13:45:20] mark the enabler <== your title if we ever play DnD
[13:45:27] :P
[13:45:45] mark I've already been doing that btw, just not very electronically
[13:45:45] yes i understand etherpad is hard for some
[13:45:47] because now dumps and bacula will be detached- which is not ideal, but it is the best option right now
[13:46:14] jynus: agreed.
but that should not be a lot of work
[13:47:05] ok, but we cannot take more work until there are more people on the team, because right now we cannot even take all dba work
[13:47:47] I had to do the script I sent on review during the weekend to have time
[13:48:07] I don't think it is healthy to do that for longer
[13:48:29] or sustainable
[13:49:23] I was reading yesterday https://landing.google.com/sre/book.html
[13:49:32] and it was an eye opener
[13:50:01] especially https://landing.google.com/sre/book/chapters/eliminating-toil.html
[13:50:17] https://landing.google.com/sre/book/chapters/handling-overload.html
[13:50:33] and https://landing.google.com/sre/book/chapters/dealing-with-interrupts.html
[13:51:24] it would be nice to work towards that in something other than the name
[13:51:50] thank you for confidence
[13:52:09] "eliminating toil" is _literally_ in our annual plan
[13:52:20] so is much of the rest :)
[13:52:23] mark: are you being sarcastic? because I am trying to support that
[13:52:39] my first line is sarcasm, yes
[13:52:45] the rest is not
[13:53:05] I wasn't trying to say we are doing badly, I was trying to push for it
[13:53:16] and help
[14:18:17] I'm tentatively planning to reboot neodymium tomorrow, are there any DBA screens which won't be done by then?
(sarin is back up and can be used again, BTW)
[14:47:22] moritzm: ask manuel, I know he is doing some schema changes, but don't know when or where
[14:47:36] nothing useful mine, if anything
[14:47:36] ok
[15:09:21] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T189124#4032147 (Urbanecm) p:Triage>Low
[15:12:22] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4032185 (Urbanecm)
[15:12:57] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#3985290 (Urbanecm) Just because I do not know if that change something I'm adding it using comment as well: this is going to be a fishbowl wiki.
[15:17:56] moritzm: I will get back to you in a sec
[15:25:30] no hurry at all :-)
[15:29:19] DBA, Operations, ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032253 (Marostegui) I have set to offline 32:2 due to errors. This host has now 2 failed disks. @Cmjohnson do you have some used disks somewhere? at least to replace one of them. We have now 2 spans degr...
[15:40:45] DBA, Operations, ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032277 (Cmjohnson) Let me see what I have for used spare disks
[15:53:02] DBA, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032306 (Marostegui) With the two servers disks failed and the server depooled it is struggling to catch up. It is slowly doing...
[15:53:35] moritzm: for neodymium, I started a schema change on codfw today, that might last till tomorrow morning or maybe evening.
I will get back to you tomorrow if it is finished
[15:55:25] marostegui: no hurry, we can simply postpone to Friday
[15:55:42] moritzm: I expect it to be finished tomorrow maybe mid day
[15:55:46] I will let you know :)
[15:56:33] ok :-)
[16:02:52] DBA, Cloud-Services, User-Urbanecm: Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T189124#4032334 (Marostegui) Let us know when the wiki is created to filter it on labs and apply (or check if we need to apply): T187089 T185128 T153182
[16:54:55] DBA, Operations, cloud-services-team, Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4032482 (Andrew) @Marostegui, I could do it tomorrow or Friday anytime after 15:00 UTC. Next week I'm out Monday, Tuesday, Wednesday.
[16:55:47] DBA, Operations, cloud-services-team, Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4032485 (Marostegui) >>! In T189005#4032482, @Andrew wrote: > @Marostegui, I could do it tomorrow or Friday anytime after 15:00 UTC. Next week I'm out Mon...
[18:13:00] DBA, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032774 (Cmjohnson) @Marostegui I swapped both disks with used disks we had from decommissioned servers. The disks are currently rebuilding. Please resolve this task once it's complet...
[19:19:22] DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4033042 (kaldari) @jcrespo: One thing that hasn't been mentioned is that GlobalPreferences will almost certainly re...