[01:54:12] 10DBA, 10Phabricator (2018-02-xx), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3974522 (10mmodell)
[01:56:11] 10DBA, 10Phabricator (2018-02-15), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3974524 (10mmodell)
[03:01:13] 10DBA, 10Phabricator (2018-02-15), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3974548 (10mmodell)
[05:43:19] 10DBA, 10Operations, 10ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187328#3974644 (10jcrespo)
[05:51:46] 10DBA, 10Phabricator (2018-02-15), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3974649 (10jcrespo) That is not what I asked; I asked you to disable the first one (the alter), not the second one.
[06:13:42] marostegui or jynus: ping me when one of you is available to do the phabricator db switchover
[06:13:54] we are here
[06:13:57] We are both here
[06:14:02] :)
[06:14:08] but did you see my comment?
[06:14:15] * twentyafterfour looks
[06:15:08] disable the alter because you've already done it on the slave?
[06:15:15] yes
[06:15:26] I disabled the other one because it'll run for a long time, causing phabricator to be offline during the update
[06:15:36] yes, but we cannot do anything about that
[06:15:38] but I can disable the other one as well
[06:16:07] we cannot do anything about #2, that has to be executed as is
[06:16:08] phabricator doesn't need the column to be populated, so it should be fine to run it async
[06:16:20] we cannot populate it async
[06:16:31] but I can run the migration after the upgrade
[06:16:31] the migration has to run
[06:16:41] why does it have to run?
[06:16:45] * twentyafterfour is confused
[06:17:09] phabricator won't be broken if it doesn't run, as long as phabricator thinks it ran
[06:17:10] well, if you feel you do not need it, sure, but we are not responsible for that
[06:17:22] I only care about the alter
[06:17:26] right, I've tested it locally on my own phabricator environment
[06:17:31] right
[06:17:40] please disable the alter
[06:17:58] so these are the steps
[06:18:04] we will merge the dns update
[06:18:24] actually, first we will set up the circular replication?
[06:18:55] then set hosts as read only, merge the dns update, wait for it to apply, set db1059 as rw
[06:19:11] (we would set db1043 as read only before all that)
[06:19:44] then you will likely have to restart phab threads due to the thread pooling
[06:19:46] ~/jynus 7:18> then set hosts as read only, merge the dns update, wait for it to apply, set db1059 as rw -> +1
[06:19:49] then you run the migrations you want
[06:20:30] actually, no circular replication
[06:20:43] we will disconnect db1043 as manuel suggested before
[06:20:54] as a rollback plan
[06:21:07] yeah, just a stop slave should be enough
[06:21:09] but please disable the alter first, as it has already been done
[06:21:23] marostegui: we will not change master in the first place
[06:21:34] we just need the coords where to reconnect it
[06:22:06] ah sure
[06:22:16] does phab have a read-only mode that is more graceful than the read-only mode on the database, one that shows an error?
[06:22:21] twentyafterfour?
[06:22:37] also, from which host does phab run?
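A minimal sketch of the switchover steps just agreed on above, written as MySQL/MariaDB statements. The host roles (db1043 as outgoing m3 master, db1059 as the replica being promoted) come from the conversation; everything else here is illustrative, not the exact commands that were run:

    -- on db1043 (outgoing master): stop accepting writes before the dns change
    SET GLOBAL read_only = ON;

    -- on db1059 (replica being promoted): confirm it has caught up, then
    -- disconnect it from db1043 instead of setting up circular replication
    SHOW SLAVE STATUS\G        -- Seconds_Behind_Master should be 0
    STOP SLAVE;

    -- record the binlog coordinates where db1043 could be reconnected
    -- under db1059 if a rollback is ever needed
    SHOW MASTER STATUS;

    -- finally, open the new master for writes
    SET GLOBAL read_only = OFF;

The rollback plan stays cheap this way: point db1043 at the recorded coordinates with CHANGE MASTER TO and flip read_only back.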
[06:22:37] ok fixed
[06:22:45] it does have a read-only mode
[06:22:52] currently it's on phab1001
[06:22:55] we need to check when dns is effective
[06:23:00] do you plan to change it?
[06:23:15] dns may take some seconds to be effective
[06:23:16] we have a phab2001 but it's not ready yet
[06:23:20] ok, no problem
[06:23:35] the phabricator upgrade takes a couple of minutes, during which phabricator has to be shut down
[06:23:51] we need it restarted once before that
[06:24:02] so if we do the swap at the same time then dns should be propagated by the time it needs to start back up. I can manually verify before starting apache
[06:24:12] ok that's fine too
[06:24:30] so we will migrate it to the database with the alter
[06:24:37] so we shut down phab, switch dns, wait, then start again once we verify that it's got the new dns?
[06:24:38] before the whole migration
[06:24:42] ok
[06:24:43] correct
[06:24:49] the migration has to run on the new one
[06:25:00] so it seems we will have 2 interruptions
[06:25:35] that's fine with me, the usual phabricator upgrade has 2-5 minutes of downtime so it shouldn't be much worse than that I guess
[06:25:50] let's downtime the alerts
[06:25:56] beforehand
[06:27:11] https://phabricator.wikimedia.org on phab1001 ?
[06:27:18] the alert, I mean
[06:27:25] also, let's do this on operations
[06:28:16] 10DBA, 10Patch-For-Review, 10Phabricator (2018-02-15), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3974673 (10mmodell)
[06:28:37] I'll take care of icinga
[06:56:10] it wasn't that bad, was it?
[06:56:16] I mean, not precisely agile
[06:56:23] but it is what it is
[07:30:46] 10DBA: test - https://phabricator.wikimedia.org/T187425#3974675 (10mmodell) upgrade: success!
[07:30:50] 10DBA, 10Patch-For-Review, 10Phabricator (2018-02-15), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3974703 (10mmodell) 05Open>03Resolved a:03jcrespo Everything looks good, thanks @jcrespo and @Marostegui
[07:31:01] 10DBA, 10Patch-For-Review, 10Phabricator (2018-02-15), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3965756 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbpro...
[07:44:34] 10DBA, 10Patch-For-Review, 10Phabricator (2018-02-15), 10Release: Upcoming phabricator upgrade requires unusually long database migrations - https://phabricator.wikimedia.org/T187143#3974755 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1003.eqiad.wmnet'] ``` and were **ALL** succes...
[08:17:58] I am going to upgrade mysql and the kernel on db1082 (master for s5 sanitarium); as sanitarium is delayed because of an alter table, it shouldn't be a big deal
[08:18:01] just a heads up
[08:22:09] ok
[08:24:52] all done
[08:46:40] doing a quick check on db1053.eqiad.wmnet before moving it to m3
[08:46:57] 10DBA, 10Patch-For-Review: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888#3974815 (10Marostegui) What is pending here is: - Move dbstore2001 and dbstore2002 under db2034 - Convert db2034 to master role in puppet That should be it. We can th...
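The manual verification mentioned at 06:24:02 can also be done from the database side. A sketch, assuming phabricator connects through a service alias such as m3-master.eqiad.wmnet (the alias name is an assumption following the usual wmf naming; it does not appear in this log):

    -- connect through the alias, e.g. mysql -h m3-master.eqiad.wmnet,
    -- then ask the server which host actually answered:
    SELECT @@hostname;   -- should return db1059 once the dns change is effective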
[08:47:42] we should schedule the s1 and s2 failovers
[08:50:02] or at least decide if we can do that this quarter or the next one
[08:50:22] I was thinking that it might be worth waiting for the dc failover
[08:52:15] the s1 and s2 switchover was part of the original goal
[08:52:26] was it?
[08:52:43] ah no
[08:52:44] the original goal was to decomm, sorry
[08:53:06] it will be part of the next goal, or at least work
[08:53:15] yeah, for the next decomm batches
[08:53:35] waiting makes sense to me
[08:53:47] good
[08:54:05] although we may need the new hosts then to put some as misc
[08:54:20] at least some of them
[09:41:55] so my plan is to use db1053 as a db1043 replacement and db2042 as a db2012 replacement
[09:42:16] db2042 is the one from s1, no?
[09:42:19] yes
[09:42:22] that we discussed yesterday
[09:42:23] good!
[09:44:00] I will take a break while the quick check on db1053 is finishing
[09:44:07] so I am sure no data is lost
[09:44:15] ok!
[10:37:02] hi! are you ok with me merging https://gerrit.wikimedia.org/r/c/410412/ btw? it doesn't have to happen now but ASAP would be nice
[10:38:18] just +1ed
[10:39:28] sweet, thanks marostegui!
[10:39:31] yes, I said yes ok, I just don't have the time to press a button
[10:39:38] sorry
[10:39:54] s/yes/yesterday/
[10:40:27] jynus: no worries
[11:25:06] not sure how closely you're following root mails, but I noticed a SMART error mail (CurrentPendingSector) for db1111 popping up
[11:26:20] we are in a meeting
[11:27:00] no hurry, just wanted to mention it
[11:27:05] thanks
[11:49:40] jynus / marostegui: do you guys have some time to go over what I'm putting in the capex budget for databases now? Just on IRC...
[11:49:45] oh you may be having your meeting
[11:49:48] later today is fine also
[11:49:55] yeah, we are in the meeting
[11:50:03] want to join?
[11:50:13] i'd prefer to do this over irc
[11:50:19] oki
[11:50:24] the hangout last week didn't exactly work better for this ;)
[11:50:30] so i'll wait
[12:29:01] elukey: ping
[12:30:08] we are still in the meeting, and we'd have a question for you, elukey, about dbstore1002
[12:51:24] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#3975296 (10Marostegui) @elukey we have done our math about the dbstore1002 replacement and redundancy for the next two years and this is what we came up with. If we want to keep dbstore100...
[12:51:56] elukey: ^
[13:03:38] <_joe_> hi! I come bearing bad news
[13:04:10] :D
[13:04:13] <_joe_> we need to think of db resources for the staging cluster today, at least in terms of capex. Can any of you two help me with that?
[13:04:14] and then I'll be next
[13:44:19] I already mentioned to mark the needs for staging
[13:44:29] marostegui: <3 - 6 hosts??? /o\
[13:44:53] no, 3 or 6
[13:45:09] not from 3 to 6
[13:45:22] depending on whether you need redundancy or a refresh
[13:45:25] staging: 2 hosts per dc
[13:45:42] that is to have something similar to beta
[13:45:45] jynus: I didn't say from 3 to 6
[13:46:06] ah, I confused <3 with less than 3
[13:46:12] hahahaha sorry
[13:46:16] my fault
[13:46:24] it was only wikilove to Manuel
[13:46:30] not to me???
[13:46:48] of course it was for you too! :D
[13:46:51] ah!
[13:47:08] :D
[13:47:30] <_joe_> jynus: can you expand a bit? 3 servers would give us what in terms of redundancy?
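For context on the T159423 comment above: a multi-source host like dbstore1002 is a single MariaDB instance replicating many sections at once under named connections, roughly as in this sketch (the master host, section name and coordinates are illustrative placeholders, not values from this log):

    -- one named replication connection per section, all inside one mysqld
    CHANGE MASTER 's1' TO
      MASTER_HOST='db1052.eqiad.wmnet',   -- placeholder s1 master
      MASTER_USER='repl',
      MASTER_LOG_FILE='db1052-bin.001234',
      MASTER_LOG_POS=4;
    START SLAVE 's1';
    SHOW SLAVE 's1' STATUS\G

The multi-instance alternative being discussed moves each section into its own mysqld process with its own port and datadir, so one section's load or crash no longer affects the others.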
[13:47:35] _joe_: regarding staging, 2 per dc was to have something like beta
[13:47:58] define the new staging (which I understand you also do not have clear) and I could be more accurate
[13:48:06] <_joe_> jynus: ahaha yeah
[13:48:08] e.g.
[13:48:19] what do you think we should simulate
[13:48:24] <_joe_> jynus: "magic pixie powder that will solve all of our problems"
[13:48:25] the most accurate
[13:48:35] <_joe_> that's the best definition I have for now :P
[13:48:35] "one instance of each kind"
[13:48:44] "replication issues"
[13:48:48] "etc."
[13:49:08] <_joe_> I think we should definitely have replication, but I think we don't need sections, for instance
[13:49:10] we obviously have to compromise on something, but we said at least 2 large servers
[13:49:18] <_joe_> ok
[13:49:27] because we put several instances on them AND have replication
[13:49:47] so one es, one s*, one x1 and a replica of each would be ok, I guess
[13:49:49] I think we can have sections with multi-instance on single hosts
[13:49:55] and definitely a replica
[13:50:16] to have the same read/write split
[13:56:09] maintaining sections would help getting example data into it, right
[13:56:14] I mentioned to joe that realistically we cannot have more than 4 instances for that machine size
[13:57:10] so that is 8 "enwiki-sized" database instances for 2 machines
[13:57:57] or something like 4 s sections + 2 pcs + 2 ess
[13:58:29] with some compromise on actual content
[14:00:49] I will add that as a comment
[14:00:55] o/
[14:00:59] back from lunch
[14:01:03] * marostegui reads
[14:01:35] add that where?
[14:05:45] i'll have lunch now
[14:05:51] shall we verify the database capex numbers after that?
[14:06:05] ok
[14:06:44] good for me too
[14:07:29] it is difficult to say. maybe MCR would want to join us and have some extra hardware to have enwiki + wikidata + commons prototypes
[14:08:09] maybe we can compromise on performance and buy more disk but worse databases
[14:24:04] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3975522 (10Marostegui) 05stalled>03Open a:05Marostegui>03Cmjohnson Hi, this can now proceed with the current hostname (db1115) Thanks!
[14:24:06] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3975526 (10Marostegui) 05stalled>03Open a:05Marostegui>03Cmjohnson Hi, this can now proceed with the current hostname (db1115) Thanks!
[14:27:46] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3975533 (10Marostegui) Hi, After a chat with @jcrespo we have agreed to rename this host to a normal dbXXXX one, so please can we rename it to: db2093 Please make sure that the...
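Since the staging sizing above leans on the same multi-instance pattern (several mysqld processes per machine), one practical detail is worth a sketch: instances on one host differ only in port and datadir, so you can always confirm which one a session landed on (the hostname, port and path shown are made-up examples):

    SELECT @@hostname, @@port, @@datadir;
    -- e.g. staging1001 | 3311 | /srv/sqldata.s1/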
[14:29:04] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2093 (WAS: rack/setup/install tendril2001) - https://phabricator.wikimedia.org/T186123#3975536 (10Marostegui)
[15:15:04] MCR hasn't requested any extra hardware for next year
[15:15:08] well
[15:15:12] they did contact us
[15:15:19] and said they shouldn't need any extra hardware for next year
[15:15:26] \o/
[15:15:51] so in the event that does change, i'm assuming they will sort it out then
[15:17:43] i've shared the capex sheet with you guys
[15:17:48] and I will be back in 30 minutes to discuss it
[15:17:57] so feel free to take a look already if you want
[15:18:07] yellow means: more to do (e.g. verify whether this is correct...)
[15:18:35] I still don't have the link
[15:19:10] got it
[15:29:11] 10DBA, 10Data-Services, 10Security-Team, 10cloud-services-team, and 2 others: Totally exclude 'abusefilterprivatedetails' from 'logging' in public replicas - https://phabricator.wikimedia.org/T187455#3975715 (10MarcoAurelio)
[15:41:03] for the backups figures, I guess buying all in Q1 was just an example, right? :-)
[15:45:28] yes
[15:45:31] if you feel differently
[15:45:35] feel free to change or comment
[15:45:52] but i'm hoping we have the backups position filled by then
[15:45:59] so at least /some/ hardware can get started then ;)
[16:03:55] 10DBA, 10Data-Services, 10Security-Team, 10cloud-services-team, and 2 others: Totally exclude 'abusefilterprivatedetails' from 'logging' in public replicas - https://phabricator.wikimedia.org/T187455#3975888 (10bd808) 05Open>03Invalid Based on [[https://phabricator.wikimedia.org/diffusion/EABF/browse/m...
[16:05:30] 10DBA, 10Data-Services, 10Security-Team, 10cloud-services-team, and 3 others: Totally exclude 'abusefilterprivatedetails' from 'logging' in public replicas - https://phabricator.wikimedia.org/T187455#3975897 (10MarcoAurelio) Sorry to edit-conflict. Added a patch to mark the table explicitly excluded in the...
[16:06:44] 10DBA, 10Data-Services, 10Security-Team, 10cloud-services-team, and 3 others: Totally exclude 'abusefilterprivatedetails' from 'logging' in public replicas - https://phabricator.wikimedia.org/T187455#3975903 (10MarcoAurelio) Also, this won't be available via data dumps, etc? Sorry to be paranoid, but bette...
[16:10:43] 10DBA, 10Data-Services, 10Security-Team, 10cloud-services-team, and 3 others: Totally exclude 'abusefilterprivatedetails' from 'logging' in public replicas - https://phabricator.wikimedia.org/T187455#3975916 (10bd808) >>! In T187455#3975903, @MarcoAurelio wrote: > Also, this won't be available via data dum...
[16:13:39] thanks for explaining, bd808
[16:18:23] 10DBA, 10Wikimedia-Site-requests: Tokyogirl79 → ReaderofthePack - https://phabricator.wikimedia.org/T187458#3975937 (10Cyberpower678) p:05Triage>03Normal
[16:23:19] <_joe_> @dbas: do we do backups of parsercache?
[16:23:28] <_joe_> or do we not bother?
[16:23:47] not only do we not do backups, we do not care if they are lost
[16:23:50] no we don't
[16:24:22] it would be quite strange for memcache + parsercache on 2 dcs to be lost
[16:24:45] <_joe_> ok cool :)
[16:24:47] even if everything were lost over 2 clusters on 2 dcs, things would just be very slow for 10 minutes
[16:24:52] (just that)
[16:27:10] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T187419#3976006 (10Marostegui) Thanks - it is rebuilding ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 30% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) ```
[17:00:39] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#3976202 (10Marostegui)
[19:05:19] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#3976697 (10elukey) First of all thanks a lot for all the work put into figuring out a good config! I had a chat with my team and we would prefer not to add budget for all the 6 nodes, bu...
[19:20:05] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#3976734 (10Marostegui) So just to be clear, you just want 3 hosts and also to decommission dbstore1002 - meaning that there will be no redundancy for the service. Is that understanding cor...
[19:26:51] 10DBA, 10Wikimedia-Site-requests: Tokyogirl79 → ReaderofthePack - https://phabricator.wikimedia.org/T187458#3976749 (10alanajjar) @Cyberpower678 & @Nihlus Please update the description like T185795
[20:35:39] 10DBA, 10Wikimedia-Site-requests: Tokyogirl79 → ReaderofthePack - https://phabricator.wikimedia.org/T187458#3977044 (10MarcoAurelio) a:03Nihlus
[20:45:38] 10DBA, 10Wikimedia-Site-requests: Global rename of Tokyogirl79 → ReaderofthePack: supervision needed - https://phabricator.wikimedia.org/T187458#3977101 (10Nihlus)
[22:58:50] 10DBA, 10Data-Services, 10Security-Team, 10cloud-services-team, and 3 others: Totally exclude 'abusefilterprivatedetails' from 'logging' in public replicas - https://phabricator.wikimedia.org/T187455#3977402 (10Huji)
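On the T187455 thread that runs through the end of this log: the standard way to keep rows out of the public wiki replicas is a filtered view on top of the private table, so the exclusion discussed there amounts to a WHERE condition. A minimal sketch of the idea, assuming log_type is the discriminating column for 'abusefilterprivatedetails' entries; the database names and column list are illustrative, not the actual maintain-views definition:

    -- the private table stays complete; the public side only ever sees the view
    CREATE OR REPLACE VIEW enwiki_p.logging AS
      SELECT log_id, log_type, log_action, log_timestamp, log_title
      FROM enwiki.logging
      WHERE log_type NOT IN ('abusefilterprivatedetails', 'suppress');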