[06:21:08] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3795470 (10Marostegui) 05Open>03Resolved Thanks @Papaul! ``` root@db2055:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337C9270) Port Name: 1I Por... [07:38:38] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3795610 (10Marostegui) [09:02:31] Any objection to deploying the refactored comments schema change on s3 db1072 (with replication so it replicates to db1095 and labs)? [09:10:54] no, why? [09:11:36] no, just asking if you were going to do something with db1072/db1095 :) [09:11:57] no, that was done already [09:12:10] I would get rid of db1044, that's all [09:12:19] one host less to schema change [09:12:28] yeah, I will not do it on db1044 :) [09:12:58] also I would get rid of db2085 and db2092 [09:13:08] that is a total of 3 hosts less [09:13:39] yep, that will be done too [09:14:03] what should we put as master, db2036? [09:14:43] yeah, that or db2057 maybe [09:14:50] will db2036 be out of warranty soon? [09:19:16] maybe [09:22:15] just checked, db2036 is already out of warranty [09:22:57] and db2057 will expire in January 2018 [09:41:35] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3795845 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1096.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017112... [09:59:32] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3795881 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1096.eqiad.wmnet'] ``` and were **ALL** successful.
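The RAID check in the ticket at the top of the log (`hpssacli controller all show config` on db2055) can be scripted rather than eyeballed. Below is a minimal sketch that parses that style of `Status:` output and reports anything not `OK`; the sample text and component names are illustrative, not real db2055 output.

```python
# Hedged sketch: parse `hpssacli`-style status output and flag any
# component whose status is not OK. SAMPLE is illustrative data, not
# actual output from db2055.

SAMPLE = """\
Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: Failed (Replacement required)
"""

def degraded_components(output):
    """Return (name, status) pairs for every 'X Status: Y' line where Y != OK."""
    bad = []
    for line in output.splitlines():
        line = line.strip()
        if "Status:" in line:
            name, _, status = line.partition("Status:")
            status = status.strip()
            if status != "OK":
                bad.append((name.strip(), status))
    return bad

if __name__ == "__main__":
    for name, status in degraded_components(SAMPLE):
        print(f"DEGRADED: {name} -> {status}")
```

A check like this is roughly what the monitoring alert that opened T181266 does: anything other than an all-OK report pages the DBAs.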
[10:18:19] let's try to merge the s8 patch [10:18:25] I am getting tired of rebasing [10:22:33] ok [10:22:35] let me review it [10:23:17] There is one thing though [10:23:45] there are probably many things [10:23:46] db1099:3318 is helping in s5 right now, as I am building db1096 again (it is getting all the data copied back now, after the reimage) [10:24:00] I am not saying to deploy as is, but to actively try :-) [10:24:24] ok, maybe we can wait? [10:24:30] but let's review it [10:25:03] yeah, I will review it [10:27:44] in any case, that is one change that could be added [10:27:57] the main issue is the definition and the bulk of the servers [11:36:18] I think I fixed your comments [11:36:30] jynus, marostegui: do you have any current long-running mysql tasks running on neodymium? if not, I'd ping the _security channel next to ask whether everyone is fine with a reboot [11:36:43] not me [11:36:49] but let's not do more patches except on top of that patch, it is a real pain to rebase [11:37:45] do you want to deploy that change? [11:37:58] if so, there is still one minor thing to fix I would say [11:38:10] yes, comments are more than welcome [11:38:13] ok! [11:38:15] will do now [11:38:15] it is unrelated patches [11:38:22] that create overhead [11:38:40] let's pause those until we are happy with this [11:38:48] otherwise it is a moving target [11:38:55] I have to manually rebase every time [11:39:22] ah no actually it is okay [11:39:24] let me give it another review [11:39:34] yes, I am checking with etherpad, too [11:41:02] latest version is not ok [11:41:06] it pools db1110 [11:56:37] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3796157 (10Marostegui) s3 hosts done (I will keep updating this comment with all the hosts): [] labsdb1001.eqiad.wmnet [] labsdb1003... [12:03:48] how do we do the s8 deploy? [12:04:03] we deploy and test on a canary?
[12:04:13] I mean, either it fails completely or not at all? [12:04:24] Yeah, I would expect a complete breakage [12:04:25] or nothing [12:04:37] technically it is a noop [12:04:44] technically, yes XD [12:14:41] can you do it now, or should I take lunch? [12:15:20] Whatever you prefer, we can do it now or after 5pm? (I was planning to go to the gym around 3:30 and be back around 5pm) [12:15:33] let's do it now [12:15:38] ok! [12:15:41] I think it is a safe change [12:15:47] if tested properly [12:15:48] I think so too [12:16:28] I've opened kibana [12:16:35] and will test on mwdebug1001 [12:16:48] I've got kibana open too [12:16:50] with fatals [12:17:08] I will generate traffic locally before wide deployment [12:17:19] moving to -ops [12:17:22] db1071 is also in read only (just in case) [12:17:24] yep [12:38:13] it is on mediawiki, too [12:38:20] I will check if there are any read-only errors [12:39:29] there were some read-only errors [12:39:43] db1071 related? [12:40:04] maybe mwdebug is read only on purpose [12:40:13] it is not ongoing [12:40:27] Ah could be mwdebug [12:40:31] 29 errors [12:40:33] I set it to read-only on my browser specifically [12:40:36] so it could have been me [12:40:38] ah! [12:40:41] cool [12:40:45] :) [12:46:41] so I think we are good [12:47:11] I believe so too [12:47:12] we can now do all kinds of deployments on top of that [12:47:15] I can't see anything [12:47:22] And I can browse fine [12:47:29] Good job! :) [12:47:31] so I do not have to reconstruct s5 loads every time [12:47:38] that was bothering me [12:47:40] yeah [12:47:41] haha [12:47:41] and led to errors [12:48:04] also, if s5 fails during Christmas, we can failover to s8 :-) [12:48:19] hahaha [12:48:21] indeed!!
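The canary routine used above (watch kibana fatals for read-only errors while testing on mwdebug1001, then widen the deploy) can be sketched as a simple gate. The log lines and the error marker here are assumptions for illustration, not MediaWiki's actual log schema; the `--read-only` substring does appear in MySQL's error 1290 message.

```python
# Hedged sketch of the canary check discussed above: before widening a
# deployment, scan recent fatal/exception log lines for writes rejected
# by a read-only server. Sample lines are illustrative, not real logs.

READ_ONLY_MARKER = "--read-only"  # substring of MySQL error 1290's message

def count_read_only_errors(lines):
    """Count log lines that look like writes rejected by a read-only server."""
    return sum(1 for line in lines if READ_ONLY_MARKER in line)

def canary_ok(lines, allowed=0):
    """Gate: proceed with the wide deployment only if the canary stayed clean."""
    return count_read_only_errors(lines) <= allowed

sample = [
    "db1071 Error 1290: The MySQL server is running with the --read-only option",
    "db1071 SELECT ... 0.02s OK",
]
print("canary ok:", canary_ok(sample))
```

As the conversation shows, the gate still needs a human in the loop: the 29 errors turned out to be self-inflicted (a browser session deliberately set to read-only), not a real regression.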
[14:21:45] technically the s8 hosts were created, just not with eqiad included [14:21:56] ok [14:22:06] was already [14:22:31] just saying because if not, something wrong was happening [14:22:42] (file deleted or something else) [14:22:52] I have amended the commit message [14:23:50] I wonder about dbstores? [14:24:01] do we create s8 there after the split? [14:24:02] yep, was wondering about that too when I was adding eqiad hosts [14:24:19] I guess so yeah, we have to do it anyways for db1095 [14:24:26] I think there is nothing to do on dbstore1001/2 [14:24:36] We have to create the replication channel only [14:24:37] but yes for dbstore2001 or 2002 [14:25:20] we would need to create the instance no? [14:25:28] yes [14:25:40] but I do not want to create it now, for the same reasons as db1095 [14:25:47] nah [14:25:48] no need to [14:25:52] we take s5 [14:25:55] we duplicate it [14:26:01] and then copy it [14:26:01] actually [14:26:10] we can do that now if it is separate instances [14:26:21] we take s5, we duplicate all of it [14:26:33] and we have it doubled for some days [14:26:47] except replicating from a different master [14:26:50] yeah, but we would need replication filters and to delete either wikidata or dewiki, or otherwise I don't think it fits [14:26:55] no [14:27:01] I literally mean replicated [14:27:03] do we have enough disk space? [14:27:04] :-) [14:27:08] I can check [14:27:15] I think space was not the problem [14:27:22] as much as IOPS/memory [14:27:48] we can do that after the split [14:28:00] we copy it [14:28:08] start replicating from the right position [14:28:17] and drop the corresponding databases [14:28:25] so no gain in doing it now [14:28:38] but I don't trust it will work due to dbstore2001 storage/memory [14:29:10] yeah, let's do it after the split [14:30:23] gym time for me [14:30:42] I will go over other pending tasks not related to the goal [14:30:49] Cool!
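The post-split plan agreed above (copy the s5 data, point the new instance at the s8 master from the right position, then drop the databases that now live in the other section) maps onto MariaDB's multi-source replication, which names each connection. The sketch below generates the statements; the host name, binlog coordinates, and database list are placeholders, not production values.

```python
# Hedged sketch of the "copy, repoint, drop" split plan discussed above,
# using MariaDB multi-source syntax (named connections). All names and
# coordinates below are placeholders, not real WMF hosts or positions.

def s8_split_statements(master, log_file, log_pos, drop_dbs):
    """Render the SQL to attach a copied instance to the s8 master and
    drop the databases that stayed behind in s5."""
    stmts = [
        f"CHANGE MASTER 's8' TO MASTER_HOST='{master}', "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};"
    ]
    # Drop before starting replication so no filtered-out writes arrive.
    stmts += [f"DROP DATABASE `{db}`;" for db in drop_dbs]
    stmts.append("START SLAVE 's8';")
    return stmts

for stmt in s8_split_statements("s8-master.example", "bin.001234", 4, ["dewiki"]):
    print(stmt)
```

This matches the point made in the chat: once the copy replicates only from the s8 master, no replication filters are needed, because events for the dropped databases no longer flow over that channel.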
[14:31:08] I have a draft of the announcement ticket ready, I will review it when I am back and create the task [16:41:31] 10DBA, 10Analytics, 10Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3796962 (10Nuria) [18:10:20] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3797364 (10Marostegui) db1096:3315 is now replicating [18:44:23] <_joe_> jynus, marostegui still around? [18:44:45] <_joe_> is there a phab task about the mediawiki lb issues I can point to at the techcom meeting today? [18:44:52] <_joe_> I want to raise the issue there as well [18:45:12] <_joe_> I guess it's a bit late now :( [18:47:17] there is [18:47:58] _joe_: https://phabricator.wikimedia.org/T180918 [18:48:15] <_joe_> jynus: thanks [18:48:30] <_joe_> anything else I should stress that's not in the ticket? [18:48:50] well, I would not stress much the ticket itself [18:49:04] but how open they are to, not on mediawiki, but on WMF [18:49:21] <_joe_> to ditch lb and go with an external load balancer? [18:49:25] to set up some external proxying to avoid solving the problem ourselves [18:49:30] <_joe_> ok :) [18:49:31] it is not that easy [18:49:42] there is some logic that no lb will give us [18:49:45] but you get the idea [18:49:51] or if they have other ideas [18:50:00] <_joe_> yeah I said "external load balancer" to mean "an external implementation of what the mw lb does" [18:50:01] basically testing the waters [18:50:25] and looking at how other people are doing it [18:50:39] <_joe_> jynus: I guess this could be a matter for writing an rfc once we're convinced on where we want to go, and discuss this with the techcom [18:50:55] well, I would comment it as a first approach [18:50:59] I am open to the rfc [18:51:07] but not if everybody disagrees, etc.
[18:51:16] <_joe_> right [18:51:24] <_joe_> well, we kinda have a say in the matter :) [18:51:38] my last comment is T180918#3783354 [18:51:50] that kind of summarizes how I would go forward [18:52:08] <_joe_> ok, I'll bring this to the attention of the techcom [18:52:12] note it is a security issue, we can make it public [18:52:16] <_joe_> I see there is a custom policy access [18:52:23] <_joe_> I can just subscribe people [18:52:42] I cannot say how much of a security bug that is [18:52:53] if it is just a really bad bug [18:53:01] or it is really a security one [18:53:58] <_joe_> yeah it might give an attacker a way to think of a very effective DOS without much resources [18:54:23] <_joe_> but that might be said of a lot of tickets [18:54:27] yeah [18:54:43] <_joe_> so, not sure either [18:54:46] so my big fear is that I do not know the position of some key people [18:54:59] or if they can think we are not valuing their code [18:55:43] technically most of the load balancer would stay, it is the failover that needs changing [18:55:59] *down detection [18:56:03] <_joe_> well, if the mw lb lacks some features that other tools grant [18:56:26] <_joe_> and we know those tools are industry standards or widely used at scale [18:56:47] it is not that clear: mw also coded in some really nice things that we will not get on the general solution [18:56:50] <_joe_> it's reasonable to propose to use those at the wmf instead of the lb mediawiki gives you [18:57:02] or in some cases, it will require coding it as a proxy handler [18:57:10] on the other side, no one is maintaining that [18:57:35] and doing it at the ops layer, at least I (we) can maintain it on top of a generic software [18:58:00] mention it, we'll see the reaction [18:58:01] <_joe_> yeah one issue I have with an in-software lb is [18:58:09] <_joe_> it will need to keep state to be effective [18:58:19] <_joe_> and that's going to get very tricky, very fast [18:58:26] <_joe_> in php moreso [18:58:37] yes, it is not only up or down [18:58:47] it can be up, but some kind of degraded [18:59:03] additionally, we may be just moving the problem [18:59:11] if php has issues with network errors [18:59:20] php as in, our specific stack [18:59:42] a proxy will solve the mysql connection handling, but we still have to connect to the proxy :-) [19:00:08] anyway, throw it there, generically, and give the general idea we discussed (you, manuel and mark) [19:00:25] with focus on future maintainability [19:44:20] <_joe_> sorry I was pinged elsewhere then went to dinner, but thanks I got all the info I need [20:58:08] 10DBA, 10Data-Services, 10Goal, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#2546917 (10bd808) [22:11:49] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 3 others: Deploy dropping wb_entity_per_page table - https://phabricator.wikimedia.org/T177601#3798392 (10bd808)
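The down-detection point made above (a replica is not simply up or down; it can be up but degraded) can be sketched as a three-state health classification. The states, inputs, and thresholds below are illustrative, not the MediaWiki LoadBalancer logic or any production proxy's configuration.

```python
# Hedged sketch of the point discussed above: down detection needs more
# than a binary up/down check, because a replica can be reachable yet
# degraded (errors, replication lag). Thresholds here are illustrative.

def classify(reachable, error_rate, lag_seconds,
             max_error_rate=0.05, max_lag=30):
    """Classify a replica as 'up', 'degraded', or 'down'."""
    if not reachable:
        return "down"
    if error_rate > max_error_rate or lag_seconds > max_lag:
        # Still answering, but should receive less (or no) traffic.
        return "degraded"
    return "up"

for state in (classify(True, 0.01, 5),
              classify(True, 0.0, 120),
              classify(False, 0.0, 0)):
    print(state)
```

This is also where _joe_'s concern about state bites: to compute an error rate or lag trend, the balancer must remember past observations across requests, which is awkward in a share-nothing PHP request model and is one argument for moving the logic into a long-lived external proxy.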