[02:16:07] 10DBA, 10Core Platform Team, 10MediaWiki-Redirects, 10MediaWiki-Revision-backend, 10Wikimedia-production-error: Unable to create redirect on dewiki - fatal DBQueryError - https://phabricator.wikimedia.org/T220353 (10Krinkle) >>! In T220353#5095260, @Marostegui wrote: > So this doesn't really like somethi...
[05:08:53] 10DBA, 10Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (10Marostegui)
[05:16:47] 10DBA, 10Core Platform Team, 10MediaWiki-Redirects, 10MediaWiki-Revision-backend, 10Wikimedia-production-error: Unable to create redirect on dewiki - fatal DBQueryError - https://phabricator.wikimedia.org/T220353 (10Marostegui) Any related code change that could have touched that function?: https://gerri...
[05:34:17] 10DBA, 10Goal: Decommission dbstore1001, dbstore2001, dbstore2002 and es2001-4 hosts* - https://phabricator.wikimedia.org/T220002 (10Marostegui)
[05:34:19] 10DBA, 10Goal: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (10Marostegui)
[05:34:26] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Marostegui) 05Stalled→03Open Opening as the hosts arrived
[05:34:45] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Marostegui) p:05Triage→03Normal
[07:27:51] jynus: 40% of the mw servers have the new config, so I am going to give it some more time and will do a full deploy everywhere
[07:28:56] +1
[07:29:04] \o\ |o| /o/
[07:52:33] 10DBA, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (10Marostegui) After all the controlled changes on chunks of mw servers, the key have been changed...
[08:44:35] 10DBA, 10Patch-For-Review: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Marostegui) I will do a x1 codfw dc failover at some point, so we can depool db2034 too (as db2033 is now on DCOps hands for decommissioning). So we'd freed up 4u for the new hosts. It does...
[09:05:56] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[09:06:03] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) db2037, m5 codfw master: ` root@db2037:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380312088E0) Port Name: 1I Port Name: 2I G...
[09:34:37] dear DBA team, I just sent you an email
[09:36:07] thanks for the heads up
[10:01:17] poor dbstore2001:3318 is taking ages to catch up
[10:01:55] https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&var-port=13318&from=now-2d&to=now
[10:17:30] I can stop the other replicas
[10:17:50] no, it is ok, I am just used to the SSDs :)
[12:17:05] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T202367 (10Marostegui) I think we should try, for now, to use these proxies to replace the current ones (on a 1:1 basis as proposed here T202367#4770331, at least leave the active ones runn...
[13:44:02] hi team
[13:44:09] what are db2014, db2020, db2021, db2022, db2024, db2031 supposed to be replaced with?
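The predictive-failure ticket above (T208323) is about drives the RAID controller flags as likely to fail before they actually die. As a rough sketch of what that check looks like, the snippet below filters `hpssacli controller all show config` style output for drives in a "Predictive Failure" state; the sample text, bays and capacities are illustrative, not real data from db2037, and on newer HPE tooling the binary is `ssacli` rather than `hpssacli`.

```shell
# Sketch: list physical drives flagged "Predictive Failure" in the output
# of `hpssacli controller all show config`. The sample below is
# illustrative, not real output from db2037.
list_failing_drives() {
  grep 'physicaldrive' | grep 'Predictive Failure'
}

sample='   array A (SAS, Unused Space: 0 MB)
      physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SAS, 600 GB, Predictive Failure)'

# On a real host this would be:
#   hpssacli controller all show config | list_failing_drives
printf '%s\n' "$sample" | list_failing_drives
```

A filter like this can feed a monitoring check; the exact line format varies between hpssacli and ssacli versions, so the grep patterns may need adjusting.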
[13:44:29] those don't exist anymore
[13:44:59] https://phabricator.wikimedia.org/T176243
[13:45:18] yeah, that's why I was asking
[13:45:24] but I couldn't find the decom task
[13:45:31] the <= db2031 didn't make it easy to search for :)
[13:45:35] haha
[13:46:10] those servers were weird because in some cases they were named as dbXX
[13:46:32] I think moved from tampa or something
[13:47:57] and some weren't appropriately tracked
[13:48:47] yeah I'm asking because I'm looking for ancient gear
[13:49:27] in what context, searching for things that haven't been appropriately updated?
[13:49:41] yeah basically
[13:49:43] for capex planning
[13:49:57] yeah, codfw dbs is where you have to look :-D
[13:50:11] but that is already being handled
[13:50:23] the oldest things we have are probably
[13:50:28] dbstore* hosts
[13:50:32] and dbproxy*
[13:50:43] both have already purchased replacements
[13:51:26] also es100[1-4]
[13:52:08] tracked on T202367 and T220002
[13:52:08] T202367: Productionize dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T202367
[13:52:09] T220002: Decommission dbstore1001, dbstore2001, dbstore2002 and es2001-4 hosts* - https://phabricator.wikimedia.org/T220002
[13:52:48] there are also the older db1050-60, but those also have replacements ready
[13:52:58] I actually commented on T202367 :)
[13:53:05] earlier
[13:53:47] so the only thing that is pending is the codfw db20* hosts, some of which are going to be purchased, hopefully this Q and the next, next year
[13:54:54] paravoid: in general, regarding old hardware, you don't need to worry too much about dbs, mark has praised us for having things mostly clear
[13:55:33] (a different thing is time for setting up new hosts :-P), or hosts we don't know about
[13:56:22] ^hope that helps you with planning
[13:58:10] ask if you need more clarification, sometimes the replacements are not a clear 1:1 (e.g. with the backup refactoring)
[14:01:03] marostegui: I saw, I would +1 but I am still not convinced it will work; I don't think we have the time to work on those anyway
[14:01:22] what do you mean it will not work?
[14:01:36] (I agree we don't have the time, it was a comment about: hey, let's do that)
[14:02:05] having enough machines using a single port AND having redundancy
[14:02:35] not sure I am following, I suggested replacing at least the active ones with the new hardware on a 1:1 basis
[14:02:47] but we have 11 old servers
[14:03:05] yeah, but not all of them are active
[14:03:12] and 6 new
[14:03:18] so it is not 1:1 :-)
[14:03:18] so my proposal is to get _some_ replaced, to at least advance
[14:03:24] ah!
[14:03:26] :-D
[14:03:31] https://phabricator.wikimedia.org/T202367#5097096
[14:03:40] "at least leave the active ones running"
[14:03:42] active as in master
[14:03:43] I don't disagree, I just say to delay the decision
[14:03:51] until someone can work on that
[14:04:28] I would also replace at least one or two to see if they actually work fine under "load"
[14:04:54] and out of the 6, 2 probably need network changes (rack changes?)
[14:04:54] Again, I am not saying to do it now or next week
[14:05:06] Don't know, I haven't checked where they are
[14:05:11] to handle labs traffic
[14:05:33] Again, it was a comment to give my opinion, not planning to work on that any time soon
[14:05:38] We don't have the time
[14:05:58] that is why I didn't comment, if you said "I am going to work on that now"
[14:06:04] I would say, sure, go on
[14:06:20] but if not, I prefer to also delay the decision
[14:06:34] But I still think replacing m1 just to put a new host there wouldn't be a bad idea
[14:06:45] it wouldn't be the first time we discover a host going down as soon as it gets load
[14:06:52] I don't think we disagree
[14:07:44] I just don't want to rush or work twice :-D, so "no comment" on my side, that is my comment :-P
[14:08:01] oh no, I am not going to rush or work twice, because I won't even work once! :)
[14:08:08] lol
[14:08:37] so I think we should ask Manuel in 1 year, he will know more
[14:08:52] haha
[14:09:06] you understand what I mean with "no comment"?
[14:09:07] At some point we also have to evaluate if m5 will ever have a proxy
[14:09:40] Jaime in one year will know that
[14:09:43] xddddd
[14:10:09] realistically, we only need 3 proxies + 3 spares + 3 for labs
[14:10:17] so 6 for misc + 3 for labs
[14:10:30] cause we only use them on m1, m2, m3
[14:10:33] yea, but in the past I made the mistake of taking an early decision
[14:11:28] I am not disagreeing, I am just saying "I don't want to decide yet"
[14:11:37] oh, definitely
[14:11:56] but if we do that, we save 4 proxies we don't have to replace, m4 and m5 (active+spare)
[14:12:15] again, something to be discussed with Jaime and Manuel from the future :)
[14:12:53] you know my original plans for that, I don't think we have the time for those
[14:13:10] so I prefer not to plan again
[14:13:13] (for now)
[14:13:24] yep
[14:13:35] but just to be clear, that is just me
[14:13:43] worst case scenario, if something is burning we can use those 6 new ones to replace m1, m2 and m3, and ask to buy 3 more for labs
[14:13:45] my no comment was on purpose
[14:13:55] I neither agree nor disagree :-D
[14:14:04] I think we need a gifv for that
[14:14:19] I can go and comment that, if it is clearer, but I think it is not productive
[14:14:23] nah
[14:14:24] no need
[14:14:25] :-)
[14:14:37] it is actually more like
[14:14:46] I agree in theory, but I don't know
[14:14:48] ˜/marostegui 16:13> worst case scenario if something is burning we can use those 6 news to replace m1,m2 and m3, and ask to buy 3 more for labs -> we can always do that if we are in a rush to decom them for some reason
[14:15:13] for example, I think the codfw ones are more important
[14:15:21] because of dependencies
[14:15:24] yep
[14:15:47] and I know we decided to purchase those
[14:15:48] we can always make m3-codfw-master a CNAME of the db master on codfw like we do with m5 :p
[14:15:54] but I am not 100% convinced
[14:16:15] marostegui: I think that is already there for some services
[14:16:19] check dns
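The "check dns" step here amounts to grepping the zone templates in the DNS repo for existing mX-master records. A minimal sketch against an inline excerpt; only the m2-master → db2044.codfw.wmnet record is confirmed later in the log, everything else in the excerpt is made up for illustration:

```shell
# Sketch: find which misc (mX) sections already have a -master CNAME in
# the codfw zone data. Only the m2-master record below is confirmed by
# the log; the other line is illustrative filler.
zone='m2-master    5M  IN CNAME db2044.codfw.wmnet.
something    5M  IN CNAME db2099.codfw.wmnet.'

# On a real checkout of the DNS repo this would be something like:
#   grep -E '^m[0-9]+-master' templates/wmnet
printf '%s\n' "$zone" | grep -E '^m[0-9]+-master.*CNAME'
```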
[14:16:20] for codfw?
[14:16:23] yep
[14:16:25] for some
[14:16:30] let me see
[14:16:44] or grep m*-master.codfw on puppet
[14:16:46] m2
[14:16:50] only
[14:16:58] I knew at least one
[14:17:00] templates/wmnet:m2-master 5M IN CNAME db2044.codfw.wmnet.
[14:17:13] but we also don't have proper misc on codfw
[14:17:15] so we need to add m1, m2 and m3 and problem solved!!
[14:18:13] no failover for m1, and the other should be replaced
[14:18:49] yep, part of the next FY purchases :)
[19:48:24] 10DBA, 10Goal: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (10Papaul)
[20:43:51] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul)
[20:44:54] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul)
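The "add m1, m2 and m3" idea discussed above would mean zone-template entries in the same format as the m2-master record quoted from templates/wmnet. A hypothetical sketch only: the target hostnames are placeholders, since the log confirms just the m2-master → db2044.codfw.wmnet record, and pointing these CNAMEs at the codfw masters directly (as done for m5) is the alternative to a dedicated proxy there.

```
; Hypothetical additions to templates/wmnet (m2-master already exists);
; the <...> hostnames are placeholders, not real codfw masters.
m1-master    5M  IN CNAME <m1-codfw-master>.codfw.wmnet.
m3-master    5M  IN CNAME <m3-codfw-master>.codfw.wmnet.
```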