[05:42:58] DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4048735 (kaldari) I agree that we should have consulted with the DBAs earlier. We went through 2 TechComm RfCs earl...
[06:52:16] DBA, Cloud-Services, Operations, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4048789 (Marostegui) Open>Resolved I am going to consider this resolved for now, as it hasn't happened again. Thanks everyone involved...
[07:43:10] good morning, I can help with checking misc services for tomorrow's failover. as there's some content on that wiki page from 2016 it's a little confusing, "Deleted/archived schemas" are all schemas which were dropped as part of the 2016 failover, right? then I would remove those from the wiki page
[07:45:08] moritzm: Thanks! In some cases the database is still there but unused, that is why it is not removed (I guess), but yeah we should review it and update it
[07:50:49] JFTR; I've "reset" the list of owners/volunteers on the wiki page, since it contained the 2016 people and e.g. Arzhel signed up in addition
[08:38:23] I'll add the things I am responsible for as well
[08:38:31] <3
[08:58:48] marostegui: could you update the socket location if you are rolling restarting es* servers?
[09:00:06] ah!
[09:00:08] I forgot it
[09:00:10] I will do :)
[09:00:31] btw, can you take a look at: https://gerrit.wikimedia.org/r/#/c/419359/
[09:25:16] es2001:/srv/backups/latest/dump.x1.2018-03-13--22-57-37$ ls | wc -l -> 939
[09:25:31] \o/
[09:25:52] wikishared is a bit large, but the others seem appropriate
[09:26:33] it is weird to see enwiki.gz.tar :-)
[09:27:09] and metadata is not compressed
[09:27:51] (that is expected)
[09:28:07] so it is looking good overall then?
[09:28:26] yes, except the time that it took to do that, which has to be checked
[09:28:41] also x1 is tiny compared with s3 which hasn't been done yet
[09:29:19] but that's expected
[09:29:22] I also think I have to work on rotating, because the time calculation for purging old ones may not be working with the new format
[09:34:09] wait, reviewdbs both on m1 and m2 ?
[09:34:43] IIRC you guys discussed this like yesterday, 2 days ago. one of them is a copy or something ? should we just archive/delete it ?
[09:37:21] akosiaris: https://wikitech.wikimedia.org/wiki/MariaDB/misc#owners,_(or_in_many_cases_just_people_that_volunteer_to_help_for_the_failover)
[09:37:49] I think it is already deleted, but it could be around somewhere, I will double check
[09:38:18] there is some you may know something about, like testotrs?
[09:38:33] jynus: ah then there is a typo. I am removing it from the owners section for m1
[09:38:47] let me check first
[09:38:54] in case it is still there
[09:39:01] wow
[09:39:10] Deleted/archived schemas has bacula in it ?
[09:39:21] ok, I am declaring that page slightly corrupted
[09:39:34] yeah, that is an error
[09:39:54] let me get all the lists and I will fix those
[09:40:26] the dbs on the hosts are not particularly easy to follow, too
[09:41:35] let me start by creating a ticket just about the failovers
[09:41:37] there's a testotrs you ?
[09:41:44] said*
[09:41:51] I think, let me generate the full list
[09:41:56] kill it with fire if there is.
it should be only in codfw though
[09:42:01] and we have then perfect, updated information
[09:42:20] 1 minute
[09:45:42] DBA, Operations: Switchover m1 master to a newer host - https://phabricator.wikimedia.org/T189655#4049075 (jcrespo) p:Triage>Normal
[09:45:53] akosiaris: m1 https://phabricator.wikimedia.org/T189655
[09:46:22] DBA, Operations: Switchover m1 master to a newer host - https://phabricator.wikimedia.org/T189655#4049089 (jcrespo) a:Marostegui>None
[09:48:16] jynus: thanks
[09:49:00] DBA, Operations: Switchover m2 master to a newer host - https://phabricator.wikimedia.org/T189656#4049095 (jcrespo) p:Triage>Normal
[09:49:02] akosiaris: and m2 https://phabricator.wikimedia.org/T189656
[09:49:39] so testotrs remove?
[09:50:02] yes, kill it with ice and fire
[09:50:09] and a song on top of it :P
[09:50:39] and we archived blog, so I am going to assume to archive testblog
[09:50:45] and reviewdb is no longer on m1 I see
[09:50:50] nice
[09:51:01] I am going to update the page
[09:51:10] ok
[09:51:11] but please help me if you see something weird
[09:55:10] akosiaris: I think for m1 we will wait for a period with no ongoing backups, if such a thing exists, stop bacula and do the switch?
[09:57:28] yup
[09:58:55] so I would do tomorrow m2
[10:00:05] I think I can handle gerrit- I already broke it yesterday
[10:08:13] lol, ok
[10:23:48] there will be some spike on lag while I drop testotrs on db1051 and db2044
[10:25:14] DBA, Operations: Switchover m2 master to a newer host - https://phabricator.wikimedia.org/T189656#4049207 (jcrespo)
[10:26:22] DBA, Operations: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4049095 (jcrespo)
[10:29:52] DBA, Operations: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655#4049215 (jcrespo)
[12:06:11] akosiaris: I am not sure what you mean with your last comment on 419341
[12:06:25] the "this needs fixing"
[12:07:03] you mean bacula? puppet?
[12:07:26] the /var/lib/bacula/log thing
[12:07:35] it's an easy fix I guess
[12:07:43] it should be /var/log/bacula clearly
[12:07:49] ah, sorry
[12:07:52] I didn't see that
[12:08:04] I was focused on the error message
[13:59:30] DBA, MediaWiki-Change-tagging, Schema-change: change_tag table needs redesign - https://phabricator.wikimedia.org/T164167#4049843 (daniel) @TK-999 your proposal assumes that change_tag will have //either// a RC ID //or// a log ID //or// a revision ID. This is not the case! change_tag rows typically h...
[14:12:04] DBA, MediaWiki-Change-tagging, Schema-change: change_tag table needs redesign - https://phabricator.wikimedia.org/T164167#4049894 (TK-999) Ohh, that is true, I'll have to rethink it. Still, it might make sense to consider a different schema in the long run to avoid the previously described issues...
[15:05:38] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4050038 (jcrespo)
[15:05:40] DBA, Operations, Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#4050039 (jcrespo)
[15:05:43] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050035 (jcrespo) stalled>Open
[15:06:00] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (jcrespo) a:jcrespo
[15:08:20] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1002.eqiad.wmnet'] ``` The log can be found in `/var/...
[15:09:13] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050055 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1006.eqiad.wmnet'] ``` The log can be found in `/var/...
[15:19:41] DBA, Operations, Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4049095 (Marostegui)
[15:34:36] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050165 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1006.eqiad.wmnet'] ``` The log can be found in `/var/...
[15:37:52] DBA, Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050180 (Marostegui) @jcrespo is it ok to proceed with this or you're still checking it?
[15:39:03] DBA, Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050184 (jcrespo) If you don't mind leaving it like that for some more time, so I can run pt-table-checksum on all misc sections?
[15:39:51] DBA, Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050190 (Marostegui) Sure - that's perfectly ok! :-)
[15:41:13] I have reimaged dbproxy1002 and dbproxy1006
[15:41:27] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050207 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1002.eqiad.wmnet'] ``` and were **ALL** successful.
[15:41:47] that means all proxies there are on stretch and with the right socket location
[15:42:00] I have now to test the firewall and enable it on the active ones
[15:42:03] <2
[15:42:05] <3
[15:42:12] <4
[15:42:20] <5
[15:46:03] I actually don't regret testing the proxy in advance
[15:46:31] we would have found the bug earlier or later
[16:07:50] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050305 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1006.eqiad.wmnet'] ``` and were **ALL** successful.
[17:04:37] DBA, Operations, hardware-requests, Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050605 (RobH)
[17:06:55] DBA, Operations: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050617 (jcrespo) a:Marostegui>jcrespo wait, robh, I will take this for now- not yet ready for decom.
[17:07:50] DBA, Operations: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050629 (RobH) I wasn't taking it, was merely tagging in all decom requests with #hw-requests. I left it assigned to @Marostegui ;]
[17:08:23] DBA, Operations, hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050631 (jcrespo) ok.
[17:13:57] DBA, Operations, hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050652 (Marostegui) Yeah I never use the other tags till we have it ready from the DBA side, to avoid all the noise for the DC Ops :)
[17:14:38] DBA, Operations, hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050666 (jcrespo) Oh, if they want the noise, they will get it here :-P
[17:17:11] DBA, Operations, hardware-requests, ops-eqiad: Decommission db1043 - https://phabricator.wikimedia.org/T187542#4050670 (RobH)
[17:27:14] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#4050687 (RobH) a:RobH>Cmjohnson Please note that the switch port for this host was not labeled & doesn't show in ethernet switching table. So @Cmjohnson we...
[17:58:56] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050907 (jcrespo) All proxies are now on stretch except the ones for labsdbs (10 and 11).
[17:59:08] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050908 (jcrespo) a:jcrespo>None
[18:51:36] DBA, Operations, ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4051149 (Cmjohnson) Disk replaced and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Sp...
[18:55:51] DBA, Operations, ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4051161 (Marostegui) Thanks!!!
[18:59:13] DBA, Operations, hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4051170 (RobH) >>! In T189216#4050652, @Marostegui wrote: > Yeah I never use the other tags till we have it ready from the DBA side, to avoid all the noise for the DC Ops :) no worries, I wasn't sure...
[19:42:32] DBA, Operations, ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4051365 (Marostegui) Open>Resolved All good! Thanks a lot! ``` root@db1073:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name...
[23:28:40] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4052028 (RobH)
[23:38:41] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4052057 (RobH)
[23:38:54] DBA, Operations, hardware-requests, ops-codfw: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3978461 (RobH) a:RobH>Papaul Ok, this is now ready for onsite wipe.