[06:49:34] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3905346 (10Marostegui) MySQL has been stopped on labsdb1001 (it was already unavailable) and labsd...
[06:53:13] 10DBA: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888#3905350 (10Marostegui) a:05Papaul>03None
[06:53:53] 10DBA: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888#3899829 (10Marostegui) I would suggest we remove db2034 s1 from codfw rc service, reimage it, and make it the new master.
[10:14:24] https://phabricator.wikimedia.org/P6591#37194
[10:19:04] 10DBA, 10MediaWiki-Configuration, 10Operations, 10discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084#3905634 (10Joe) p:05Triage>03Normal
[10:19:49] 10DBA, 10MediaWiki-Configuration, 10Operations, 10discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084#3905634 (10Joe)
[11:07:13] jynus, marostegui: on neodymium there are two screens with "defragment_db1102" and "s7_check", but I'm not seeing any mysql process, so those are probably done/stale
[11:08:29] they are his- I believe they are idle, but I would like him to say so- in case there is something to save
[11:09:59] I pinged the internal channel earlier in the day and he said he moved his mysql work to sarin and logged off, but we can also wait until he's back
[11:12:04] let me get the output of the latest commands
[11:12:13] and let's reboot, we shouldn't be blocking you
[11:12:29] sorry, but we make heavy use of those servers
[11:14:38] uff, it is not a single session, he has 5 "bash tabs"
[11:16:25] it's fine from my end, we can also just wait until he's back
[11:16:48] all non-DBA use of neodymium is usually short-lived, for cumin
[11:19:17] there is a "wmf-auto-reimage -c -- mw1259.eqiad.wmnet" not executed on one screen
[11:20:21] on neodymium? not seeing it in "ps aux"?
[11:20:31] emphasis on NOT executed
[11:20:56] ah
[11:21:25] I think we should be good to go
[11:21:43] I should put a quota on manuel's screens anyway
[11:22:24] that reimage one was even from me; I closed it now (there was a brief period where wmf-auto-reimage needed to be run in a particular manner in screen)
[11:23:00] restart now
[11:23:06] ok, I'm sending a brief heads-up to the internal channel and then I'll reboot shortly
[11:23:27] there is no process ongoing
[11:24:14] yep, rebooting now
[11:28:52] neodymium is back up and can be used again
[11:45:54] 10DBA: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3905807 (10jcrespo) I am performing a quick check on these hosts (and at the same time, testing and improving `compare.py`) to double check there is no data loss before decommissioning them. I a...
[12:16:39] 10DBA, 10Operations, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3905851 (10jcrespo)
[12:16:42] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3905852 (10jcrespo)
[12:16:44] 10DBA, 10Patch-For-Review: run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038) - https://phabricator.wikimedia.org/T154485#3905849 (10jcrespo) 05Resolved>03Open Found at least 2 differences on cswiki:db2035:archive: ``` ./compare.py cswiki archive ar...
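[Editor's note: `compare.py` above is an internal WMF script whose source is not shown here. As a rough, hypothetical illustration of the chunked-checksum idea behind tools like it and pt-table-checksum, the sketch below hashes fixed-size primary-key chunks on two hosts and reports the chunks that differ. The table/column names and the sqlite3 backend are stand-ins for illustration only; the real tool speaks to MySQL replicas.]

```python
import hashlib
import sqlite3


def chunk_checksums(conn, table, pk, chunk_size=1000):
    """Yield (chunk_start_pk, checksum) for fixed-size PK chunks of `table`.

    Rows in each chunk are serialized and hashed; two hosts whose chunk
    checksums all match are very likely identical for that table.
    """
    ids = [r[0] for r in conn.execute(f"SELECT {pk} FROM {table} ORDER BY {pk}")]
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {pk} BETWEEN ? AND ? ORDER BY {pk}",
            (chunk[0], chunk[-1]),
        ).fetchall()
        yield chunk[0], hashlib.sha256(repr(rows).encode()).hexdigest()


def diff_tables(conn_a, conn_b, table, pk, chunk_size=1000):
    """Return the starting PKs of chunks whose checksums differ between hosts."""
    a = dict(chunk_checksums(conn_a, table, pk, chunk_size))
    b = dict(chunk_checksums(conn_b, table, pk, chunk_size))
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

Once a differing chunk is found, it can be re-split into smaller chunks (or compared row by row) to pin down the exact drifted rows, which is how a "2 differences on cswiki:db2035:archive" result gets narrowed down.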
[13:26:33] heh, I am a heavy screen user
[13:27:00] When I told moritzm I moved all my stuff to sarin I forgot to clean up my screens, but indeed there was nothing running there
[13:33:03] I can give you the latest screens if you want them
[13:33:39] no no
[13:33:41] no need to :)
[13:33:42] thanks
[14:12:35] I am thinking of doing an ubuntu->stretch upgrade of es200[1234]
[14:14:35] those hosts do not have explicit roles, just data
[14:17:02] it'll likely be a bit tricky since the upstart->systemd migration path is probably not covered very well in Debian, but could work out in the end
[14:17:21] they are pure disk hosts
[14:17:31] we could delete most of /
[14:17:48] alternatively, we could do a manual reinstall
[14:18:01] so the main partition doesn't get formatted
[14:18:06] but it is riskier
[14:20:26] we can test it first on es2004, which has no unique valuable data
[14:20:59] actually, an upgrade will probably work fine, as long as some dependency pulls in libpam-systemd (which pulls in the rest of systemd). and after the upgrade (when only stretch apt sources are present), aptitude would show all local packages (IOW Ubuntu packages which were not dropped during the cross-update)
[14:21:35] this is not a normal procedure, but honestly, es2001-4 should eventually be replaced
[14:21:48] but we have 20TB of data there we cannot lose yet
[14:21:54] and this should be easier
[14:22:57] I am going to change the apt sources to those of stretch and see what happens
[14:23:08] on es2004
[14:32:24] k, ping me on possible questions
[14:36:12] the main problem is going to be if there is a complex conflict or if it doesn't boot after restart
[14:37:16] marostegui: many thanks for handling T184832
[14:37:16] T184832: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832
[14:56:45] 10DBA, 10Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3906141 (10elukey)
[14:58:11] 10DBA, 10Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3363866 (10elukey) The first run completed without any errors, and then another one (cleaning up only daily data) ran as well setting the following: ``` INFO: line 617: Update /var/run/eventloggi...
[15:17:35] upgrading libc is horrible because of circular dependencies
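[Editor's note: the "aptitude would show all local packages" remark refers to aptitude's obsolete-package search (`aptitude search '~o'`): once only stretch sources are configured, any installed package whose version is not offered by the new repositories is an Ubuntu leftover. A minimal sketch of that set logic, with hypothetical package names and versions chosen purely for illustration:]

```python
def leftover_packages(installed, available):
    """Return packages installed locally whose exact version is not offered
    by any configured repository - what `aptitude search '~o'` reports after
    a cross-grade, once only the new distro's apt sources remain.

    installed: dict of package name -> installed version (from dpkg).
    available: dict of package name -> set of versions the repos offer.
    """
    return sorted(
        name for name, ver in installed.items()
        if ver not in available.get(name, set())
    )
```

In practice the leftovers would then be reviewed one by one: harmless Ubuntu remnants get purged, while anything still providing a needed service gets replaced by its Debian counterpart first.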
[16:27:26] the problem is that puppet didn't run properly because it was confused about the version it was running
[16:27:44] I will wait 20 minutes and try from the console
[16:32:15] dns is not working, though
[16:32:27] I think there is a systemd-resolv or something
[16:43:05] we need 7.8 TB free to reimage es2004
[16:43:49] or a 12TB removable device, which will be faster
[16:45:07] actually, we may have enough space on es2002
[16:45:31] random question: wouldn't it be easier to invest some effort in the capability of reimaging a server without destroying the existing /srv (or /a for very old ones) partition?
[16:46:01] volans: no, because a copy is still needed
[16:46:27] the reimage can (and has) gone wrong in the past, and there is no guarantee that it will come back up
[16:46:59] so we still need to perform copies before reimages
[16:47:15] in other cases, we do not have the time to send and retrieve 12TB
[16:47:40] I could argue that an in-place upgrade might break the data too, although it's less likely ;)
[16:47:50] but yeah I get your point
[16:47:51] well, that is why I tested it
[16:48:09] it is a nice thing to have for regular reinstalls, but not every time
[16:48:39] for core servers, we have 20 copies, so losing one is no big problem
[16:49:27] yeah
[16:49:59] I have freed 9.1T on es2002
[16:50:18] how can you free up 9.1T just like that? XD
[16:50:23] I will do a reinstall and copy them over ssh
[16:50:32] volans: do you know what else would be nice
[16:50:42] having a prepared netboot rescue system
[16:50:52] that is way more important than a custom recipe
[16:51:55] the d-i should be able to give you a shell and optionally mount the target partitions IIRC
[16:52:23] yeah, that is what I am going to do
[16:52:43] but wouldn't it be nice to have a full, proper rescue image?
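[Editor's note: the back-of-envelope check behind "we need 7.8 TB free" vs "I have freed 9.1T on es2002" can be written down as a tiny helper. The 5% safety margin is an assumption for illustration, not anything stated in the conversation; on a live host the free-byte figure would come from something like `shutil.disk_usage("/srv").free`:]

```python
TB = 1000 ** 4  # decimal terabytes, as used when quoting "7.8 TB" / "9.1T"


def can_hold(free_bytes, dataset_bytes, safety_margin=0.05):
    """True if a filesystem with `free_bytes` free can safely receive a copy
    of `dataset_bytes`, keeping a margin (assumed 5%) for growth while the
    multi-hour transfer is in flight."""
    return free_bytes >= dataset_bytes * (1 + safety_margin)
```

With the numbers from the log, 9.1 TB free comfortably holds the 7.8 TB dataset, while an exactly-sized destination would be rejected by the margin.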
[16:53:15] I am arguing that in opposition to a custom reinstall image that only serves a particular partition setup
[16:53:49] I guess it could help in some cases
[17:30:51] volans: now that I notice, there is already a recipe to reinstall without formatting /srv, but I don't like to use it without copying the data
[17:50:16] jynus: es2004 looks just fine from my quick look, nice work. it currently doesn't run the latest kernel, though (the KPTI kernel is installed, but it needs another reboot)
[17:53:36] I am about to reboot it
[17:53:41] the installer doesn't do that automatically
[17:53:48] but I have to upgrade mariadb too
[18:20:44] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3906844 (10jcrespo)
[18:20:46] 10DBA, 10Operations, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3906843 (10jcrespo)
[18:20:49] 10DBA, 10Patch-For-Review: run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038) - https://phabricator.wikimedia.org/T154485#3906841 (10jcrespo) 05Open>03Resolved s2 master and decommed server were checked- tables: ``` archive ar_id logging log_id pa...
[18:55:52] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 2 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#3907034 (10Anomie) p:05Triage>03Normal
[18:56:09] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 2 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#3907047 (10Anomie) a:05Anomie>03None FYI, the full planned process is: # Preparation:...
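[Editor's note: the "KPTI kernel is installed, but it needs another reboot" situation - a newer kernel on disk than the one running - is detectable by comparing `uname -r` against the newest installed kernel package. Real tooling does more (Debian's needrestart, checking /boot, flavour suffixes); the sketch below only compares version strings as numeric tuples, and the version strings in the test are hypothetical examples:]

```python
import re


def _ver_key(version):
    """Turn a kernel version string like '4.9.0-5-amd64' into a sortable
    tuple of its numeric components, e.g. (4, 9, 0, 5, 64)."""
    return tuple(int(x) for x in re.findall(r"\d+", version))


def reboot_needed(running, installed):
    """True if the newest installed kernel differs from the one currently
    running (e.g. a KPTI-patched kernel installed but awaiting a reboot).

    running: the `uname -r` string; installed: list of installed kernel versions.
    """
    return _ver_key(max(installed, key=_ver_key)) != _ver_key(running)
```

This mirrors the situation above: right after the dist-upgrade the new kernel is merely installed, and the check stays true until the host is rebooted into it.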