[00:11:27] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [01:06:23] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) I tried to PXE boot the first server, on the switch side everything looks good since I can see that the switch learned the MAC addres... [01:06:43] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Dzahn) @ayounsi @RobH These servers have an install issue where they get a DHCP ACK followed by "Serving stretch-installer/debian-installer/a... [01:07:57] 10DBA, 10Operations, 10ops-codfw, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Dzahn) [02:00:06] 10DBA, 10Operations, 10ops-codfw, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) The issue was in the BIOS setting. The boot mode was set to UEFI after changing it to BIOS it works. [03:11:18] 10DBA, 10Operations, 10ops-codfw, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [05:11:22] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1126.eqiad.wmnet'] ` The log can be found in `/v... 
[05:23:22] 10DBA, 10Patch-For-Review: Prepare to decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Marostegui) [05:23:47] 10DBA, 10Patch-For-Review: Prepare to decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Marostegui) db2034 is no longer a master, I will give it 24h before starting the decommissioning steps [05:27:43] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1126.eqiad.wmnet'] ` and were **ALL** successful. [05:27:55] 10DBA, 10Operations, 10ops-codfw, 10Goal: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui) I can confirm db2103 looks good. ` root@db2103:~# free -g total used free shared buff/cache available Mem:... [05:29:49] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1126 installed correctly: ` root@db1126:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sha... [05:29:59] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [05:31:24] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1128.eqiad.wmnet', 'db1129.eqiad.wmnet', 'db1130... 
[05:35:02] 10DBA, 10Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui) [05:35:35] 10DBA, 10Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui) p:05Triage→03Normal [05:37:54] 10DBA, 10Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui) [05:46:51] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1130.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1130.eqiad.wmnet'] ` [05:49:04] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1130.eqiad.wmnet'] ` The log can be found in `/v... [05:50:06] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1128 and db1129 installed correctly: ` root@db1128:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used fr... [05:50:38] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [06:04:41] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1130.eqiad.wmnet'] ` and were **ALL** successful. 
[06:05:30] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1127.eqiad.wmnet'] ` The log can be found in `/v... [06:06:50] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1130 has been installed correctly: ` root@db1130:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used fre... [06:07:08] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [06:21:56] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1127.eqiad.wmnet'] ` and were **ALL** successful. [06:22:29] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) @RobH @Cmjohnson I have seen that the idrac for db1127 was working already so I have grabbed the MAC for the NIC and added the DHCP entry for it. So... [06:23:20] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [06:23:55] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1131.eqiad.wmnet', 'db1132.eqiad.wmnet', 'db1133... 
[06:47:29] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1131 installed correctly: ` root@db1131:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sh... [06:47:42] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [06:48:01] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1132.eqiad.wmnet', 'db1133.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1132.eqiad.wmnet', 'db113... [06:50:14] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1131.eqiad.wmnet', 'db1132.eqiad.wmnet'] ` The l... [07:07:31] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1131.eqiad.wmnet', 'db1132.eqiad.wmnet'] ` and were **ALL** successful. [07:12:22] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) I am troubleshooting db1133's RAID, which is OFFLINE due to several disks being OFFLINE [07:18:32] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1134.eqiad.wmnet', 'db1135.eqiad.wmnet', 'db1136... 
[07:36:20] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1136.eqiad.wmnet', 'db1134.eqiad.wmnet', 'db1135.eqiad.wmnet'] ` and were **ALL** successful. [07:38:50] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1132 installed correctly: ` root@db1132:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sha... [07:39:23] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [07:39:57] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1133.eqiad.wmnet'] ` The log can be found in `/v... [07:40:33] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) >>! In T211613#5163463, @Marostegui wrote: > I am troubleshooting db1133's RAID, which is OFFLINE due to several disks being OFFLINE The RAID is now... [08:17:17] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1133 had another issue: On reboot to go for an install and while connected on the idrac this is what I get: ` Unified Server Configurator does not... 
[08:35:07] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) And now db1133 on reboot: ` FW could not sync up config/prop changes for some of the VD's/PD's Press any key to continue, or 'C' to load the configur... [08:36:22] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1137.eqiad.wmnet', 'db1138.eqiad.wmnet'] ` The l... [08:52:34] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1138.eqiad.wmnet', 'db1137.eqiad.wmnet'] ` and were **ALL** successful. [08:56:40] 10DBA, 10Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (10Marostegui) [08:59:04] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) db1137 installed correctly: ` root@db1137:~# free -g ; megacli -LDInfo -Lall -aALL ; df -hT /srv total used free sha... [08:59:22] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [09:00:04] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) The only pending host to install is db1133 which is having issues and we need on-site help from @Cmjohnson (T211613#5163570) - I have already pinged... 
[09:04:57] DBA, Goal: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (Marostegui)
[09:21:46] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[10:05:57] DBA, Operations, Patch-For-Review: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[10:06:04] DBA, Operations, Patch-For-Review: correctable memory errors db1068 (commons primary master database) - https://phabricator.wikimedia.org/T213664 (Marostegui)
[10:50:47] so I am having some issues with ongoing compression
[10:50:54] I hope those will be fixed now
[11:47:28] jynus: I am going to start provisioning hosts soon, any specific tool/way you want me to use to populate them that might help you with backups testing/bug hunting?
[11:47:49] so it is not as much what I would want
[11:47:57] but as if I could be of help
[11:49:15] I have here the code https://gerrit.wikimedia.org/r/#/c/operations/software/wmfmariadbpy/+/500043/
[11:49:23] to transfer an existing snapshot
[11:49:49] but I only have snapshots of x1 and s6 at the moment
[11:50:41] DBA, Goal, Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (Marostegui)
[11:51:22] jynus: do you want to merge that and provision the x1 and s6 hosts to test it?
[11:51:50] I don't know, what were you going to do?
[11:52:05] https://phabricator.wikimedia.org/T222682
[11:52:11] That is the list of hosts and where they will go
[11:52:14] no, I mean
[11:52:19] how were you going to do that?
[11:52:45] I have this up: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508523/ to start with x1, but I am happy if you want to use your tool with those in x1
[11:52:48] so you can start polishing it
[11:53:01] jynus: I was going to use the backups sources to provision (most likely)
[11:53:04] remember also transfer.py works without stopping the server
[11:53:10] so don't use those
[11:53:19] because I am in the middle of compressing them
[11:53:28] gotcha
[11:53:35] or ping me first
[11:53:37] not a problem
[11:55:05] you can use the s1/s6 one
[11:55:15] backup sources?
[11:55:25] let me suggest using transfer.py --type=xtrabackup with mysql stopped
[11:55:31] sounds good
[11:55:34] not stopped
[11:55:38] replication stopped
[11:55:41] repl stopped?
[11:55:42] yeah
[11:55:42] that
[11:55:48] so that host would be
[11:56:24] oh
[11:56:29] that is eqiad
[11:56:38] do you want to do the x1 ones with https://gerrit.wikimedia.org/r/#/c/operations/software/wmfmariadbpy/+/500043/ to polish that script?
[11:56:40] I am compressing eqiad
[11:56:42] ah
[11:56:54] no problem, I will take a normal slave for now, not a big deal :)
[11:57:06] I will use xtrabackup with a normal slave
[11:57:38] let me see which ones have ongoing compression
[11:57:46] db1139:3311
[11:57:58] db2099:3315
[11:58:12] db2098:3312
[11:59:17] and one other
[12:01:17] db2098:3313
[12:01:38] so on eqiad you can use any of them except s1
[12:01:44] excellent
[12:02:24] keep updating here which ones you are using, I will do the same
[12:02:32] cool!
[12:02:33] will do
[12:03:16] so FYI
[12:03:20] pending changes are
[12:03:44] https://gerrit.wikimedia.org/r/502828 Allow new option --stop-slave for xtrabackup transfers
[12:04:46] which will be eventually used by snapshotting https://gerrit.wikimedia.org/r/501546
[12:05:11] and https://gerrit.wikimedia.org/r/500043 "Allow for a 3rd transfer type: decompression"
[12:05:29] which will be the base for getting a snapshot recovered
[12:05:33] they are all WIP?
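Editor's note: the coordination rule in this exchange (do not provision from an instance whose snapshot is still being compressed) can be sketched as a simple lookup. The four `host:port` entries are copied from the chat; the names `BUSY`, `parse_instance` and `is_free` are hypothetical and are not part of any WMF tool.

```python
# Instances reported in the chat as having ongoing compression.
# Keeping this list in code is purely illustrative.
BUSY = {
    ("db1139", 3311),
    ("db2099", 3315),
    ("db2098", 3312),
    ("db2098", 3313),
}


def parse_instance(spec):
    """Split a 'host:port' string into a (host, port) tuple."""
    host, _, port = spec.partition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {spec!r}")
    return host, int(port)


def is_free(spec, busy=BUSY):
    """Return True if the instance is not in the busy set."""
    return parse_instance(spec) not in busy
```

With this sketch, `is_free("db1140:3320")` is true while `is_free("db1139:3311")` is false, which matches the choice of db1140:3320 as the source later in the log.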
[12:05:52] WIP meaning code written but untested/not deployed
[12:06:17] what works and is tested is the --type=xtrabackup, which is used to take the snapshots
[12:06:31] and you can use it too
[12:06:34] so can I stop replication to use db1140:3320?
[12:06:44] yes
[12:07:00] it may be faster for you
[12:07:07] but also may help debug issues
[12:07:34] improve documentation, etc
[12:07:46] especially I would like you to get accustomed to the syntax
[12:07:53] and give me feedback
[12:07:56] I will use transfer.py --type=xtrabackup then
[12:08:16] host:socket host:sqldata_location
[12:09:05] one important think
[12:09:07] *thing
[12:09:32] make sure you don't use the same port twice for 2 transfers to the same host
[12:09:53] it will complain, but better not try
[12:09:58] haha
[12:09:59] ok
[12:10:16] the source and the place where you run it don't matter
[12:10:37] only the target host (in general, I only do 1 transfer to a host at a time)
[12:11:01] you can run as many as you want at the same time to different hosts
[12:13:56] so it is something like --type=xtrabackup db1140.eqiad.wmnet:/run/mysqld/mysqld.x1.sock db1127.eqiad.wmnet:/srv/sqldata
[12:14:00] right?
[12:14:16] probably you want --no-checksum --no-encryption too
[12:14:22] yep
[12:14:37] but I believe that is
[12:14:49] also remember that doesn't prepare the backups for you
[12:15:03] yeah
[12:15:25] it is transfer because it only handles the transfer :-D
[12:15:31] hehe yeah
[12:15:32] the other parts are WIP
[12:15:37] I will document all these commands for now
[12:16:01] in the future, that should be done with
[12:16:13] should I do it here? https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[12:16:19] transfer.py --type=decompression
[12:16:33] so snapshots have already been pre-prepared
[12:16:44] but we don't have all of them yet
[12:17:05] maybe we should document transfer.py on its own separate page
[12:17:09] and link from backups
[12:17:24] yeah, I just want to create a quick page there to dump all these commands for now
[12:17:29] I was planning to do that
[12:17:41] let's create a cheatsheet on that wikipage?
[12:17:42] sure, do that on a separate page
[12:17:50] I will fill it in later
[12:18:11] because later that will not be part of the backups
[12:19:20] we will just have a provision command that will run that transparently
[12:19:46] e.g "provision s3 db1150"
[12:20:19] and it should never touch existing dbs
[12:22:01] started at https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Backups_quick_cheatsheet
[12:22:01] it already says
[12:23:06] To generate remote snapshots, the Transfer.py script is being used for the first part of the backup, installed on the cluster management server. Every day, a cron job runs the daily_snapshot.py script and sends the snapshots to the provisioning hosts using transfer.py. Then it runs backup_mariadb.py locally on the provisioning host in order to post-process the generated files and gather the metadata statistics.
[12:23:22] http://en.wikipedia.org/wiki/Special:Search?go=Go&search=Transfer.py linked where documentation is intended
[12:23:36] Transfer.py linked where documentation is intended
[12:23:41] with [[
[12:23:43] and ]]
[12:23:56] transfer.py is a generic utility, installed on the cluster management (orchestration) host (e.g. cumin1001) to transfer files over the network, but it has a switch, --type=xtrabackup, that also allows transmitting the mysql files of a live mysql server in a consistent way.
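Editor's note: a minimal sketch of the invocation discussed above, assuming only what the chat states: `--type=xtrabackup`, `--no-checksum`, `--no-encryption`, a `host:socket` source and a `host:sqldata_location` target. The order of the options is an assumption. The chat does not show how a port is passed to transfer.py, so no port option is included; the registry class is a hypothetical helper that only illustrates the "never reuse a port on the same target host" rule.

```python
def build_transfer_command(source, target, checksum=False, encryption=False):
    """Build an argv list for transfer.py using only flags named in the chat.

    source: 'host:/path/to/socket', target: 'host:/path/to/datadir'.
    """
    argv = ["transfer.py", "--type=xtrabackup"]
    if not checksum:
        argv.append("--no-checksum")
    if not encryption:
        argv.append("--no-encryption")
    argv += [source, target]
    return argv


class TargetPortRegistry:
    """Hypothetical bookkeeping: one (target host, port) pair per transfer."""

    def __init__(self):
        self._in_use = set()

    def claim(self, target_host, port):
        key = (target_host, port)
        if key in self._in_use:
            raise RuntimeError(f"port {port} already in use on {target_host}")
        self._in_use.add(key)

    def release(self, target_host, port):
        self._in_use.discard((target_host, port))
```

Claiming the same port on two different target hosts is allowed in this sketch, which mirrors the statement that any number of transfers can run at once as long as they go to different hosts.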
[12:26:05] you should add the --prepare command
[12:26:23] Yeah, will add it once I have built it/run it
[12:26:25] ( I forgot once and had to retransmit the whole thing again)
[12:26:29] :-P
[12:26:55] and where to run it (cumin/mysql_root_host)
[12:27:08] feel free to change things :)
[12:27:14] I still have more documentation pending to add
[12:28:07] there are a few red links on that page I created on purpose
[12:30:49] I explain here that there is not yet a good recovery utility https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Recovering_a_Snapshot
[12:31:01] because it is WIP
[12:31:45] does it at least work for you?
[12:34:36] it is working yeah :)
[12:51:47] Anything else besides --prepare --target-dir and --innodb-buffer-pool-size that you use for the prepare?
[12:55:43] check what I am using in the prepare function
[12:55:48] that is mostly it
[12:55:58] cannot remember all
[12:56:24] yeah, I was doing it right now
[12:56:29] looks like that is it indeed
[13:00:26] at the moment the code is the documentation :-D
[13:00:42] hahaha
[13:01:05] but only because in the final state, one should never use transfer directly
[13:40:20] i can change the bbu for db1093 whenever you are ready
[13:40:31] let me depool it
[13:46:39] cmjohnson1: I am stopping mysql, will ping you once I power the host off
[13:48:09] cmjohnson1: db1093 is now off, you can proceed
[13:52:18] db1093 is powering on
[13:53:57] that was fast!
[13:54:00] thanks!
[13:54:21] cmjohnson1: did you see my message about db1133? that is the last one pending, not sure if you'll have time for it today?
[13:58:23] DBA, Operations, ops-eqiad, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui) Chris has changed the BBU and I can already see it: ` root@db1093:~# hpssacli controller all show detail | grep -i battery No-Battery Writ...
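Editor's note: the prepare step discussed at 12:51:47 is run separately, since transfer.py only handles the transfer. The sketch below assembles an argument list from the three options named in the chat (`--prepare`, `--target-dir`, `--innodb-buffer-pool-size`). The binary name `xtrabackup`, the function name and the example values are assumptions; the real option set lives in the prepare function the chat refers to, which is not shown here.

```python
def build_prepare_command(target_dir, buffer_pool_size, binary="xtrabackup"):
    """Build an argv list for the prepare step.

    Only the three option names come from the chat; everything else
    (binary name, value formats) is an illustrative assumption.
    """
    if not target_dir.startswith("/"):
        raise ValueError("target_dir must be an absolute path")
    return [
        binary,
        "--prepare",
        f"--target-dir={target_dir}",
        f"--innodb-buffer-pool-size={buffer_pool_size}",
    ]
```

The function only builds the list and does not execute anything, so it can be reviewed or logged before being handed to a process runner on the target host.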
[14:41:13] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [14:51:49] s1 compression on db1139 finished [14:52:01] nice! [14:52:14] how much is it? [14:52:31] 900 GB [14:52:47] cool [14:55:06] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1133.eqiad.wmnet'] ` The log can be found in `/v... [15:00:48] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [15:01:01] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) @Cmjohnson re-created the RAID on site, but it is still showing up as degraded, so this host might need further troubleshooting. Not a big priority n... [15:26:19] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [15:35:59] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [15:42:08] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) @Marostegui @jcrespo please hold on to db2114. It looks like the system has some Hardware issues, I am investigating. 
[15:43:18] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui) Thanks Papaul We won't do anything to any of the hosts until we've got the green light from you in this ticket Thanks! [15:48:38] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) db2114 Critical,Tue 07 May 2019 10:04:30,Fan redundancy is lost., Normal,Tue 07 May 2019 10:03:33,The fans are redundant., Critical,... [15:51:10] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) it looks like we are missng FAN 3B i am going to open the server and double check Status Name Type PWM (% of Max) RPM System Board... [16:00:58] 10DBA, 10Operations, 10ops-eqiad, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1133.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1133.eqiad.wmnet'] ` [17:25:21] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) I swapped FAN 3 with FAN 5 still have the same issue so the problem is not the FAN it has to be on the main board. I will contact DE... 
[17:29:16] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [18:59:32] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [19:04:13] 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) [19:05:16] 10DBA, 10Operations, 10ops-codfw, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) @Marostegui @jcrespo please fell free to take this task. I open T222753 to track down the problem on db2114. [19:05:34] 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) p:05Triage→03Normal [21:39:03] 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Papaul) I spoke with Dell, they are going to replace the main board. Enterprise Service Request 990336510 [22:10:23] 10DBA, 10Jade, 10Operations, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) [22:37:17] 10DBA, 10Jade, 10Operations, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) @jcrespo Yes, it reflects our current understanding.