[00:07:36] DBA, Operations, Release-Engineering-Team, cloud-services-team, wikitech.wikimedia.org: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3911639 (bd808)
[00:10:03] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s3 - https://phabricator.wikimedia.org/T167973#3911644 (bd808) Should we merge this into {T184805}? From the #cloud-services-team side we don't have a concern about which particular prod slice the t...
[06:29:58] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s3 - https://phabricator.wikimedia.org/T167973#3911949 (Marostegui) We still don't have a clear idea what will be moved to where, but it is good to know you guys don't really mind. Thanks! At the m...
[06:42:08] DBA, Patch-For-Review: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888#3899829 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2034.codfw.wmnet'] ``` The log can be fou...
[07:09:40] DBA, Patch-For-Review: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888#3911977 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2034.codfw.wmnet'] ``` and were **ALL** successful.
[08:25:08] DBA: Check data consistency across production shards - https://phabricator.wikimedia.org/T183735#3912050 (jcrespo)
[08:25:12] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3912051 (jcrespo)
[08:25:14] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3912049 (jcrespo) Open>Resolved
[08:56:53] so these compare.py could be also nice to be run (not very often) to force innodb checksums and reveal existing corruption
[08:59:03] yeah
[08:59:30] however it boots. Maybe an installer only issue?
[08:59:43] I just installed db2034
[08:59:47] early in the morning with no issues
[08:59:58] no, I mean an installer issue on db2036
[09:01:02] Could be if there is an underlying HW when formating partitions or something
[09:03:57] no, this is pre-formating, it is detecting disks
[09:08:10] now it worked, very strange
[09:08:27] I would re-install it again, just to see if it works again
[09:08:37] I am going to mark it on mediawiki-config with "unstable"
[09:08:41] and create a ticket
[09:08:51] I want to forget about it
[09:09:13] ok
[09:09:16] no ticket was created yesterday, right?
[09:09:21] no
[09:10:41] do you have the paste where you copied the error handy?
[09:10:49] yeah
[09:10:51] give me a sec
[09:11:00] thank you, and sorry to bother you
[09:11:11] not bothering!
[09:11:25] https://phabricator.wikimedia.org/P6611
[09:14:29] DBA, Operations: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912108 (jcrespo)
[09:14:59] DBA, Operations: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912119 (jcrespo) Open>stalled
[09:15:08] [ 0.151226] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
[09:15:19] I didn't check the hw log
[09:15:22] time to upgrade its firmeware?
[09:15:24] firmware?
[09:15:25] maybe there is controller failures
[09:16:15] Maybe we should check with papaul and upgrade all the firmwares possible
[09:18:10] ah
[09:21:13] DBA, Operations: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912120 (jcrespo) These are the latest logs from the hw: ``` 12 Repaired Drive Array 01/16/2018 15:31 01/16/2018 15:31 1 Internal Storage Enclosure Device Failure (Bay 6, Box 1, Port 1I...
[09:21:34] a disk was recently replaced there
[09:22:03] yeah
[09:22:06] this
[09:22:17] firmware was apparently flashed on 2015
[09:22:31] https://phabricator.wikimedia.org/T184836
[09:23:43] nothing conclusive, though
[09:23:51] copying data should be a good test
[09:24:25] indeed
[09:24:31] we should upgrade all the firmwares too
[09:25:02] DBA, Operations: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912124 (Marostegui) This server got a disk replaced a few days ago: T184836
[10:10:59] DBA, Operations, ops-codfw: db2036 storage issues? (mysql crashed, installer issues) - https://phabricator.wikimedia.org/T185294#3912217 (Marostegui) Probably we should try to upgrade BIOS, raid controller etc...
[10:16:33] Hey, I want to do this in Monday: https://phabricator.wikimedia.org/T185032
[10:17:02] It will increase number of rows in wbc_entity_usage table to some degrees (depends on the wiki)
[10:17:34] but it doesn't make the number of rows e.g. twice
[10:18:25] it's needed to ease the RC injection problem and jobqueue issue
[10:18:29] Amir1: I would suggest you enable it on, let's say, s7 and wait a few days. As next week there will be lots of people traveling and coverage will be somewhat reduced
[10:19:21] How does that sound?
[10:19:23] marostegui: that makes sense
[10:19:52] s7 for Monday, and others around Wednesday or Thursday?
[10:22:32] Those days most of us are traveling
[10:22:38] and thursday is the first all-hands day
[10:22:58] s7 on Monday that's fine I would say, the others, I am not sure about them
[10:36:36] marostegui: okay, it seems All hands is until the weekend, let's do it on the next Monday
[10:38:45] you mean after the all hands, right?
[11:28:56] marostegui: yup
[11:29:09] that sounds good yeah, after all hands :)
[11:29:34] note it is not only the wmf meeting, there is a developers meeting on those dates, too
[11:29:52] yeah, the dev summit
[11:31:43] DBA, Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3912375 (jcrespo)
[11:35:26] DBA, Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3912396 (jcrespo)
[11:53:28] DBA, Patch-For-Review: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3912443 (jcrespo)
[12:26:26] DBA, Operations, ops-eqiad: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3912507 (jcrespo)
[12:30:21] DBA, Operations, ops-eqiad: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3912510 (jcrespo) a:jcrespo>Cmjohnson Chis, these 2 are ready to be unracked or whatever it is its end of life (after being wiped).
[13:13:45] DBA, Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3912576 (Marostegui) tag_summary is now fixed. Next: user_newtalk
[16:01:48] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3912890 (Marostegui) s8 is only pending the master. And once it gets done, we will need to sanitize the tables on sanitarium+labs...
[16:14:02] I think there is something wrong with db2036- db2018 is less powerful, but it is taking much less time to catch up from replication
[16:14:43] yeah, could be storage related again :(
[16:14:58] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=3&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops
[16:15:17] (the difference is double, the graph is log-y)
[16:17:39] maybe the disk that we replaced is somewhat damaged or something?
[16:17:45] (it was a reused disk as far as I know)
[17:36:39] DBA: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913251 (jcrespo)
[17:40:32] DBA: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913256 (jcrespo)
[17:42:34] DBA, Operations, ops-codfw: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913257 (jcrespo) a:jcrespo>Papaul Papaul, these 7 old hosts are ready to go, and we should make room for others.
[17:51:52] DBA, Operations, ops-eqiad: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3913281 (jcrespo)
[18:07:06] DBA, Operations, ops-codfw: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913305 (Papaul) a:Papaul>jcrespo @jcrespo thanks. Can you please do the steps below and assign the task back to me. Thanks Disable puppet on host Rem...
[18:10:15] DBA, Operations, ops-codfw: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913309 (RobH) a:jcrespo>RobH Please note that we shoudl do those steps, not Jaime, since he cannot disable the switch port (which has to be done at the...
[18:13:26] DBA, Operations, ops-codfw: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913318 (jcrespo) Note I have not a problem to do those if told, but specially disabling puppet should be done just before literally shutting down the servers an...
[18:19:54] DBA, Operations, hardware-requests, ops-codfw: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913329 (RobH)
[18:20:21] DBA, Operations, hardware-requests, ops-eqiad: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3913331 (jcrespo)
[18:37:26] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913366 (RobH) Switch ports for later removal (once they are unracked): ge-6/0/0 - db2016 ge-6/0/1 - db2017 ge-6/0/...
[18:39:09] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913368 (RobH)
[18:44:42] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913381 (RobH)
[18:45:14] DBA, Operations, hardware-requests, ops-codfw, Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3872109 (RobH) a:RobH>Papaul Ok, these are all ready to have disks wiped, unracked, and racktables updated....
[18:45:26] DBA, Operations, hardware-requests, ops-codfw: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3913384 (RobH)