[06:12:07] DBA: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773500 (Marostegui)
[06:14:26] DBA, MediaWiki-Change-tagging: db1100 replication broken - https://phabricator.wikimedia.org/T180917#3773512 (Marostegui) Open>Resolved a:jcrespo Resolving for now...we'll see if it happens again on different servers too
[06:25:51] DBA: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773523 (Marostegui) From the syslog servers: ``` Nov 20 00:44:05 db2068 kernel: [10055954.989428] hpsa 0000:02:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap- En- Exp=1 Nov 2...
[06:27:59] DBA: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773524 (Marostegui) Open>Resolved a:Marostegui A reboot fixed it, MySQL started fine and it is now catching up. RAID looks fine also. Closing this for now, if this happens again we'll probably need a RAID controller replacement.
[06:29:13] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1063.eqiad.wmnet ``` The log can be found i...
[06:40:35] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3773534 (Marostegui)
[06:42:19] DBA, Performance-Team, Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3773541 (Gilles) **prepared-edit** indicates that this is part of edit stashing. Which is preemptive proces...
[06:58:05] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3773578 (Marostegui) I don't want to hijack this thread, but I do believe the #operations tag should still be present. As Dereckon said, the process rarely...
[06:58:07] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3773579 (Marostegui) I don't want to hijack this thread, but I do believe the #operations tag should still be present. As Dereckon said, the process rarely...
[07:00:12] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3773595 (Marostegui) When would you like to do this?
[07:27:37] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773605 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1063.eqiad.wmnet'] ```
[07:27:40] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773606 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1063.eqiad.wmnet ``` The log can be found i...
[07:27:57] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773607 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1063.eqiad.wmnet'] ```
[07:28:13] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773608 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1063.eqiad.wmnet ``` The log can be found i...
[07:52:37] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3773617 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1063.eqiad.wmnet'] ``` and were **ALL** successful.
[08:01:57] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Drop the "wb_terms.wb_terms_language" index - https://phabricator.wikimedia.org/T179106#3773619 (Marostegui)
[08:28:22] DBA, Operations, Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3773638 (Marostegui) I don't think migrating to ROW is something we can actually do now after seeing the breakage caused when s5 master died (T180714) and there was a schema change on-going....
[09:42:19] DBA, Cloud-Services, Cloud-VPS, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3773751 (jcrespo) This is still valid, labsdb1006 is still not setup and labsdb1007 is a single point of failure.
[09:42:28] DBA, Cloud-Services, Cloud-VPS, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3773752 (jcrespo) a:jcrespo>None
[09:44:53] DBA, Performance-Team, Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3773755 (jcrespo) Ok, you can triage as low/close if it should happen, but maybe some wait timeout can be t...
[09:47:17] DBA, Performance-Team, Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3773775 (jcrespo) Note CategoryMembershipUpdates are around half of the errors right now.
[09:50:01] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3773776 (jcrespo) Contradicting Marostegui, I do not think either #Operations or #DBA should be tags here. The fact that @Marostegui offered to help in past...
[10:59:18] DBA, Operations, ops-codfw: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773892 (jcrespo) Resolved>Open Maybe related: T102236 We need a BIOS upgrade and the HW logs.
[11:01:31] DBA, Operations, ops-codfw: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773901 (Marostegui) a:Marostegui>Papaul @Papaul can you help us with the BIOS upgrade? @jcrespo there were no HW logs from the crash, there are only the typical ones AFTER the crash that doesn't say...
[11:10:01] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3766852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1110.eqiad.wmnet ``` The log can be found in `/var/l...
[11:10:09] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3773915 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1109.eqiad.wmnet ``` The log can be found in `/var/l...
[11:16:44] DBA, Operations, ops-codfw: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3773925 (jcrespo) From the "health log": ``` 4 Critical Drive Array 11/20/2017 00:33 06/10/2015 16:05 2 Drive Array Controller Failure (Slot 0) ```
[11:32:10] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3773962 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1110.eqiad.wmnet'] ``` and were **ALL** successful.
[11:34:02] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3773973 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1109.eqiad.wmnet'] ``` and were **ALL** successful.
[12:05:12] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774043 (alanajjar) @Marostegui Are you here now?
[12:22:40] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774102 (Marostegui) yes, go ahead if you like Please paste the progress URL so we can check it too
[12:24:31] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774106 (alanajjar) Thanks @Marostegui [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/KnudW|The progress]]
[12:34:31] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3774129 (Marostegui) db1063 has been rebuilt and it is now catching up. Before putting it back as vslow, I am going to optimize wb_terms table as we have been doing...
[12:35:13] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3774130 (Marostegui)
[12:53:15] DBA, Operations, Patch-For-Review, Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3774176 (Marostegui) The scope of this ticket is actually all done. So I would suggest we close it and do any amends or follow ups taken from the IR: https://wikitec...
[12:55:06] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3774178 (alanajjar) @Marostegui can we preform this one also?
[12:55:32] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3774179 (Marostegui) Let's wait for the other one to finish first.
[12:56:02] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3774181 (alanajjar) @Marostegui Yes, of course (Y)
[12:59:44] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3774192 (Steinsplitter)
[13:02:03] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3774215 (Steinsplitter) a:Marostegui>None
[13:34:29] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774277 (Marostegui) This looks done, feel free to resolve and start the other one if you like.
[13:45:40] hello people, I prepared a patch to run a daily cron job to sanitize the log database (with eventlogging_cleaner.py) https://gerrit.wikimedia.org/r/#/c/391828/
[13:46:10] just wanted to let you know and have your opinion before pulling the trigger
[13:46:21] this should finally complete the job of EL sanitization
[13:46:31] I am fine with it, you have more idea about the data itself than me :)
[13:46:49] Does it have a dry-run?
[13:47:21] I have already ran it in batches up to 90 days ago, all good on db1108 :)
[13:47:28] then…
[13:47:32] * marostegui going to +1 it
[13:47:57] only for db1108 as you said in the commit then?
[13:48:27] yes I put the hack of installing it for debian only, so the trusty hosts will not get anything.. I hope to kill the log database on them soon
[13:48:37] :)
[13:49:49] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774331 (alanajjar) @Marostegui Thanks for help. I prefer to wait @Linedwell to preform T180903 because it's a request on GlobalRenameQueue and I don't n...
[13:50:02] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774335 (alanajjar) Open>Resolved
[13:50:29] DBA, Operations, Wikimedia-Site-requests: Global rename of Knud Winckelmann → KnudW: supervision needed - https://phabricator.wikimedia.org/T180703#3774340 (Marostegui) cool thanks
[13:51:52] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Angr → Mahagaja - https://phabricator.wikimedia.org/T180946#3774344 (Steinsplitter)
[13:58:24] Blocked-on-schema-change, DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, and 3 others: Deploy dropping wb_entity_per_page table - https://phabricator.wikimedia.org/T177601#3774360 (chasemp)
[13:59:01] Blocked-on-schema-change, DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, and 3 others: Deploy dropping wb_entity_per_page table - https://phabricator.wikimedia.org/T177601#3664286 (chasemp)
[14:16:12] BTW, if you saw s5 replication lag graph, it seems much better after ROW migration
[14:16:32] maybe it is semi-sync
[14:17:39] That is one of the things we always discussed, whether it would be indeed better with ROW or not
[14:19:31] should I enable semisync on db1070 and see how it goes?
[14:19:40] sure, let's try
[14:19:55] or in db1063 if you like ;)
[14:20:06] ah sorry, i thought you said db1071
[14:20:07] :)
[14:20:25] * marostegui still needs to adapt to the new master, always thinks it is db1071
[14:20:27] well, it has to be on the master first
[14:20:34] yeah i know ;)
[14:25:04] it says 7 clients
[14:25:20] but I am not sure how accurate is that, as I have not restarted replication on any of them
[14:25:31] maybe it works if it was already enabled?
[14:25:38] I will restart it on 63 and 71
[14:25:49] cool
[14:26:02] fyi: db1063 is depooled and will remain depooled till tomorrow (optimizing wb_terms)
[14:35:21] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3774462 (Linedwell) Rename in progress: [[ https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Sigma%27am | link to the progress ]]
[15:15:59] DBA, Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3774602 (Linedwell) Open>Resolved a:Linedwell
[16:56:19] DBA, Operations, ops-codfw, Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3774972 (Papaul) The ILO is up to date. I need to update the Storage and BIOS on the system but the Service pack disk that i have is old, there is a new Service pack ISO on the HP site t...
[21:11:12] DBA, Performance-Team, Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3775891 (Gilles) a:aaron
[21:19:21] DBA, Performance-Team, Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3775910 (Krinkle) p:Triage>High