[04:09:28] DBA, Labs: page_lang column of the page table is not replicated to Labs - https://phabricator.wikimedia.org/T154355#3094592 (TTO) a:chasemp
[07:35:38] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3094702 (Marostegui) There are multiple errors on that host, related to memory and CPU (maybe it is the wrong DIMM bank affecting the CPU or the other way around as those can be related to each other)...
[07:54:28] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3094724 (Marostegui) In order to start getting ready to import s5 on sanitarium2 and labsdb1009,10,11 I am going to start: - Compressing InnoDB on db1070...
[07:55:41] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3094726 (Marostegui) db1030: ``` root@neodymium:/home/marostegui/databases_s2# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdb1030 $i...
[08:08:19] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3094745 (Marostegui) s6 is all done now except the master, I am going to start altering db1050, the master.
[08:37:37] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3094758 (Marostegui) labsdb1009,10 and 11 - replication stopped db1095 replication stopped and mysql down data transfer between db1095 and dbstore1001 is...
[08:54:35] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3094798 (Marostegui) I think we are good to go now \o/: ``` Automatically selected FileSet: mysql-srv-backups +--------+-------+----------+-------------------+--------------------...
[08:54:44] DBA, Operations, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3094799 (Marostegui) Open>Resolved
[09:01:34] labsdb1007 import is done, it took 50 hours
[09:01:44] wow, 50 hours
[09:46:16] DBA, Patch-For-Review: Import x1 on dbstore2001 - https://phabricator.wikimedia.org/T159707#3094934 (Marostegui) Open>Resolved This looks good and have been working fine since Friday evening.
[09:50:38] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3094946 (Marostegui) @Cmjohnson once you are back in the DC can you check if you have any spare BBU? Thanks!
[09:55:15] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3094959 (Marostegui) I just realised that db1070 doesn't have .ibd files because of this: T137191 So I think I will reclone that host from a host that do...
[09:56:05] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3094961 (Marostegui) Probably also do a re-image won't hurt.
[10:18:23] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3095032 (Marostegui) Taking a mysqldump from db1070 and sending it to dbstore1001 now.
[10:21:19] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3095040 (Marostegui) p:Low>Normal
[12:03:36] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3095262 (jcrespo) From T147769: > Description: CPU 1 has an internal error (IERR). > Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:40:20Z] Shutting down es2015 for hardware ma...
[12:04:15] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3095264 (Marostegui) So looks like the CPU is broken then and needs replacement. @Papaul let's dismiss the DIMM change and proceed to change that CPU that has failed twice now?
[13:07:00] jynus, marostegui: I've merged a mariadb change by mistake in https://gerrit.wikimedia.org/r/#/c/341576/
[13:22:49] luckily it doesn't seem to be the end of the world the mariadb change I've "undone" is https://github.com/wikimedia/operations-puppet-mariadb/commit/0cb1e6c4ed3d0efe3d7088dd20aea6dbc1710f56 if I'm not mistaken
[13:23:32] I think it should be enough to revert my change: https://gerrit.wikimedia.org/r/#/c/342455/
[13:23:38] waiting for your input though
[13:24:27] (and sorry for the mess!)
[13:25:22] how that was possible?
[13:26:36] I think because I forgot 'git submodule update' when rebasing my change
[13:26:55] then puppet-merge said:
[13:26:56] ------------------------------
[13:26:56] Emanuele Rocca: cache_misc: set timeout_idle to 120s (c8e9443)
[13:26:56] Merge these changes? (yes/no)? yes
[13:27:13] and I've missed the mariadb change in the diff output above
[13:27:42] but the file did show on gerrit?
[13:28:29] yes it did show up on https://gerrit.wikimedia.org/r/#/c/341576/
[13:29:27] I think it only kills mysqls on restart
[13:29:48] jynus: ok for me to revert by merging https://gerrit.wikimedia.org/r/#/c/342455/ ?
[13:30:27] yes, or just update to the latest version
[13:31:08] so you did "commit all changed files command"?
[13:31:26] because otherwise it has to be added manually for that to happen
[13:32:37] yup, I did git rebase origin/production and then git commit -a --amend
[13:32:41] sorry again, reverting
[13:32:55] I do not mind mistakes, those happen
[13:33:05] I want to understand why so it does not happen again
[13:33:35] basically, I would recommend against commit -a
[13:33:48] and self merges, one of the two
[13:34:52] done, we should be fine now
[13:35:30] I do not think so
[13:35:51] I think it reverted to a previous state
[13:36:01] mmh
[13:37:36] isn't 0cb1e6c4ed3d0efe3d7088dd20aea6dbc1710f56 the last commit?
[13:37:41] yes
[13:37:53] no
[13:38:00] a6c76942c83bff34364124805 is
[13:38:18] maybe it has differed on my branch
[13:38:45] on https://github.com/wikimedia/operations-puppet-mariadb/commits/master I see 0cb1e6c4ed3d0efe3d7088dd20aea6dbc1710f56 as the last one
[13:39:30] ok, yes
[13:39:51] I was confused with the id on OP/PU
[13:41:44] ok
[13:54:59] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3095626 (akosiaris) We've ended up promoting labsdb1007 to master, resyncing from planet.osm and pg_dump/pg_restore the various databases/tables....
[14:20:19] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3095661 (akosiaris) And we are done. The rest of the databases/tables have been copied over, the DNS record has been updated and DNS caches cleare...
[14:20:28] jynus: marostegui: done with labsdb1007.
[14:20:31] DBA, Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3095663 (Marostegui) db1089 done ``` root@neodymium:/home/marostegui/databases_s2# mysql --skip-ssl -hdb1089 enwiki -e "show create table revision\G" ****************...
[14:20:56] what would you recommend with 1006?
[14:21:03] dump the user dbs?
[14:21:11] copy the whole datadir just in case?
[14:21:31] copying the datadir as a disaster backup wouldn't be a bad idea I would say
[14:21:48] marostegui, but if it doesn't work on jessie it is worthless
[14:21:58] we can do that. But to be honest, it's just 2-3 users, I doubt it's worth it
[14:22:13] and most can just regenerate their data
[14:22:20] at least from what I last heard
[14:22:30] ah, ok ok :)
[14:22:42] I would just reimage it as a slave
[14:22:51] I can do that.. should be easy enough
[14:23:16] but let's give it a day in case something weird shows up
[14:23:21] +1
[14:23:22] which I very highly doubt
[14:23:28] I sent a notice
[14:23:33] ok
[14:23:36] saying that maintenance will be extended
[14:23:43] for the whole week
[14:23:55] so definitely not in a hurry
[14:24:12] I think faidon was only worried thinking we were blocked
[14:25:44] to be sure, maybe we can ping users to see if they see any unexpected problem
[14:25:58] that is a good idea
[14:28:04] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3095669 (jcrespo) @aude @MaxSem @Kolossos Can you verify your applications (e.g. restarting them) and see that they work as expected to be 100% th...
[14:31:46] correct
[14:32:36] not in a hurry, just trying to make sure that we're not blocked on anything and we can deliver this until the end of our deadline
[14:32:53] paravoid, I sounded desperate because the first method we tried didn't work at all
[14:33:03] then, it was just that the second method was slow
[14:33:13] *very slow*
[14:33:52] as technically labsdb1006 could die now and we didn't care, we want to take time to be sure nothing is broken
[14:34:36] sure :)
[15:20:59] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3095845 (Marostegui) This master is now finishing the ALTER on ruwiki, the last database. I have started it on dbstore1001 which was pending (db1069 a...
[15:41:05] DBA, Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3095875 (Marostegui)
[16:08:03] DBA, Analytics, Labs: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166#3095980 (Nuria) This seems like background work related to labs import rather than a task per se, moving to radar.
[16:09:23] DBA, Analytics, Labs: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166#3095985 (Nuria) Ping @JAllemandou Did you talk with labs team about this?
[16:38:21] DBA, Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3096053 (Marostegui) db1050 (eqiad master) is now done: ``` root@neodymium:/home/marostegui/databases_s2# for i in frwiki jawiki ruwiki; do echo $i;my...
[16:57:24] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3096110 (Papaul) Since we swapped CPU's in T147769 and we still have the same error, I will contact Dell once on site tomorrow for CPU replacement.
[17:42:01] DBA, Operations, ops-codfw: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3096318 (Marostegui) Sounds good - thank you! if you need to "justify" it, the idrac logs are here: T160242#3094702
[17:45:46] DBA, Patch-For-Review: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3096331 (Marostegui) ptwiki: differences on db1047 and dbstore1002. The checksum on svwiki stopped due to db1054 going down for maintenance. I will resume once it...
[18:49:24] DBA, Patch-For-Review: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3096513 (jcrespo) I am sorry :-(.
[19:36:23] DBA, Patch-For-Review: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485#3096612 (Marostegui) No worries whatsoever! This is now a background task that doesn't require much babysitting, so no problem at all!