[07:12:15] DBA, Operations, Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Marostegui) pc1007 is now serving. I have also updated tendril and zarcillo to reflect that it is the master for pc1. pc1010, pc2007 a...
[07:12:29] DBA, Operations, Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Marostegui)
[07:12:36] DBA, Operations, Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Marostegui) Open→Resolved
[07:27:00] DBA, Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (Marostegui)
[07:28:46] DBA, Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (Marostegui)
[07:51:57] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui)
[08:20:28] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui)
[08:51:56] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (1997kB)
[08:53:06] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (Marostegui) start now if you like
[08:54:02] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (1997kB) ok starting
[08:55:03] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual →
Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (Marostegui) whenever you can, paste the progress URL here so we can track the progress Thanks!
[08:55:31] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (1997kB) [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Dennis_Radaelli|the progress]]
[08:56:02] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (Marostegui) Thank you!
[08:59:08] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (jcrespo) Let's add @Anomie here once so he can verify this didn't affect ongoing actor migration as per https://wikitech.wikimedia.org/wiki/Deployments#Week_of_Ja...
[09:02:35] marostegui: I think there are replication issues on s2
[09:03:15] yeah
[09:03:17] I am seeing those
[09:03:19] itwiki is there
[09:03:29] so it is probably the rename
[09:03:38] as itwiki was the one with more edits there
[09:03:44] I "only" see the rc slaves delayed
[09:04:10] it is recovering now
[09:04:34] it takes a lot to edit
[09:04:51] now only db1105 is delayed
[09:04:53] still doing the rename
[09:05:11] it is recovering now
[09:05:18] recovered
[09:07:16] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (1997kB) Rename successfully completed. Thanks.
[09:08:12] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (Marostegui) For what is worth: this caused lag on the two rc slaves: db1103:3312 and db1105:3312
[09:13:27] there is lag on s1 labsdb, is that expected?
[09:13:45] yep
[09:13:47] that is me
[09:13:51] ok, np then
[09:13:56] I am doing an alter table which involves changing triggers, so better be safe :)
[09:14:05] sure
[09:15:37] I am going to upgrade db2078 while wikidata config deployments are ongoing
[09:15:40] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (Marostegui) Open→Resolved a:1997kB
[09:15:42] ok
[09:15:44] (multi-misc)
[09:21:55] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (fgiunchedi)
[09:28:04] this is the second time in a short time that a host gets bricked after a restart
[09:28:19] powercycling usually unbricks them
[09:28:54] what do you mean?
[09:29:16] host shuts down but it does not come back
[09:29:32] ?????????? (invalid characters) on screen
[09:29:33] oh really?
[09:29:36] :|
[09:29:50] is it always the same type of host?
[09:29:52] I mean, vendor
[09:30:21] yes, but I think different batches -eqiad vs codfw
[09:30:57] how long you waited for it to manually power cycle it?
[09:30:58] I notice because usually the new ones boot up very quickly
[09:31:07] enough
[09:31:23] in fact, the ????? is printed on a normal boot
[09:31:37] but it seems to get stuck mid it
[09:31:45] is it a hp or a dell?
[09:31:51] dell
[09:32:10] I restarted a bunch for upgrades before christmas and didn't find any issues, I guess I didn't hit that batch :(
[10:08:04] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui)
[10:27:50] "[*** ] A stop job is runniung for the System Logging Service (1min 30s / 1min 30s)"
[10:28:14] what's that?
[10:28:30] where db1091 spent most of the time shutting down
[10:28:57] oh
[10:31:29] this is unrelated to the previous issue
[10:31:33] this is an hp
[10:35:34] I wonder if we could/should make a copy of db1115 before shutting it down, at least the metadata tables
[10:35:57] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui)
[10:36:13] jynus: yeah, that is a good idea
[10:36:31] maybe a binary copy
[10:36:50] Oh, it is 1.7TB
[10:37:00] how is that possible
[10:37:23] we'd still have time to copy that somewhere else
[10:37:25] yeah, that is why I was suggesting a logical copy of just zarcillo and the tendril metadata (not the stats)
[10:37:37] yeah, let's do that I would say
[11:01:15] no plans to add or remove any hosts in the following hours, right?
[11:01:33] nope
[11:08:06] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui)
[11:16:57] I am going to start purging cruft on tendril
[11:17:08] old binary logs and logs that are taking a lot of space
[11:17:26] great
[11:17:33] I am not touching tendril
[11:17:42] ANd I will head out for lunch in like 30mins too
[11:17:44] cannot guarantee that will not create some slowdowns due to io
[11:17:57] yes, I was going to do the same due to meetings maintenance
[11:18:02] yeah
[11:18:02] just wanted to give you a heads up
[11:18:04] sure
[11:18:15] I am not going to run anything that might require tendril to help us monitoring
[11:19:56] down to 1.3T on the first rm
[11:20:17] (old binary logs while those are disabled now)
[11:20:47] true!
[13:10:47] DBA, Analytics, Analytics-Kanban, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui) Let's wait for {T213706} to be done before we migrate the existing copy of `stagingdb` to any of the hosts.
[13:11:02] DBA, Analytics, Analytics-Kanban, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[13:24:09] DBA, Operations, Performance-Team: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (Marostegui) a:aaron→Marostegui I have been talking to @Joe and we have decided, just be on the safe side. We will increase the TTL just by 2 days, and ma...
[13:31:46] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui)
[14:10:55] Marostegui schools were cancelled here again today. I will be getting to data center but not for another 2.5 hours. Have to drop my kid at a friends house. Ref: es1019 and db1119
[14:23:08] cmjohnson1: will get back to you in a few, im in a meeting with jaime and m.ark
[14:31:12] please be safe, cmjohnson1
[14:31:15] DBA, Wikimedia-Site-requests: Global rename of SuperVirtual → Dennis Radaelli: supervision needed - https://phabricator.wikimedia.org/T213796 (Anomie) >>! In T213796#4880227, @jcrespo wrote: > Let's add @Anomie here once so he can verify this didn't affect ongoing actor migration as per https://wikitech....
[14:32:14] jynus: you think you will be here till late? I think I am going to logoff in around 1h, so we either re-schedule with cmjohnson1 for some other day, or if you had plans to stay a bit longer, you can do it with him. Up to you! I am fine re-scheduling!
[14:33:09] yes, I can attend chris, I got in late today
[14:33:19] ok! :)
[14:33:31] cmjohnson1: jynus will be here for you then :)
[14:33:43] leave comments for m*rk on the doc however if you have some
[14:33:48] yep
[15:38:49] jynus: I am going to logoff, please text me if something goes bad with the maintenance and you need help!
[15:43:30] np
[16:43:55] DBA, Operations, ops-eqiad: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (jcrespo) @Cmjohnson The most likely scenario is that we move the dimm and we keep detecting 96GB of ram, and then we will ask you to ask for a replacemen...
[16:48:04] jynus...i am on-site and can do es1019 now...just need to unplug and wait 10 secs
[16:49:32] ok, shutting it down
[16:49:44] I prepared everything except the actual shutdown
[16:49:54] as I wouldn't be able to power it on
[16:50:28] it is going down now
[16:50:30] great
[16:50:53] did you see my comment about searching for an updated mgmt firmware?
[16:50:55] if you want can you take db1119 down as well so I can move DIMM around
[16:51:01] no
[16:51:07] the same thing happened 5 times already
[16:51:17] so maybe worth it ?
[16:51:17] for es1019? I can update f/w
[16:51:22] it will be down a little longer
[16:51:25] that is ok
[16:51:31] note the other is db1115
[16:51:37] NOT db1119
[16:51:53] give me another second for db1115
[16:52:15] es1119 should be off by now
[16:53:42] oh..sorry
[16:58:33] db1115 is going down now too
[16:58:53] great!
thx
[16:59:22] free reign to handle T196726#4873183
[16:59:22] T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726
[16:59:50] and T213422#4881615
[16:59:51] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422
[17:01:22] DBA, Operations, ops-eqiad: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (Cmjohnson) racadm SEL root@db1115.mgmt.eqiad.wmnet's password: /admin1-> racadm getsel Record: 1 Date/Time: 01/18/2018 11:23:25 Source: sys...
[17:06:06] DBA, Operations, ops-eqiad: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (Cmjohnson) i swapped the dimm from a2 to b2 and cleared the log. Please put back in the rotation and let's see if and where the error occurs.
[17:06:22] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH) @fgiunchedi: Thanks for updating about the ms-be systems! I see you added they can be gracefully powered down, can we just power them back up and ensure puppet runs post...
[17:07:26] jynus db1115 is powering up
[17:07:31] DBA, Analytics, Operations, ops-eqiad: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (RobH) >>! In T213748#4881709, @RobH wrote: > @fgiunchedi: Thanks for updating about the ms-be systems! I see you added they can be gracefully powered down, can we just power t...
[17:08:06] cool, I will check if it detects the stick at all
[17:08:26] because if not, it clearly is faulty :-)
[17:08:58] yeah, the 32 GB came back
[17:09:10] we will see soon if there are errors
[17:09:19] I will put the server back into service for now
[17:11:08] yeah...unfortunately I have to do this test to see where the error goes...i may have to swap CPU's but we will wait and see
[17:11:41] es1019 is in the middle of updates
[17:12:06] should I take a half an hour break?
[17:12:31] I am not in a hurry, but I don't want to keep you waiting either
[17:12:57] I'm not waiting....I just have to wait for the install to complete
[17:13:14] yeah, I mean after it completes
[17:13:19] While it's down...i am going to update the raid controller f/w as well (if that's okay)
[17:13:26] thanks
[17:13:36] just power it up when/ if it completes
[17:13:47] and I will be back in case you need me
[17:13:48] okaky
[17:13:55] okay
[17:14:03] I will ping you here
[17:14:04] thanks for that, cmjohnson1
[19:58:54] DBA, Operations: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (Marostegui) p:Triage→High
[20:01:39] DBA, Operations: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (Marostegui)
[20:10:22] DBA, Operations: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (Marostegui) I would like to aim for Thursday at 7AM UTC
[20:21:58] DBA, Operations: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (Marostegui) @jcrespo I have created the usual failover checklist
[20:33:17] DBA, Operations: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (Marostegui) p:Triage→High
[20:41:21] DBA, Operations: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (Marostegui) This needs to happen before Thursday 17th
[21:27:34] DBA, Operations: s3 master emergency
failover (db1075) - https://phabricator.wikimedia.org/T213858 (Marostegui) @Anomie I will let you know if we need to pause your migration script - as the lag in codfw would make the failover harder. Once we have agreed on a date/time we will talk to you!
[21:40:14] DBA, Operations: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (Anomie) Thanks for letting me know about the failover. It will probably kill the script anyway when the old master goes away, or at least whichever s3 wiki it happens to be processing at the time. I won't...