[05:49:40] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3293855 (Marostegui) >>! In T166344#3293358, @Volans wrote: > I've ack'ed the Icinga alarm with this task. > > I've also forced a BBU learn cycle on db1016, it was looking good during the cy...
[05:51:12] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3293868 (Marostegui) It is now showing Optimal again: ``` BatteryType: BBU Voltage: 4074 mV Current: 0 mA Temperature: 32 C Battery State: Optimal BBU Firmware Status: Charging Status...
[05:57:50] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3293870 (Marostegui)
[06:05:14] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3293872 (Marostegui)
[06:43:13] DBA, Labs: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3293932 (jcrespo) @russblau No, new databases are in read only mode. We are unsure how we can support that and how (because of T156869), so for now all new servers are in full read only mode. You can create databases on to...
[07:29:49] enwiki will take more than today to check, right?
[07:31:25] oh yeah
[07:31:34] a week more?
[07:31:37] you want me to stop so you can check db2055 behaviour?
[07:31:40] probably yes
[07:31:44] no
[08:21:09] so my working theory is that if the buffer pool is empty, like those servers just being restarted, and it has slower io, pt-table-checksum may populate the servers with suboptimal data
[08:21:19] on peak times
[08:21:26] so it may not be the hardware
[08:21:51] could be
[08:21:59] because it was running for all day yesterday
[08:22:05] until around 10pm or so
[08:22:11] think revision or tables
[08:22:22] like pagelinks
[08:22:29] much larger than the buffer pool
[08:22:39] but rarely being used
[08:22:43] while on active servers
[08:22:58] it soon gets filled with the hotter rows
[08:23:43] Yeah, could be
[08:23:46] What we can do is
[08:24:00] I will stop pt-table-checksum (don't want to leave it running during the weekend)
[08:24:03] you can reboot the server
[08:24:05] and we can leave it there
[08:24:12] and on monday we can check how it behaved
[08:25:54] no
[08:25:58] leave it running
[08:26:11] it is not a big problem
[08:26:19] and we can try when it finishes
[08:26:49] I am going to reimage other servers anyway
[08:26:54] We'll see, I am not completely convinced about it (leaving it running)
[08:29:44] well, if it was bad, other hosts would complain
[08:29:48] specially on codfw
[09:14:34] It actually may be the firmware
[09:14:43] db2049
[09:16:24] it worked??
[09:16:58] it still complains about the power stuff, but that may not have been disabled by papaul
[09:17:11] but it boots up, wow
[09:18:53] something is wrong
[09:18:58] installation failed
[09:19:30] what does it say
[09:19:43] nothing
[09:19:53] lovely
[09:20:00] install_console doesn't work
[09:20:08] I think it is the same problem as before
[09:20:18] just I wasn't fast enough to see it
[09:21:43] nope
[09:21:46] it looks good to me
[09:22:06] how?
[09:22:09] can you login?
[09:22:13] https://phabricator.wikimedia.org/P5488
[09:22:15] Via ilo
[09:22:27] let me see what puppet says
[09:22:37] so it got the right passwords?
[09:22:45] because that was what it was missing last time
[09:22:52] I ran puppet and it worked fine
[09:22:56] you should be able to login now
[09:22:57] normally
[09:23:06] The last Puppet run was at Fri May 26 09:14:45 UTC 2017 (6 minutes ago).
[09:23:10] that is when I logged in
[09:23:19] the installation says it failed
[09:23:28] so something went wrong
[09:23:48] ps aux
[09:23:51] but if it is software this time
[09:23:51] grr
[09:24:02] what?
[09:24:10] No, that I just typed ps aux here
[09:24:13] and not in the terminal
[09:24:33] I am going to restart it again
[09:24:36] sure
[09:25:03] I want to see the full install process
[09:25:07] And I would even reimage it again to see if it fails again
[09:25:08] yeah
[09:25:15] this could be an installation issue rather than hw
[09:25:36] but db2055 worked fine yesterday no?
[09:25:50] but I did not reimage it
[09:26:01] just upgraded it then downgraded because of the issue
[09:26:19] ah ok ok I thought you reimaged it
[09:29:25] Hello ! I'm looking for help with excel data on corporations and countries money !
[09:31:37] DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3294123 (Marostegui) a:Marostegui>None
[09:31:39] marostegui: it booted ok, udev started, mount points worked ok
[09:31:56] zewny: not sure this is the right channel for that :|
[09:32:08] jynus: that is good then..let's see what else fails
[09:32:17] at least we saw it booting up and /srv and / mounted finally
[09:32:20] *finely
[09:32:39] DBA: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294#3294125 (Marostegui) a:Marostegui>None
[09:32:56] marostegui: checking tables with primary keys is so much better
[09:33:02] it can now be automated
[09:33:31] I think the reimports also fixed many of the watchlist issues
[09:36:59] yeah, it was a good decision :)
[09:50:34] hehe 19 hours to checksum revision
[09:50:39] (just started)
[09:52:36] marostegui: should we expect some lag like yesterday? :)
[09:52:57] silence db1047
[09:53:20] volans: Hopefully not, I will not leave it running anyway during the weekend
[09:53:25] jynus: will silence it now
[09:53:46] ok
[10:00:28] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3294196 (jcrespo) dbstore1002:watchlist is ok, the errors I found were false positives; now checking more reliably due to the primary keys. Checking now oldimage.
[10:08:45] DBA, Analytics-Kanban, Operations: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294201 (elukey) Hello @Marostegui, thanks a lot for the heads up! I checked `megacli -AdpBbuCmd -a0` again and this is the status: ``` BBU Capacity Info for Adapter: 0 Relative State of Charg...
[10:09:02] DBA, Analytics-Kanban, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294202 (elukey)
[10:09:13] DBA, Analytics-Kanban, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (elukey) a:elukey
[10:10:50] DBA, Analytics-Kanban, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294205 (Marostegui) Hello, If you are planning to keep that host for a long time (which I assume so) - I would definitely replace the BBU. I think @Cmjohnson might have spares fr...
[10:11:18] marostegui: less than a minute response, woa :D
[10:11:42] I was actually reviewing all the BBU issues we currently have
[10:11:51] So you got me with the task handy! :)
[10:12:45] DBA, Analytics-Kanban, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (jcrespo) Everything you say is correct. We are decommissioning many
DBA, Analytics-Kanban, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294211 (elukey) Yes let's replace the BBU, will wait for a confirmation from @Cmjohnson then!
[10:13:38] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294212 (Marostegui)
[10:13:39] ouch I answered before Jaime's response
[10:13:47] this is a good point
[10:15:08] let's see if we have any spares then
[10:15:19] We probably do
[10:15:20] thanks a lot for the heads up!
[10:15:28] should we audit all BBUs at this point?
[10:15:47] volans: I think with Jaime's new alert we will start seeing them :(
[10:15:56] we already do on databases and suggested such for other services
[10:16:05] on my email
[10:16:08] when they broke and already switch to the slow policy
[10:16:22] I was thinking to check the remaining charge and be a bit more proactive
[10:16:34] sure, but that is offtopic here
[10:16:44] I am not saying it cannot be discussed
[10:16:59] I am saying it has nothing to do with databases themselves
[10:17:24] you should ask in operations-
[10:17:55] basically, no decision related to general stuff should be taken here or people will complain
[10:18:01] marostegui: one thing that I noticed only now on db1046 is that "Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU"
[10:19:04] elukey: Probably the forced relearn worked, we have seen that before, it works for a few days or so sometimes and then goes back to fail
[10:19:05] maybe manuel forced it, I think he commented the details on the ticket
[10:19:22] yes, I forced the relearn
[10:19:25] didn't volans do something similar on db1016 yesterday?
[10:19:26] super
[10:19:27] sometimes it takes quite some time
[10:19:34] I did force a learn cycle
[10:19:38] force a relearn?
[10:19:42] yes jynus and it is currently on WB again
[10:19:44] and nothing else?
[10:19:47] yep
[10:19:48] it was ok when charging
[10:19:48] yep
[10:19:57] maybe we have to also
[10:19:58] then went bad when the learn completed
[10:20:00] schedule
[10:20:08] learn cycles
[10:20:14] I disabled them fleet wide
[10:20:27] because unscheduled ones were worse than nothing
[10:20:32] yes, 100% agreed
[10:20:40] I used to in the past, but I was in a situation where we accepted to have forcewriteback when running those
[10:20:43] to avoid the performance issues
[10:20:49] it's a trade off
[10:20:57] but volans I do not think there is a widespread issue
[10:21:10] I think we just have lots of servers we have to decommission
[10:21:15] that are 5+ years old
[10:21:17] I am worried that forcing a relearn might affect the BBU too and break it, don't know
[10:21:37] it could detect it as broken, sure
[10:21:41] this should be the focus: https://phabricator.wikimedia.org/T134476
[10:21:54] +1
[10:22:05] if it is not the BBU
[10:22:08] it will be the raid
[10:22:11] or the disks
[10:22:19] or the cpu burning
[10:22:25] yeah, or they'll die upon a restart
[10:22:26] we just need to stop using them
[10:22:52] cough T156844 cough
[10:22:52] T156844: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844
[10:23:04] xddd
[10:23:10] the others are totally our responsibility
[10:23:18] db1016
[10:23:21] db1048
[10:23:24] db1046
[10:23:32] it will only get more and more common
[10:23:40] And db1031
[10:24:11] Yeah, this week we got 3 of them db1031, db1016 and db1046
[10:24:34] we can force a relearn cycle on reimage
[10:24:40] at least
[11:07:09] I will leave db2049 recovering its 1 week old backup
[11:07:33] ok
[12:36:11] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3294365 (Marostegui) db2019 (codfw master) finished with all the alters and they are getting replicated downstream ```
root@neodymium:/home/mar...
[12:56:23] Seconds_Behind_Master: 629831
[12:57:21] or 1.041 weeks
[12:57:26] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294405 (Ottomata) How soon is likely to happen? Early next FY or later? If within Q1, I'd say let's just wait and replace the box. Otherwise, let's fix the BBU. Eh?
[12:58:21] Let's see how long it takes to catch up
[12:59:06] lag is going down
[12:59:13] relatively early and fast
[12:59:24] so that would match the issues with s1 only
[12:59:40] and not the kernel or the version
[12:59:45] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294408 (Marostegui) It is probably worth saying that the BBU might have been broken for a long time. We noticed because of the new check, but it would be too much of...
[13:00:49] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294409 (jcrespo) I agree with Manuel. while I would like to do the replacement ASAP, in reality it is not going to happen until Q2 or later.
[13:00:52] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294410 (Ottomata) > for a long time still. Agree but how long! It is slated for replacement next FY year sometime, right? Maybe we can just do it sooner rather th...
[13:02:14] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294411 (jcrespo) The reasoning is that labsdb has priority, and it is even on the best interest of analytics to do that first, if I understood correctly CC @Nuria
[13:06:17] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294413 (elukey) @Ottomata if Chris finds a BBU among the spare parts that we have I'd say that we can do it asap, it should be a relatively painless downtime fo...
[13:08:43] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294415 (Marostegui) Another tip, once it is replaced (if it is) try to monitor its temperature once it boots up - in the last few weeks during some server moves we n...
[13:09:05] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294419 (Ottomata) +1
[13:33:40] DBA, Operations, Performance-Team, Traffic, Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3294468 (jcrespo) While contention is bad in general- it is the opposite of lag- more contention would create less la...
[14:02:36] I am changing db1068.eqiad.wmnet expire_logs_days to 30
[14:02:44] great
[14:04:26] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3294497 (Marostegui)
[14:46:42] DBA, Operations, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3294553 (jcrespo)
[14:46:44] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3294551 (jcrespo) Open>Resolved oldimage checked. To the best of my ability, no more changes are left on core tables- noting that I have not checked every server, specially those that have filters, lag...
[14:47:54] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3294555 (Marostegui) \o/
[14:49:40] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3294557 (jcrespo) I think this means we can decommission safely at least db1050 and db1023.
[14:50:49] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3294560 (Marostegui) I would leave db1050 for now (to at least have one old master just in case). db1023 can go
[14:51:45] I was stating the can
[14:51:50] not the will
[14:52:10] :)
[15:18:46] DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3294650 (jcrespo) a:jcrespo
[17:07:06] DBA, Operations, Performance-Team, Traffic, Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3295071 (aaron) p:Triage>Normal
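
Note on the lag figure quoted at 12:56 and 12:57 (`Seconds_Behind_Master: 629831`, "or 1.041 weeks"): the conversion can be checked with a minimal Python sketch. The helper name `seconds_to_weeks` is an assumption made for illustration and does not come from the log or from any tool mentioned in it.

```python
# Minimal sketch: convert a replication lag in seconds to weeks.
# `seconds_to_weeks` is a hypothetical helper, not part of MySQL or any tool in the log.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800 seconds


def seconds_to_weeks(seconds: int) -> float:
    """Return the given number of seconds expressed in weeks."""
    return seconds / SECONDS_PER_WEEK


lag = 629831  # Seconds_Behind_Master value quoted in the log
print(f"{seconds_to_weeks(lag):.3f} weeks")  # prints "1.041 weeks"
```

This matches the "1 week old backup" mentioned at 11:07: a replica restored from a backup of that age would be expected to start roughly one week behind its master.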