[05:55:09] DBA, Operations, ops-eqiad: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3299396 (Marostegui)
[05:55:44] DBA, Operations, ops-eqiad: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3298788 (Marostegui) @Cmjohnson once you get the replacement BBU from HP, let us know as we need to depool this host before shutting it down. Thanks!
[06:21:14] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3299407 (Marostegui)
[06:23:21] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3299408 (Marostegui) (As I was expecting) I have seen some issues when importing compressed tablespaces. I am troubleshooting it.
[06:28:02] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3299412 (Marostegui) db1084 is now done: ``` root@neodymium:/home/marostegui# for i in `cat s4_tables`; do echo $i; mysql --skip-ssl -hdb1084 c...
[06:28:07] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206#3299413 (Marostegui)
[06:43:27] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3299426 (Marostegui) >>! In T166278#3299409, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AVxYBnvKQMK9DA-...
[06:43:43] Blocked-on-schema-change, DBA, Patch-For-Review: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278#3299427 (Marostegui) >>!
In T166278#3299425, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AVxYFnADQMK9DA-...
[07:58:21] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3299482 (Marostegui) While troubleshooting db1095 gave a nasty error and I do not trust it much now I have read a few posts that compression + transporta...
[08:13:43] jynus,marostegui - the two analytics hosts with WT seem to have a faulty BBU, thanks for the alarms!
[08:14:15] we are close to the end of warranty so it is really nice to review these settings now :)
[08:14:26] you should enable the alarms
[08:14:38] we only enabled them for dbs
[08:14:54] jynus: will do right after replacing the faulty bbus, I am monitoring all the analytics hosts via cumin
[08:53:38] another question for you guys - I am wondering what would be the best way to deploy the analytics eventlogging maintenance scripts (https://gerrit.wikimedia.org/r/#/c/355604/)
[08:54:13] my team would prefer to create/maintain a debian package
[08:54:22] sorry, *not create
[08:54:29] when I asked
[08:54:42] my answer was debian package or scap
[08:55:11] I went for debian package because it seemed wrong to deploy to a database server
[08:55:28] yeah I got a similar feeling, this is why I asked
[08:55:38] +1 to a .deb
[08:56:22] the alternative, since it is not a big codebase, is to squeeze it in a single python .py file (test classes and code) and deploy it via puppet
[08:57:27] we do something similar in https://github.com/wikimedia/puppet/blob/production/modules/webperf/files/navtiming.py#L419
[08:57:28] but where?
I'd be against deploying it on a mysql host, it's a job for a maintenance host IMHO
[08:58:26] well the idea was to run locally on the slaves first, to interleave it with eventlogging_sync.sh
[08:58:47] atm the auth scheme in the script assumes the localhost /tmp unix socket
[08:59:37] "atm the auth scheme in the script assumes the localhost /tmp unix socket"
[08:59:41] that is really wrong
[08:59:46] and false already on some hosts
[09:00:23] that is why I created WMFMariaDB, to isolate that decision
[09:00:36] well it should be enabled on the analytics slaves and IIRC you said that it would have been the preferred way to connect, this is why I did it.. but I can add ssl support too, no problem
[09:00:45] no
[09:00:49] the problem is the /tmp
[09:01:04] sure, and the eventlogging code is meant to support a migration to WMFMariaDB when it is ready
[09:01:07] that is not the default
[09:01:22] and it is only legacy, so it will change at any time
[09:02:07] in fact there is a planned rolling restart to change that and the mysql group GID
[09:03:05] what is the preferred way to connect in your opinion? I always ask since I'd like to make sure to follow your best practices (your == whoever maintains the DBs)
[09:03:25] check the my.cnf and use that, assuming there is only 1 instance
[09:03:54] all right I'll review that part of the code
[09:03:57] that is what wmf does or it will do when it detects connections to localhost
[09:05:51] alternatively, read the puppet config for socket and create a config file based on it
[09:07:55] volans: "I'd be against deploying it on a mysql host" - let's chat about this please :D
[09:09:22] sure
[09:11:16] we usually run maintenance from a maintenance host for various reasons (that the DBAs can explain much better than me), my simple point being that we have a dynamic environment where hosts change place, role, etc... also, why add the load of the script itself to the mysql host when it could be elsewhere.
But I'm open for discussion if there are advantages to have it local ;)
[09:13:10] the only one was to interleave it with eventlogging_sync.sh (that runs locally on each analytics slave), but then I discovered that it basically runs constantly so not really a good motivation
[09:15:47] (need to go to the dentist, will restart the conversation after lunch sorry, thanks for the inputs!)
[09:25:15] DBA: Point labsdb1001 and labsdb1003 to db1095 - https://phabricator.wikimedia.org/T166546#3299593 (Marostegui)
[09:25:26] DBA: Point labsdb1001 and labsdb1003 to db1095 - https://phabricator.wikimedia.org/T166546#3299605 (Marostegui) p:Triage>Normal
[09:34:43] should I ack dbstore1002 s5 replication?
[09:34:46] s4
[09:35:03] oh, it's now gone
[09:35:31] jynus: marostegui is gone (netsplit)
[09:35:54] no, I mean the alarm is gone
[09:38:43] manuel took it with him :-P
[10:08:39] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3299698 (Marostegui) I have been able to reproduce the issue on my local environment with compressed tables. It always fails. Tried different things with...
[10:35:22] DBA: db2055 mariadb weirdness -lagging after reboot and upgrade - https://phabricator.wikimedia.org/T166326#3299786 (jcrespo) Open>Resolved I am now convinced this was a combination of pt-table-checksum + cold buffer + 2nd tier slave making it lag. Could not reproduce on kernel and mariadb upgrade....
[11:47:45] DBA: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342#3299876 (Marostegui) I have renamed updates table on `enwiki` ``` db1089: rename table updates to T139342_updates; ``` Will leave it like that for a week before starting to drop it (after taking a backup).
[12:39:02] DBA, Epic: Meta ticket: The future of multi source replication slaves vs multi instance ones.
- https://phabricator.wikimedia.org/T159423#3299951 (Marostegui) Another issue faced when trying to import compressed tablespaces into a host (which is only needed for multi source hosts really, as the other one...
[12:46:14] (resuming prev conversation) - a solution for volans' comment about running eventlogging scripts from a maintenance host (avoiding also to deploy on any db) is to use neodymium
[12:49:19] or terbium?
[12:49:59] whatever you guys prefer, terbium looks good
[12:50:09] ack on rather using terbium
[12:50:33] moritzm: would it be acceptable to scap deploy the repo there and run a cron?
[12:53:54] it runs plenty of cron jobs for maint jobs already
[12:54:06] I'd prefer it run on the local host
[12:54:12] not sure about how scap is managed, best to ask Filippo or Antoine I'd say
[12:54:22] the maintenance he is proposing is per host, not per service
[12:54:44] if the host is down, the maintenance doesn't have to be done
[12:54:58] and each host has its own state
[12:56:44] that is a very good point too
[12:57:04] if it was maintenance for the application, it should run from a single host
[12:58:21] it will also create a huge overhead for a host that traditionally was reserved for mediawiki maintenance
[13:16:38] in the localhost case, single puppet file then? To avoid a scap deploy to db1047 and dbstore1002 ?
[13:24:30] "Installation step failed"
[13:24:50] An installation step failed. You can try to run the failing item
[14:12:59] DBA, Labs, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3300140 (Marostegui) db1095 has been restored from backups and I have started replication there and on labsdb hosts and db1070 let them catchup. Probably...
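(Editor's sketch, going back to the 09:03 suggestion to "check the my.cnf and use that" rather than hardcoding the /tmp unix socket. This is only an illustration under stated assumptions: a single instance per host, a plain my.cnf with no include handling, and a hypothetical helper name `find_mysql_socket`. The sample paths are made up, and this is not how WMFMariaDB does it.)

```python
from typing import Optional, Sequence


def find_mysql_socket(cnf_text: str,
                      sections: Sequence[str] = ("client", "mysqld")) -> Optional[str]:
    """Return the `socket` value from the first listed section that defines one.

    Hypothetical helper: a deliberately small my.cnf reader that ignores
    comments and !include/!includedir directives.
    """
    found = {}
    current = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments (# or ;) and include directives (!).
        if not line or line[0] in "#;!":
            continue
        # Section header such as [client] or [mysqld].
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip().lower()
            continue
        if current is None or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Keep only the first socket setting seen in each section.
        if key.strip().lower() == "socket" and current not in found:
            found[current] = value.strip().strip("\"'")
    for name in sections:
        if name in found:
            return found[name]
    return None


# Assumed sample config; the paths are illustrative only.
SAMPLE = """
[client]
port = 3306
socket = /run/mysqld/mysqld.sock

[mysqld]
socket = /run/mysqld/mysqld.sock
datadir = /srv/sqldata
"""

if __name__ == "__main__":
    print(find_mysql_socket(SAMPLE))
```

(The point of the sketch is only that the script asks the host's config where the socket lives, so a later change of the socket path does not require touching the script.)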
[14:14:51] DBA, Community-Tech, MediaWiki-User-management: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3300167 (MusikAnimal) >>! In T156318#3297550, @jcrespo wrote: > How do you plan to limit the ranges in both cases to avoid...
[14:26:23] I was checking out mydumper and I was wondering how did you overcome the SSL issue with it?
[14:44:45] DBA, Community-Tech, MediaWiki-User-management: Do test queries for range contributions to gauge performance of using different tables - https://phabricator.wikimedia.org/T156318#3300291 (jcrespo) > Right now, as far as I can tell, the code does not generate a query that forces an index No, of cours...
[14:44:47] DBA, Operations, ops-eqiad: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3300292 (Cmjohnson) a support case has been opened with HPE Your case was successfully submitted. Please note your Case ID: 5320105305 for future reference.
[14:45:09] DBA, Operations, ops-eqiad: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3300294 (Marostegui) Thanks!
[14:46:18] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300302 (Cmjohnson) @ottomata: is there a better time this week or do you push it out to next week? Also, whatever we change this out with will probably not last l...
[14:47:53] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3300306 (Cmjohnson) @volans and @Marostegui I can do this as soon as you give me the word go but keep in mind this is only going to be temporary. the bbu's...
[14:53:39] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3300333 (Marostegui) Thanks Chris, I will have this ready for tomorrow so we can do it tomorrow if that works for you? We are aware that this will happen ag...
[14:54:48] DBA, Operations, Phabricator, ops-eqiad, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3300351 (Cmjohnson) Great! ping when I can do the swap.
[15:52:02] anyone know what happened to db2048 ?
[15:53:36] musikanimal: from SAL: • 10:18 jynus: stopping and backing up db2048 in preparation for reimage
[15:54:16] oh ok, so it is not gone forever! Thank you :)
[15:54:27] no no :)
[15:55:25] I don't know exactly what happens with a reimage, but the thing to note here is there are two tables I was using for testing on that db that are not anywhere else
[15:55:33] I'm guessing they'll be restored as well?
[15:55:52] yes, what we normally do is: tar.gz the data directory, move it somewhere else, reimage and then restore it
[15:55:57] so it will be in the same state as it was
[15:56:04] cool :)
[15:56:14] yes, it is going back in 1h
[15:56:18] Coming
[15:56:23] as it was before
[15:56:47] there you are jcrespo! I was wondering what your nick was
[15:58:08] I shall wait patiently for db2048, then try to work on https://phabricator.wikimedia.org/T156318#3300291
[16:03:34] the upgrade is actually a good thing, because it will be more similar to the other production hosts
[16:04:13] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300667 (elukey) @Cmjohnson we just need to alert people a couple of days in advance, nothing more. Do you have a preferred date/time?
[16:05:50] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300674 (Cmjohnson) @elukey Let's do Thursday 1600UTC
[16:08:28] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3300692 (Ottomata) +1
[16:14:26] DBA, Analytics-Kanban, Operations, ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3295950 (Cmjohnson) The disk was indeed bad...so it's been replaced. I don't know if I have enough bbu's to go around.. I am swapping them out of decom'd servers.
[16:14:56] DBA, MediaWiki-extensions-SecurePoll, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300747 (matmarex) It looks like a schema change to add that field was never applied? T158906 claims that it was, th...
[16:15:12] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300753 (matmarex)
[16:16:33] DBA, Analytics-Kanban, Operations, ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3300760 (Marostegui) If it helps there are three more servers totally ready for you to decomm them: T166486 T163778 T164702
[16:17:22] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300764 (matmarex)
[16:17:59] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300419 (jcrespo) That request never reached DBAs, someone closed it before we could even be aware o...
[16:23:33] DBA, Analytics-Kanban, Operations, ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3300786 (elukey) ``` elukey@db1046:~$ sudo megacli -pdrbld -showprog -physdrv\[32:3\] -aALL Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 35% in 9 Minutes. Exit Code:...
[16:50:17] jynus: I'm sorry about T166570#3300858, the truth is that usually half of the time it fails because of salt and the other half because of ipmi... but feel free to ping me whenever this happens so I can see if we also have other common failures
[16:50:17] T166570: Do something to better handle run cleanups/failures - https://phabricator.wikimedia.org/T166570
[17:11:16] musikanimal: db2048 is back, now on a more recent os and mysql version
[17:11:25] awesome, thank you!
[22:24:13] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3302446 (demon) >>! In T166568#3300768, @jcrespo wrote: > That request never reached DBAs, someone c...
[22:45:15] DBA, MediaWiki-extensions-SecurePoll, Operations, Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3302513 (Reedy) el_owner has been around for 8 years, no retrospective patch was added till the bug...