[02:13:50] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4099403 (Anomie) >>! In T187089#4097279, @Anomie wrote: > * https://gerrit.wikimedia.org/r/c/417039/ exi...
[05:19:19] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4099491 (Marostegui) >>! In T191193#4097801, @Papaul wrote: > switch port information when ready to move db2039. This is just a note for when we are ready to do the move. > >...
[05:25:38] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4089198...
[05:26:55] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4094572 (Marostegui) In the last 12h this has caused almost 1.5M errors: https://logs...
[05:27:20] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4099499...
[05:36:35] DBA, MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), Patch-For-Review, Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4099526 (Marostegui) >>! In T187089#4099403, @Anomie wrote: >>>! In T187089#4097279, @Anomie wrote: >> *...
[05:43:12] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4099534...
[05:44:19] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4099537...
[05:44:41] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4099539...
[05:49:02] DBA: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4099540 (Marostegui)
[05:49:14] DBA: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4099552 (Marostegui) p:Triage>Normal
[05:51:25] DBA: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4099554 (Marostegui) For s1 I'd suggest db2055. It is in a different rack and row.
[05:57:15] DBA: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275#4099560 (Marostegui) In order not to have db2033 as a candidate master (and as unique slave for codfw x1) as it has a faulty BBU T184888)) I would like to do...
[08:31:31] good morning, and thanks for all the follow ups on everything ;)
[08:31:55] no, thank you for taking care of everything while we were out :)
[08:33:36] not everything, ofc jaime saw labsdb1009 first, I had looked at tendril shortly before it crashed, and then saw his task :D
[08:34:10] I guess he saw the dbproxy alert
[08:36:07] when you have time, not urgent at all, see my proposal in T191020 for 'fixing' it, unless you already have plans to run the checksum script there too
[08:36:08] T191020: labsdb1004: s51541_sulwatcher.logging is out of sync - https://phabricator.wikimedia.org/T191020
[08:36:33] Ah yes! I will take a look later :)
[08:36:34] Thanks
[08:38:00] no, thank you! LMK if I can be of any help
[09:10:23] DBA, DC-Ops, Operations, media-storage, ops-codfw: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4100007 (Marostegui) Open>Resolved I think we can consider this resolved. Thanks guys!
[11:20:48] marostegui: I created T191282
[11:20:48] T191282: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending - https://phabricator.wikimedia.org/T191282
[11:21:02] just FYI regarding logging errors monitoring
[11:22:02] Thanks - I just subscribed
[11:22:12] volans: we have a "reimport from master" script
[11:22:15] 2nd feb?
[11:22:22] mmm
[11:22:34] jynus: \o/ wow
[11:22:35] let me see, maybe I gave you the wrong one
[11:22:54] jynus: I mean this sentence: Probably related to train deployment: starting on 28 March, but getting higher on the 2 Feb
[11:23:09] oh, that is wrong
[11:23:12] 2 april
[11:23:21] :)
[11:24:00] I created / monitored like 3 ongoing issues with errors related to load balancer/queries
[11:24:04] maybe more
[11:24:16] I mark them with https://phabricator.wikimedia.org/tag/wikimedia-log-errors/
[13:08:17] volans: operations/software / dbtools/reimport_from_master.sh
[13:08:48] not precisely a masterpiece, but it is nice if you are panicking and replication is down, especially on non-core servers
[13:09:10] ack, good to know
[13:16:00] marostegui: sulwatcher bots down, please verify that indeed no table writes occur and you can start the reimport process
[13:16:11] ok! will do now
[13:16:12] thanks
[13:17:22] the logging table has not had any writes since:
[13:17:37] 20180403122420
[13:18:54] it just logs if it matches a regex from the s51541_sulwatcher.regex table
[13:20:40] let me know when you're done & thanks for caring
[13:22:22] Hauskatze: This is now done
[13:22:28] Hauskatze: You can start them again
[13:24:24] bots up again
[13:24:49] can you try a test write or something?
[13:29:03] that'd require me to create an account matching an abusive regex :)
[13:29:27] but I can add a regex to trigger some
[13:29:40] Yeah, don't know how hard it is. I think it is fine anyways, just wanted to make sure. I am going to close the task and if something arises we can always open it :)
[13:30:33] done marostegui
[13:30:52] I see them
[13:30:53] all good :)
[13:30:57] thanks!
[13:31:09] closing then!
[13:31:09] thanks for taking care of this marostegui!
[13:31:26] it was you who fixed it in the right moment!
:)
[13:36:14] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4100523 (Anomie) >>! In T191116#4099497, @Marostegui wrote: > In the last 12h this ha...
[13:40:16] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4100529 (Marostegui) Indeed - we rely a lot on that dashboard for our daily work, sp...
[13:48:16] marostegui: I just did MariaDB [s51541_sulwatcher]> UPDATE regex set r_active = 0 where r_id = 1; -- is it okay in labsdb1004 ?
[13:48:59] Hauskatze: don't worry about labsdb1004, this is our responsibility, and you should not be able to break it
[13:49:09] one thing that you can do to help, however
[13:49:20] is to make sure you use InnoDB and not MyISAM
[13:49:26] Hauskatze: that is the slave actually
[13:49:37] don't know what either of those is, jynus
[13:49:39] Hauskatze: You should've done it on the master
[13:49:43] (labsdb1005)
[13:49:52] run SHOW CREATE TABLE yourtable;
[13:50:00] marostegui: sorry, I meant to check the slave
[13:50:09] at the end it must say: ENGINE=Something
[13:50:22] if it doesn't say ENGINE=InnoDB
[13:50:25] Hauskatze: any write, do it on labsdb1005, as it is the master and will get it replicated to labsdb1004
[13:50:35] you can do ALTER TABLE yourtable ENGINE=InnoDB;
[13:50:46] marostegui: not a problem as I don't think I can do writes on 1004 directly
[13:50:47] that will protect your tables against corruption in case of a crash
[13:51:04] yeah, that is a passive slave
[13:51:13] it is only in case the master goes down
[13:51:36] I think I can update some tables jynus the way you mention
[13:51:57] but, should I use 'engine=InnoDB' on UPDATE / DELETE as well?
[13:53:22] DBA, MediaWiki-Database, MediaWiki-Special-pages, Security, Wikimedia-log-errors: Wikimedia\Rdbms\Database::tableName: use of subqueries is not supported this way. - https://phabricator.wikimedia.org/T191116#4100563 (Marostegui) I can already see the decrease: https://logstash.wikimedia.org/g...
[13:54:45] Hauskatze: it is a one time change
[13:54:52] if it is not already innodb
[13:54:59] only on CREATE TABLE would be enough
[13:55:06] or alter, if the tables exist
[13:55:55] so...: ALTER TABLE setup ENGINE='InnoDB'
[13:55:59] ;
[13:56:44] MariaDB [s51541_sulwatcher]> ALTER TABLE setup ENGINE=InnoDB;
[13:56:44] Query OK, 0 rows affected (1.36 sec)
[13:56:44] Records: 0 Duplicates: 0 Warnings: 0
[13:57:06] I see all using innodb now
[13:57:16] so everything will work exactly the same
[13:57:22] no need to add anything anymore
[13:57:31] you now enjoy innodb :-)
[13:57:40] if the server crashes, no more replication issues
[13:58:30] https://phabricator.wikimedia.org/P6928
[13:58:46] and depending what you do, faster access
[13:58:56] thank you :-)
[13:59:46] default_storage_engine was already InnoDB, so maybe the issue was only with older tables
[14:01:11] I just did the 'setup' table, are the others using innodb too?
[14:01:24] all of them are, see link above
[14:01:52] chachi
[14:02:13] https://www.urbandictionary.com/define.php?term=chachi
[14:02:48] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4089198...
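Editor's note: the engine check and one-time conversion discussed above can be sketched roughly as follows. This is only an illustration, not the procedure the DBAs actually ran. It assumes you have already fetched (table name, engine) pairs yourself, for example from the standard `information_schema.TABLES` view; the helper name `alter_statements` and the engine values in the sample rows are invented for the example.

```python
def alter_statements(tables, target="InnoDB"):
    """Return one ALTER TABLE statement per table whose engine differs from `target`.

    `tables` is an iterable of (table_name, engine) pairs, e.g. the rows of:
      SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = 's51541_sulwatcher';
    """
    statements = []
    for name, engine in tables:
        if engine is None:
            # Views report a NULL engine; there is nothing to convert.
            continue
        if engine.lower() != target.lower():
            statements.append(f"ALTER TABLE `{name}` ENGINE={target};")
    return statements


# Hypothetical rows: the table names appear in the log, the engines are made up.
rows = [("setup", "MyISAM"), ("regex", "InnoDB"), ("logging", "Aria")]
for stmt in alter_statements(rows):
    print(stmt)
```

As said in the channel, this is a one-time change per table: later UPDATE / DELETE statements need no engine clause.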
[14:05:31] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4089198...
[14:07:07] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, Wikimedia-log-errors: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4100608...
[14:07:33] jynus: all s51541_.* tables we use migrated to use InnoDB now
[14:07:49] cool
[14:07:59] that will minimize problems in the future
[14:08:16] although you know- it is impossible to guarantee problem-free forever :-)
[14:09:30] thankfully we have a good team of dbas
[14:09:57] ha ha, we have a really bad one, but it is what it is :-)
[14:10:10] except for manuel, he is the cool one
[14:11:06] We have a DBA team formed by Jaime and then manuel who constantly breaks stuff and gets the DBA team to fix it :)
[14:11:39] lovely
[14:13:55] I think --I'll propose it first-- we may need to DROP TABLE s51541_sulwatcher.regex as it contains a lot of outdated stuff. OTOH I can stop the bot listening to them UPDATEing the regexes to r_state = 0
[14:14:42] TRUNCATE TABLE table_name; may be better though, as I'd not have to create the table again afterwards
[14:15:02] it is a tiny table (255 rows) so probably truncating + starting from 0 is cleaner
[14:15:10] Assuming you can afford losing those rows
[14:16:00] I think we can, but it's a shared project so I have to reach consensus with the others --usually they don't care, still... --
[14:58:35] marostegui: remember I did an analysis of comment deduplication savings?
[14:58:42] do you remember the name of the task
[14:58:53] ufff
[14:59:11] let me try some searches
[14:59:31] https://phabricator.wikimedia.org/T162138
[14:59:33] found it
[14:59:51] haha I was looking for exactly, because I remember that word
[15:43:47] DBA, MediaWiki-Database, Patch-For-Review, PostgreSQL, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#4100988 (Krinkle)
[15:44:27] DBA, MediaWiki-Database, Patch-For-Review, PostgreSQL, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3139502 (Krinkle)
[16:19:09] DBA, MediaWiki-Platform-Team, Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4101112 (Anomie)
[16:23:27] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, Wikidata-Ministry-Of-Magic, Wikidata-Sprint-2018-02-28: Investigate optimzing wb_terms - https://phabricator.wikimedia.org/T188279#4101143 (jcrespo) Adding #DBA to support the setup of this test db.
[18:33:51] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4101652 (Cmjohnson)
[18:33:54] DBA, Operations, hardware-requests, Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4101653 (Cmjohnson)
[18:33:57] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4101651 (Cmjohnson) Open>Resolved
[18:36:42] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4101663 (Cmjohnson)
[18:36:56] DBA, Operations, Patch-For-Review: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655#4101668 (Cmjohnson)
[18:36:59] DBA, Operations, hardware-requests, Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4101669 (Cmjohnson)
[18:37:01] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4065400 (Cmjohnson) Open>Resolved
[18:40:32] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4101682 (Cmjohnson) - ge-2/0/0 { - description db1001; - disable; - } - ge-2/0/10 { - description db1011; - disable; - } - ge-2/0/...
[18:40:52] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4101687 (Cmjohnson)
[18:40:54] DBA, Operations, hardware-requests, Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4101688 (Cmjohnson)
[18:40:57] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4101685 (Cmjohnson) Open>Resolved - ge-2/0/0 { - description db1001; - disable; - } - ge-2/0/10 { - description db1011; - disa...
[18:41:25] DBA, Operations, hardware-requests, Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3235306 (Cmjohnson)
[18:41:28] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1030 - https://phabricator.wikimedia.org/T184397#4101689 (Cmjohnson) Open>Resolved