[02:21:48] DBA, Analytics, Analytics-Kanban, Growth-Team, and 2 others: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (Nuria) Open>Resolved
[05:11:06] DBA, Operations, ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (Marostegui) Open>Resolved So this is what I meant and why I re-opened the task: ``` root@db2051:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F5E...
[05:19:06] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) s1 codfw: [] dbstore2002 [] db2094 [] db2092 [] db2088 [] db2085 [] db2072 [] db2071 [] db2070 [] db2062 [] db2055 [] db2048
[05:47:06] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[05:50:33] I will stop db1087 in sync with db1092 and perform the loads and fixes
[05:50:41] great
[05:50:52] db1124 is downtimed I believe, no?
[05:53:01] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) s2 codfw: [] dbstore2002 [] dbstore2001 [] db2095 [] db2091 [] db2088 [] db2063 [] db2056 [] db2049 [] db2041 [] db2035
[05:53:19] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[05:53:33] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[05:53:48] I don't think any of the 3 are
[05:54:10] mmm it must be because replication isn't working as per tendril
[05:54:12] let me check
[05:54:48] Ah, only replication lag is downtimed (by you!)
[06:08:46] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[06:12:43] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) s7 codfw: [] dbstore2001 [] db2095 [] db2087 [] db2086 [] db2077 [] db2068 [] db2061 [] db2054 [] db2047 [] db2040
[06:20:23] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[06:27:58] So on Monday let's do all the codfw -> eqiad disconnection, enablement of GTID and disconnecting s5 eqiad master from s3 codfw master then?
[06:30:10] preferably, with no ongoing schema change :-)
[06:30:20] Yep, I have stopped it :-)
[06:30:28] Because it is too much of a mess :)
[06:30:39] To deal with such a complex s3 topology now
[06:30:47] I prefer to clean up that topology first :)
[07:29:33] morning gentlemen
[07:42:52] I'm building & deploying the logrotating wmf-pt-kill package
[09:04:32] The package is in place, I'll upgrade it on the labsdb hosts
[09:04:52] (doesn't affect anything really, just adds the /etc/logrotate.d/wmf-pt-kill file, so it is harmless)
[09:05:10] I guess you need to restart the service?
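(For context on the /etc/logrotate.d/wmf-pt-kill file mentioned just above: a logrotate drop-in of this kind typically looks roughly like the sketch below. The log path, retention and the copytruncate choice are illustrative assumptions, not the contents of the actual packaged file.)

```
# Hypothetical sketch of a logrotate drop-in for wmf-pt-kill; the path and
# retention settings are assumptions, not the packaged file.
/var/log/wmf-pt-kill/wmf-pt-kill.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    # copytruncate lets pt-kill keep writing to the same file descriptor,
    # so no restart or signal is needed after rotation.
    copytruncate
}
```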
[09:05:29] Monitor the log to make sure nothing weird shows up (ie: killing queries that should not be killed)
[09:08:50] yes, and I'll do it
[09:15:27] upgraded the wmf-pt-kill packages, and now monitoring the logs. So far everything is good, there's no mass killing of queries
[09:19:57] everything seems good, however I'll check in the next few days that after the first rotation the service starts again without any problem (it was tested, and I am pretty sure about it, but I'll check it anyway)
[09:21:40] Sure, make sure no queries are being killed after 0 seconds
[09:21:48] (that was the bug with prepared statements)
[09:36:10] DBA, Patch-For-Review, User-Banyek: Solve logrotating on wmf-pt-kill - https://phabricator.wikimedia.org/T206521 (Banyek) Open>Resolved
[09:39:53] I'll keep the logs open today, and check them several times over the weekend
[09:40:12] Normally weird queries being killed after 0 seconds show up relatively often
[09:40:26] (when the bug is present)
[09:43:28] then probably we are ok
[09:44:23] I guess so, you didn't really touch the killer, but just keep an eye on it for the next hours so we are fully sure it is ok
[09:45:45] 👌
[09:46:18] I'll now prepare the table compressions on dbstore2002 (check which ones could be done, prep the script, etc.) and I'll start it after I return from lunch
[09:46:36] great
[09:52:09] banyek: This is an important task https://phabricator.wikimedia.org/T180918 that is worth reading (hopefully it is kinda solved), but take your time and read it, it will put things in perspective about why lag and pooling/repooling are so important :)
[09:53:03] the title itself sounds serious enough - ok, I'll read it
[09:53:27] I think we have mentioned it to you, but I think we (or at least I) didn't send the task itself
[09:58:37] things on labs are importing
[09:58:47] \o/
[09:58:54] which tables in the end?
[09:58:56] going to do some stuff while I wait for things to finish
[09:59:03] I am doing pagelinks
[09:59:08] but takes a lot of time
[09:59:13] no triggers on that one
[09:59:16] maybe I will reconsider the reimport
[09:59:31] I will see depending on how long it takes
[10:00:09] -rw-rw---- 1 mysql mysql 136G Oct 19 10:00 pagelinks.ibd
[10:00:21] fairly big indeed!
[10:13:51] I'll go and have some food now, and then start the compression on dbstore2002.s4
[10:13:54] see you in the afternoon
[10:14:16] bye
[10:29:48] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) Update: pagelinks is now being re-imported on labs (this tab...
[11:38:07] I'm downtiming the s4 replication icinga check on dbstore2002 as I'm starting to compress the s4 tables
[11:46:17] DBA, User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Banyek) Compressing s4 tables with the following command: ``` mysql -BN -S /run/mysqld/mysqld.s4.sock -e "SELECT table_schema, table_name FROM information_Schema.tables WHERE engine='INNODB' and ro...
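(The bot truncates that comment, so the full command is not visible here. As a rough, hedged illustration of what compressing one of the selected tables comes down to, the statement below uses an example table name, not anything taken from the actual dbstore2002 worklist:)

```
-- Illustrative only: convert one InnoDB table to the COMPRESSED row format.
-- The schema/table name is an example, not the real list from the query above.
-- (On older MariaDB versions this needs innodb_file_per_table and the
-- Barracuda file format.)
ALTER TABLE commonswiki.templatelinks ROW_FORMAT=COMPRESSED;
```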
[12:16:00] abusefilter has the trigger afl_id = '', but there is no such column on the table
[12:16:19] oh, my bad
[12:16:26] it is about abuse_filter_log
[12:16:28] ignore me
[13:25:41] looking at the wmf-pt-kill logs, I propose to introduce the phrase/term QUERYZILLA
[13:28:05] I just saw one which was 537 lines long
[13:39:10] when you get a second, for your eyes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/467264
[13:39:48] let me check it!
[13:47:59] DBA, User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Marostegui) >>! In T204930#4680094, @Banyek wrote: > Compressing s4 tables with the following command: > > > ``` > mysql -BN -S /run/mysqld/mysqld.s4.sock -e "SELECT table_schema, table_name FROM i...
[13:53:42] I checked back on the dbstore2002 s4 compression b/c of marostegui's comment, and I found this:
[13:57:35] `ERROR 1062 (23000) at line 1: Duplicate entry '1149343098-' for key 'ct_rc_id'`
[13:57:51] that is change_tag I guess?
[13:57:58] compressing commonswiki.change_tag
[13:57:59] yes
[13:58:04] is replication stopped?
[13:58:11] it wasn't :/
[13:58:17] that's why I went there
[13:58:29] so, change_tag and tag_summary normally have that issue, so my suggestion is:
[13:58:38] 1) check that there is not a duplicate entry for that key
[13:58:52] 2) normally stop replication before altering those tables (or all the dataset)
[13:59:21] So check that the key isn't duplicated (I doubt it is), stop replication and try the alter again
[13:59:30] ok, thanks!
[13:59:59] banyek: thanks! appreciated
[14:06:17] DBA, Patch-For-Review: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888 (Marostegui) And this got fixed by itself: ``` root@db2033:~# hpssacli controller all show status Smart Array P420i in Slot 0 (Embedded) Controller Status: OK...
[14:15:41] I checked, there's no duplicate entry. I disconnected replication (IO and SQL threads are stopped) and resumed the compression with log_bin=0
[14:15:57] sounds good
[14:16:34] with disconnect replication you mean stop slave;
[14:16:35] right?
[14:16:40] not actually reset slave
[14:16:44] ?
[14:19:20] DBA, User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Banyek) The modified command is: ``` mysql -BN -S /run/mysqld/mysqld.s4.sock -e "SELECT table_schema, table_name FROM information_Schema.tables WHERE engine='INNODB' and row_format <> 'COMPRESSED...
[14:20:22] sure
[14:20:36] great
[14:21:24] 'reset slave' would be 'deconfiguring the slave', but actually I think it's better to use the 'stopped slave' and 'reset slave' phrases to avoid confusion
[14:21:33] (But I want to keep QUERYZILLA)
[14:21:44] yeah, we normally use disconnect for when you are actually doing a reset slave all;
[14:21:56] we tend to use stop replication for a stop slave;
[14:23:17] 👍 🦖
[14:29:17] is that a dinosaur?
[14:32:26] it is
[14:35:07] t-rex, because there is no :godzilla: emoji :(
[14:35:09] shame
[14:43:20] DBA, Gerrit, Operations, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) Gerrit is now 100% NoteDB from 2.16 see https://twitter.com/GerritReview/stat...
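(Tying the change_tag retry discussed above together: the sequence below is a hedged sketch of how it could be done on a MariaDB replica. The exact statements, the assumed column list of the ct_rc_id unique key, and the use of the session-level sql_log_bin switch, which is what "log_bin=0" usually refers to here, are illustrations, not a transcript of what was actually run on dbstore2002.)

```
-- Hedged sketch of the retry sequence discussed above; not a verbatim record.
-- 1) Verify the unique key really has no duplicates (assuming the ct_rc_id
--    key covers (ct_rc_id, ct_tag); confirm with SHOW CREATE TABLE first).
SELECT ct_rc_id, ct_tag, COUNT(*) AS n
FROM commonswiki.change_tag
GROUP BY ct_rc_id, ct_tag
HAVING n > 1;

-- 2) Stop replication on this instance (STOP SLAVE, not RESET SLAVE ALL).
STOP SLAVE;

-- 3) Redo the alter without writing it to the binary log for this session.
SET SESSION sql_log_bin = 0;
ALTER TABLE commonswiki.change_tag ROW_FORMAT=COMPRESSED;
SET SESSION sql_log_bin = 1;

-- 4) Resume replication.
START SLAVE;
```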
[14:46:02] DBA, Gerrit, Operations, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox)
[15:01:43] I added my stuff from this week to the DBA-Sync etherpad, lines 85-91, in case of the SRE meeting
[16:34:22] I think I'll call it a day
[16:34:26] bye