[02:21:48] DBA, Analytics, Analytics-Kanban, Growth-Team, and 2 others: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623 (Nuria) Open>Resolved
[05:11:06] DBA, Operations, ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (Marostegui) Open>Resolved So this is what I meant and why I re-opened the task: ``` root@db2051:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F5E...
[05:19:06] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) s1 codfw: [] dbstore2002 [] db2094 [] db2092 [] db2088 [] db2085 [] db2072 [] db2071 [] db2070 [] db2062 [] db2055 [] db2048
[05:47:06] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[05:50:33] I will stop db1087 in sync with db1092 and perform the loads and fixes
[05:50:41] great
[05:50:52] db1124 is downtimed I believe, no?
[05:53:01] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) s2 codfw: [] dbstore2002 [] dbstore2001 [] db2095 [] db2091 [] db2088 [] db2063 [] db2056 [] db2049 [] db2041 [] db2035
[05:53:19] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[05:53:33] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[05:53:48] I don't think any of the 3 are
[05:54:10] mmm it must be because replication isn't working as per tendril
[05:54:12] let me check
[05:54:48] Ah, only replication lag is downtimed (by you!)
[06:08:46] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[06:12:43] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui) s7 codfw: [] dbstore2001 [] db2095 [] db2087 [] db2086 [] db2077 [] db2068 [] db2061 [] db2054 [] db2047 [] db2040
[06:20:23] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (Marostegui)
[06:27:58] So on Monday let's do all the codfw -> eqiad disconnection, enablement of GTID and disconnecting s5 eqiad master from s3 codfw master then?
[06:30:10] preferably, with no ongoing schema change :-)
[06:30:20] Yep, I have stopped it :-)
[06:30:28] Because it is too much of a mess :)
[06:30:39] To deal with such a complex s3 topology now
[06:30:47] I prefer to clean up that topology first :)
[07:29:33] morning gentlemen
[07:42:52] I'm building & deploying the logrotating wmf-pt-kill package
[09:04:32] The package is in place, I'll upgrade it on the labsdb hosts
[09:04:52] (doesn't affect anything really, just adds the /etc/logrotate.d/wmf-pt-kill file, so it is harmless)
[09:05:10] I guess you need to restart the service?
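(For context on the /etc/logrotate.d/wmf-pt-kill file mentioned just above: a logrotate drop-in of this kind typically looks roughly like the sketch below. The log path, retention and the copytruncate choice are illustrative assumptions, not the contents of the actual packaged file.)

```
# Hypothetical sketch of a logrotate drop-in for wmf-pt-kill; the path and
# retention settings are assumptions, not the packaged file.
/var/log/wmf-pt-kill/wmf-pt-kill.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    # copytruncate lets pt-kill keep writing to the same file descriptor,
    # so no restart or signal is needed after rotation.
    copytruncate
}
```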
[09:05:29] Monitor the log to make sure nothing weird shows up (ie: killing queries that should not be killed)
[09:08:50] yes, and I'll do it
[09:15:27] upgraded the wmf-pt-kill packages, and now monitoring the logs. So far everything is good, there's no mass killing of queries
[09:19:57] everything seems good, however I'll check in the next few days that after the first rotation the service starts again without any problem (it was tested, and I am pretty sure about it, but I'll check it anyway)
[09:21:40] Sure, make sure no queries are being killed after 0 seconds
[09:21:48] (that was the bug with prepared statements)
[09:36:10] DBA, Patch-For-Review, User-Banyek: Solve logrotating on wmf-pt-kill - https://phabricator.wikimedia.org/T206521 (Banyek) Open>Resolved
[09:39:53] I'll keep the logs open today, and check them several times over the weekend
[09:40:12] Normally weird queries being killed after 0 seconds show up relatively often
[09:40:26] (when the bug is present)
[09:43:28] then probably we are ok
[09:44:23] I guess so, you didn't really touch the killer, but just keep an eye on it for the next hours so we are fully sure it is ok
[09:45:45] 👌
[09:46:18] I'll now prepare the table compressions on dbstore2002 (check which ones could be done, prep the script, etc.) and I'll start it after I return from lunch
[09:46:36] great
[09:52:09] banyek: This is an important task https://phabricator.wikimedia.org/T180918 that is worth reading (hopefully it is kinda solved), but take your time and read it, it will put things in perspective about why lag and pooling/repooling are so important :)
[09:53:03] the title itself sounds serious enough - ok, I'll read it
[09:53:27] I think we have mentioned it to you, but I think we (or at least I) didn't send the task itself
[09:58:37] things on labs are importing
[09:58:47] \o/
[09:58:54] which tables in the end?
[09:58:56] going to do some stuff while I wait for things to finish
[09:59:03] I am doing pagelinks
[09:59:08] but takes a lot of time
[09:59:13] no triggers on that one
[09:59:16] maybe I will reconsider the reimport
[09:59:31] I will see depending on how long it takes
[10:00:09] -rw-rw---- 1 mysql mysql 136G Oct 19 10:00 pagelinks.ibd
[10:00:21] fairly big indeed!
[10:13:51] I'll go and have some food now, and then start the compression on dbstore2002.s4
[10:13:54] see you in the afternoon
[10:14:16] bye
[10:29:48] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) Update: pagelinks is now being re-imported on labs (this tab...
[11:38:07] I'm downtiming the s4 replication icinga check on dbstore2002 as I'm starting to compress the s4 tables
[11:46:17] DBA, User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Banyek) Compressing s4 tables with the following command: ``` mysql -BN -S /run/mysqld/mysqld.s4.sock -e "SELECT table_schema, table_name FROM information_Schema.tables WHERE engine='INNODB' and ro...
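(The bot truncates that comment, so the full command is not visible here. As a rough, hedged illustration of what compressing one of the selected tables comes down to, the statement below uses an example table name, not anything taken from the actual dbstore2002 worklist:)

```
-- Illustrative only: convert one InnoDB table to the COMPRESSED row format.
-- The schema/table name is an example, not the real list from the query above.
-- (On older MariaDB versions this needs innodb_file_per_table and the
-- Barracuda file format.)
ALTER TABLE commonswiki.templatelinks ROW_FORMAT=COMPRESSED;
```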
[12:16:00] abusefilter has the trigger afl_id = '', but there is no such column on the table
[12:16:19] oh, my bad
[12:16:26] it is about abuse_filter_log
[12:16:28] ignore me
[13:25:41] looking at the wmf-pt-kill logs, I propose to introduce the phrase/term QUERYZILLA
[13:28:05] I just saw one which was 537 lines long
[13:39:10] when you get a second, for your eyes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/467264
[13:39:48] let me check it!
[13:47:59] DBA, User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Marostegui) >>! In T204930#4680094, @Banyek wrote: > Compressing s4 tables with the following command: > > > ``` > mysql -BN -S /run/mysqld/mysqld.s4.sock -e "SELECT table_schema, table_name FROM i...
[13:53:42] I checked back on the dbstore2002 s4 compression b/c of marostegui's comment, and I found this:
[13:57:35] `ERROR 1062 (23000) at line 1: Duplicate entry '1149343098-' for key 'ct_rc_id'`
[13:57:51] that is change_tag I guess?
[13:57:58] compressing commonswiki.change_tag
[13:57:59] yes
[13:58:04] is replication stopped?
[13:58:11] it wasn't :/
[13:58:17] that's why I went there
[13:58:29] so, change_tag and tag_summary normally have that issue, so my suggestion is:
[13:58:38] 1) check that there is not a duplicate entry for that key
[13:58:52] 2) normally stop replication before altering those tables (or all the dataset)
[13:59:21] So check that the key isn't duplicated (I doubt it is), stop replication and try the alter again
[13:59:30] ok, thanks!
[13:59:59] banyek: thanks! appreciated
[14:06:17] DBA, Patch-For-Review: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888 (Marostegui) And this got fixed by itself: ``` root@db2033:~# hpssacli controller all show status Smart Array P420i in Slot 0 (Embedded) Controller Status: OK...
[14:15:41] I checked, there's no duplicate entry. I disconnected replication (IO and SQL threads are stopped) and resumed the compression with log_bin=0
[14:15:57] sounds good
[14:16:34] with disconnect replication you mean stop slave;
[14:16:35] right?
[14:16:40] not actually reset slave
[14:16:44] ?
[14:19:20] DBA, User-Banyek: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 (Banyek) The modified command is: ``` mysql -BN -S /run/mysqld/mysqld.s4.sock -e "SELECT table_schema, table_name FROM information_Schema.tables WHERE engine='INNODB' and row_format <> 'COMPRESSED...
[14:20:22] sure
[14:20:36] great
[14:21:24] 'reset slave' would be 'deconfiguring the slave', but actually I think it's better to use the 'stopped slave' and 'reset slave' phrases to avoid confusion
[14:21:33] (But I want to keep QUERYZILLA)
[14:21:44] yeah, we normally use disconnect for when you are actually doing a reset slave all;
[14:21:56] we tend to use stop replication for a stop slave;
[14:23:17] 👍 🦖
[14:29:17] is that a dinosaur?
[14:32:26] it is
[14:35:07] t-rex, because there is no :godzilla: emoji :(
[14:35:09] shame
[14:43:20] DBA, Gerrit, Operations, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) Gerrit is now 100% NoteDB from 2.16 see https://twitter.com/GerritReview/stat...
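(Tying the change_tag retry discussed above together: the sequence below is a hedged sketch of how it could be done on a MariaDB replica. The exact statements, the assumed column list of the ct_rc_id unique key, and the use of the session-level sql_log_bin switch, which is what "log_bin=0" usually refers to here, are illustrations, not a transcript of what was actually run on dbstore2002.)

```
-- Hedged sketch of the retry sequence discussed above; not a verbatim record.
-- 1) Verify the unique key really has no duplicates (assuming the ct_rc_id
--    key covers (ct_rc_id, ct_tag); confirm with SHOW CREATE TABLE first).
SELECT ct_rc_id, ct_tag, COUNT(*) AS n
FROM commonswiki.change_tag
GROUP BY ct_rc_id, ct_tag
HAVING n > 1;

-- 2) Stop replication on this instance (STOP SLAVE, not RESET SLAVE ALL).
STOP SLAVE;

-- 3) Redo the alter without writing it to the binary log for this session.
SET SESSION sql_log_bin = 0;
ALTER TABLE commonswiki.change_tag ROW_FORMAT=COMPRESSED;
SET SESSION sql_log_bin = 1;

-- 4) Resume replication.
START SLAVE;
```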
[14:46:02] DBA, Gerrit, Operations, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox)
[15:01:43] I added my stuff from this week to the DBA-Sync etherpad, lines 85-91, in case of the SRE meeting
[16:34:22] I think I'll call it a day
[16:34:26] bye