[05:18:57] I am guessing you are working with labsdb hosts?
[05:22:15] yes
[05:22:56] getting ready to avoid the usual issues with page links during Christmas
[05:23:11] sorry to bother you, it wouldn't be strange to have those failing :-)
[05:23:18] (repl)
[05:23:28] hehehe
[05:23:32] yeah
[05:23:53] there's also a big alter from yesterday
[05:24:00] which made them lag
[05:24:54] logical backups are still running
[05:26:20] of course! it is early! :)
[05:37:52] general backups incrementals are running now, when they finish I will test archive restore
[05:41:01] incrementals?
[05:57:25] not databases
[05:57:45] ah ok :)
[06:12:33] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[06:17:46] DBA, conftool: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 (Marostegui)
[06:20:13] DBA, conftool: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 (Marostegui) Open→Resolved
[06:39:28] DBA, Data-Services: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399 (Marostegui) `wikidatawiki.pagelinks` has been imported into labsdb1010 from labsdb1012 - the host is now catching up.
[06:39:35] DBA, Data-Services: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399 (Marostegui)
[07:19:02] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:55:54] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:31:39] DBA, Data-Services: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399 (Marostegui) Open→Resolved Host repooled
[09:26:00] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[10:33:56] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[11:07:04] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[13:28:59] DBA, Growth-Team, Operations, StructuredDiscussions, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (Marostegui) >>! In T107610#5726992, @Marostegui wrote: > The new hosts for es4 and es5 have been ordered and wil...
[14:22:27] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[14:48:44] "187088 Full 0 0 Error 17-Dec-19 14:30 Offsite_Job" many times
[14:49:02] I think I will need to manually purge metadata for very old jobs
[14:49:08] to avoid log spam
[15:52:48] o/
[15:53:05] If I wanted to get some sort of raw log of queries for a few time periods for wikidatawiki, would that be possible?
[15:59:53] you mean reads?
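(A rough sketch, not taken from the log: purging catalog metadata for very old Bacula jobs like the failed Offsite_Job above is usually done from bconsole. Whether delete or prune is the right command depends on the local retention policy; the job id is just the one quoted in the error, and the sudo/bconsole invocation is an assumption about the local setup.)

    # Inspect old jobs, then drop the catalog record for one of them so it
    # stops generating noise in the status output.
    echo 'list jobs' | sudo bconsole
    echo 'delete jobid=187088' | sudo bconsole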
[16:13:51] addshore: would this be enough over longer periods of time?
[16:13:53] https://wikitech.wikimedia.org/wiki/MariaDB/query_performance
[16:14:20] write queries, and yes I think so
[16:14:31] e.g.: https://wikitech.wikimedia.org/wiki/MariaDB/query_performance/coreproduction-main-20170207
[16:14:42] if you only need writes, that is even easier
[16:15:44] is it possible to get all write queries in a 1 minute window for 1 host in history?
[16:16:00] or is it more of a thing that needs to be done as / before the thing you need logging is happening
[16:16:05] we have all write queries of all hosts in the last month
[16:16:12] that sounds perfect
[16:16:36] however, the preferred workflow would be for the analysis to be done on production servers
[16:17:04] so that it never leaves the production cluster, as it contains private data
[16:17:30] ack!
[16:17:53] send a ticket with what you need so we understand what the best way is
[16:18:17] e.g. if you only need something for a specific issue, we may be able to help
[16:19:23] This relates to https://phabricator.wikimedia.org/T237984
[16:19:29] * addshore figures out a good question to ask
[16:19:38] e.g. if you only need 1 minute at a specific time we can get it relatively easily, or even just one or a few transactions
[16:20:09] I see
[16:20:20] the exact thing I am looking at at the moment is trying to figure out when and why wbt_text.wbx_id = 110010309 was deleted
[16:20:25] please write there a timestamp and a specific way to get the transaction you want
[16:20:41] like a query size, or table, etc, and we can get it for you
[16:20:50] s/query size/query name/
[16:21:07] something we can grep, if you get us :-)
[16:21:13] ack, to find the query in the log, is this essentially a grep?
[16:21:20] cool, I can figure that out a bit
[16:21:22] * addshore will be back
[16:21:32] and we may be able to paste it on an NDA paste
[16:21:43] or leave it on mwmaint
[16:21:52] yeah
[16:22:09] I am assuming it is probably a mostly public transaction
[16:22:23] that is the only thing we are worried about, not leaking user activity
[16:23:12] addshore: to give you an idea, we can use a timestamp, a gtid, a binlog offset
[16:23:27] and then grep for a specific string or query comment or something
[16:23:50] so, I have absolutely no idea timewise when this would have happened, I'm guessing grepping the whole month is not ideal?
[16:23:59] yep, not ideal
[16:24:11] if you can narrow it somehow
[16:24:25] I would be grepping for "DatabaseTermIdsCleaner::cleanTextIds" as the function making the query, "DELETE" as the query type and "110010309" as the id in the query being deleted
[16:24:45] do you have a date range at least even if not a timestamp?
[16:25:01] we can grep based on backups too
[16:25:13] but the more general, the more dedication it will require
[16:25:24] grep based on backups? backups of the query log? or something else?
[16:25:34] the weekly backups
[16:25:42] Sorry for not participating a lot, I am very busy with s4 commons master doing very delicate operations :)
[16:25:53] e.g.: "check backups when row ID was deleted"
[16:25:56] essentially I have no idea when this has occurred, but I can probably guess at some timestamps that might have caused it and see if the delete is there
[16:26:20] please dump all the info you have on the ticket, the more info you give us, the better we may be able to help
[16:26:24] and CC us
[16:27:00] but no guarantees, binlogs are not indexed
[16:27:03] do these weekly backups exist just for 1 week? or are they around for the last few weeks?
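(A minimal sketch of the binlog grep workflow discussed above, assuming statement-based binlogs and shell access to copies of the binlog files; the path, file name and time window below are placeholders, not values from this investigation.)

    # Decode one binlog file, restricted to a suspect time window, and look
    # for the id of the deleted row; the matching DELETE statement should also
    # carry the DatabaseTermIdsCleaner::cleanTextIds query comment.
    mysqlbinlog \
        --start-datetime='2019-12-04 00:00:00' \
        --stop-datetime='2019-12-05 00:00:00' \
        /srv/binlogs/db1XXX-bin.00NNNN \
      | grep -n -C 5 '110010309'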
[16:27:13] we have a few weeks
[16:27:25] that is why I proposed it as a way to narrow it down, if you know the row id
[16:27:27] and the table
[16:27:49] In that case a good starting point would be to see where "SELECT * FROM wbt_term_in_lang WHERE wbtl_id = 110010309" on the backups gets us
[16:28:12] on wikidatawiki
[16:28:27] I can check the values of that row on past backups, yes
[16:28:39] let me double check first and I'll come back in a little bit
[16:28:45] please file that there or on a separate ticket
[16:28:55] thanks! knowing that this is an option is great. Okay will file a ticket :)
[16:29:33] just please be reasonable, it is quite a time-consuming operation
[16:30:43] and persistence team is quite precious- if it is needed it is needed, don't feel guilty, we just ask you to do as much work as possible in advance
[16:30:52] 0:-)
[16:31:01] team's time
[16:34:40] I have written the details in https://phabricator.wikimedia.org/T237984#5748184 including the table, db, id and function call
[16:35:08] I guess I can not run queries against the backups myself?
[16:35:41] hang on, it exists on dbstore1005 but not in production?
[16:35:53] ah wait, I read it wrong
[16:35:53] sorry
[16:36:08] addshore: nope, sorry, it would need to be recovered first, which takes some hours per backup
[16:36:29] it is much easier to grep the compressed extracts
[16:36:51] hmmm, okay, maybe we don't want to bother starting to do that today and hope I come across something else in the next 24 hours
[16:36:59] "grep the compressed extracts" ?
[16:37:44] don't worry, thanks for the ticket, now I have something to work with
[16:38:04] that should be enough I think, but please be patient
[16:38:10] ack, I will be
[16:38:47] At the moment we are very confused by this ticket, but I think I have it nailed down to the code doing something bad when it comes to deleting things. Amir has already been looking at this for weeks, now it's my turn to tag in :)
[16:39:19] if it was deleted in the last month, I will be able to tell you exactly when and how
[16:39:43] <3
[16:39:57] It is also a good exercise, maybe it can be documented (I can help) as an example of a request for how to check a given row in the last XX days
[16:40:12] but if it is prior to 1 month, it will be almost impossible (we only have full backups up to the last 3 months)
[16:41:10] marostegui: yeah, even some tooling could be done, but it is such an infrequent request it may not be worth it
[16:41:31] yeah, that's true, but at least we can document the onliners
[16:41:37] one liners
[16:42:10] this is the time when mydumper shines
[16:42:21] all refactoring leading to this task :-D
[16:42:21] XDD
[16:50:36] good news
[16:51:01] what is it! https://media1.tenor.com/images/4432faa596cba606b11deb29eb23ea67/tenor.gif?itemid=4423146
[16:51:07] addshore: could you confirm data on wbt_text is public?
[16:51:14] specifically, that row?
[16:51:17] Yes it is :)
[16:51:19] before pasting it here
[16:51:20] thanks
[16:51:27] (just double checking)
[16:52:30] (so much anticipation)
[16:52:31] https://phabricator.wikimedia.org/P9916
[16:52:48] 2019-12-03 backup has it
[16:52:53] so it was deleted somewhere between the 3rd and the 10th Dec?
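(For context, "grep the compressed extracts" can be as simple as scanning the per-table files of each weekly logical (mydumper) backup for the row id. The path and file naming below are made up for illustration; the real backup layout is not shown in this log.)

    # Check which weekly extracts still contain the row: if it shows up in the
    # 2019-12-03 dump but not in the 2019-12-10 one, the delete happened
    # somewhere in between.
    for f in /srv/backups/dumps/*/wikidatawiki.wbt_text*.sql.gz; do
        printf '%s: ' "$f"
        zgrep -c '110010309' "$f"
    done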
[16:52:53] the one on the 10th doesn't
[16:53:09] please note dates are approximate and not transactionally-safe
[16:53:26] oooh, okay
[16:53:29] but marostegui's statement is (approximately) correct
[16:53:42] so a single week to check narrows it a lot
[16:53:55] and being in the last month is nice because we have binlogs
[16:54:10] :D
[16:54:42] always ping us as soon as possible, because the more recent, the more info we have
[16:54:56] binlog backups are scheduled, but not next quarter
[16:55:12] so now time to grep the binlogs
[16:56:02] I am going to transfer those temporarily from the master to dbprov1001 to avoid touching the master
[16:56:11] +1
[16:56:35] =]]
[17:10:52] any joy with the grep? :D
[17:11:08] wait, I am just transferring the files away first
[17:11:30] we need wildcards on transfer.py, for + seq is not precisely efficient :-D
[17:12:28] jynus: maybe cp to a local directory and just transfer that new directory?
[17:12:51] yeah, that was a great idea 5 minutes ago :-D
[17:13:04] now it is about to finish
[17:13:08] :)
[17:17:29] grep speed will also not be optimal: 100GB+ of compressed data to scan
[17:21:59] :D
[17:22:30] Wednesday the 4th ongoing...
[17:23:02] I was scared it was going to be complicated, but apparently someone architectured backups properly...
[17:24:08] So, once the grep hits something, we can then get the whole transaction? as we will know the exact time it happened? :)
[17:25:23] I think I got it
[17:25:31] yep, full transaction and time
[17:25:36] and context
[17:26:04] DELETE /* Wikibase\Lib\Store\Sql\Terms\DatabaseTermIdsCleaner::cleanTextIds */ and a bunch of ids
[17:26:49] will tell you more in a sec
[17:27:04] :DDDD
[17:27:21] * addshore is looking forward to spending some more hours staring at things once I have this transaction :D
[17:32:58] location is db1109-bin.002842:231062708
[17:33:18] timestamp 191205 11:09:27
[17:33:35] will paste you the whole transaction at mwdebug1002?
[17:34:03] sounds amazing!
[17:34:08] thankyouu!
[17:40:48] or 1001
[17:47:12] I have the .sql ready
[17:47:15] any will work
[17:49:17] mwdebug1001 would be great
[17:49:47] * addshore is ducking out now to get food, back later
[17:53:13] addshore: you should be able to read mwdebug1001:/home/jynus/wbt_text_transaction.sql
[17:53:21] that is the whole transaction
[17:53:38] I can get you more, but that would be enough to start with :-)
[17:54:36] (200KB of sql queries)
[17:55:54] the delete is inside that transaction, on line 2735
[18:08:17] nice job jynus!
[18:08:37] I'm going offline now, not touching s4 master anymore for today, more tomorrow
[18:08:39] byee
[18:11:53] bye
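(A sketch of how a full transaction can be extracted once the binlog file and start offset are known, as above. The exact command used here is not shown in the log, and the stop condition is an assumption; in practice you would stop at the next transaction's start position or shortly after the event's timestamp.)

    # Dump everything from the known offset up to just after the event's
    # timestamp, producing something like the wbt_text_transaction.sql
    # mentioned above.
    mysqlbinlog \
        --start-position=231062708 \
        --stop-datetime='2019-12-05 11:09:28' \
        db1109-bin.002842 > wbt_text_transaction.sql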