[06:05:32] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2741092 (Marostegui) `templatelinks` table finished its compression and went from 135G to 39G. I did not optimize this table first, so we are not sure how much the compression rate was though. As the server...
[06:28:50] DBA: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2741095 (Marostegui) >>! In T148967#2738775, @Anomie wrote: >>>! In T148967#2738422, @Marostegui wrote: >> ``` >> #./software/dbtools/osc_host.sh --host=xxx --port=3306 --db=dewiki --table=revision --method=ddl --no-replicate "DROP...
[06:29:11] DBA, Patch-For-Review: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305#2741096 (Marostegui) Before proceeding check: T148967#2741095
[07:41:12] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Implement proxysql both for labs and for later production usage - https://phabricator.wikimedia.org/T148500#2724798 (ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1011.eqiad.wmne...
[08:05:08] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2741130 (jcrespo) a:jcrespo>None
[08:09:52] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Implement proxysql both for labs and for later production usage - https://phabricator.wikimedia.org/T148500#2741133 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1011.eqiad.wmnet'] ``` and were **ALL** successful.
[09:02:20] MySQL has discard partition tablespace but mariadb doesn't :_(
[09:03:40] https://jira.mariadb.org/browse/MDEV-10568
[09:03:58] no answer from them in 2 months
[09:05:04] DBA, MediaWiki-API, Performance: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077#2741216 (jcrespo)
[09:06:48] marostegui, so much fun!
[09:07:04] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dbproxy1011
[09:08:18] Oh nice!!! it got reimaged! :)
[09:09:56] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2741235 (Marostegui) S4 cannot be imported as it has tables with partitions and MariaDB currently does not support moving partitions (https://jira.mariadb.org/browse/MDEV-10568) I assume the way to do it woul...
[09:10:33] wait wait
[09:10:41] tell me
[09:10:46] how come s4 has tables with partitions?
[09:10:52] templatelinks
[09:10:56] aside from the special slave?
[09:11:01] that is new
[09:11:06] Ah, ok, so I was not mad :)
[09:11:08] It sounded weird to me
[09:11:16] But I was like: "ok, might be normal"
[09:11:26] which slave did you use?
[09:11:37] check db2065 or db2058
[09:11:43] those two have it
[09:11:47] they all have partitioning?
[09:12:04] I will check all now, give me a sec
[09:13:34] yes
[09:13:38] All of them for the templatelinks table
[09:13:51] I guess they were rebuilt from the wrong slave?
[09:14:09] or from the right one
[09:14:14] how is it on eqiad?
[09:14:15] XD
[09:14:18] checking
[09:14:55] Only two have them (the special slaves)
[09:15:05] then yes, you are right
[09:15:12] copy from eqiad
[09:15:20] You want me to remove the partitions from at least one slave from codfw so I can leave them clean?
[09:15:26] well
[09:15:31] not sure if it is worth it
[09:15:46] how large it is?
[09:16:10] it took 16 hours to compress and repartition logging on s4 eqiad
[09:16:29] what I do not know is if that is documented
[09:16:38] https://phabricator.wikimedia.org/P4300
[09:16:51] no, I mean on the partitioning schemas
[09:17:38] nope https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/dbtools/s4-pager.sql
[09:18:07] the big question is, does it work?
[09:19:11] uf
[09:19:29] marostegui, wait, is only that table partitioned there?
[09:20:00] because that would make no sense
[09:20:08] that's a yes
[09:20:16] let me see
[09:20:33] yep
[09:20:39] Looks so in db2065
[09:21:36] maybe this started and stopped mid-deployment?
[09:22:02] but makes no sense
[09:22:10] it is partitioned by namespace
[09:22:18] so most are going to be on the template namespace
[09:23:18] yes, we have a 55GB partition
[09:23:33] and then a few MB for the others
[09:23:48] not sure what this achieved
[09:24:22] Yeah, the partitioning makes no sense, from the point of view of how the data is organized
[09:24:45] I would like to leave one slave clean at least, so we can use it to clone others if needed at some point
[09:24:58] yes
[09:25:03] We can try to remove partitioning from one
[09:25:11] it may take 5 hours to fix, however
[09:25:19] well, I can use eqiad meanwhile
[09:25:21] to copy data over
[09:27:45] jynus: If no objections I would start with db2052 (dump) (inactive), vslow
[09:27:51] (I am creating a task)
[09:28:26] sorry, db2051
[09:28:44] gah! db2058 so many terminals opened
[09:36:51] yes, 58 is the dump one
[09:37:20] oki
[09:37:42] thanks
[09:38:38] DBA: codfw: Fix S4 commonswiki.template links partitions - https://phabricator.wikimedia.org/T149079#2741270 (Marostegui)
[09:39:03] DBA: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079#2741283 (Marostegui)
[09:41:49] I've created a new dashboard on kibana: https://logstash.wikimedia.org/app/kibana#/dashboard/DBReplication
[09:43:40] lag per wiki?
[09:43:57] no, it is replication errors in general
[09:44:08] but it is too verbose
[09:44:25] maybe it is better to skip lag errors if they are below XX seconds?
[09:44:28] or something like that?
[09:44:43] that cannot be done on kibana
[09:44:53] but on mediawiki
[09:46:09] DBA: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079#2741301 (Marostegui) This is now running - db2058: ``` ./software/dbtools/osc_host.sh --host=db2058.codfw.wmnet --port=3306 --db=commonswiki --table=templatelinks --method=ddl --no-replicate "remove PARTITION...
[09:47:31] jynus: question for you :)
[09:48:02] jynus: Given that I have to depool a server from s4 eqiad to use it to copy it to dbstore…
[09:48:12] If I do a stop slave; the LB should get rid of it
[09:48:27] But I would still like to " #" it to make sure it is well depooled and for being clear
[09:48:45] is that fine or would you, yourself, stop slave it and let the LB take care of the rest?
[09:48:56] do not stop it
[09:49:04] that is ok for emergencies
[09:49:12] but not for maintenance
[09:49:21] Cool, so we are on the same page :)
[09:49:27] it will still generate a lot of logging
[09:49:32] a lot of errors
[09:49:44] queries will quickly change host
[09:50:02] too many issues
[09:50:13] gotcha
[09:50:19] Makes sense
[09:50:30] the aim is to have a proxy in the future, to be able to do it securely, easily and quickly
[09:51:13] which one are you going to use?
[09:51:21] I was checking db1059
[09:51:36] And I was wondering if db1068 would be able to handle all the API traffic for a few hours
[09:51:51] on s4, probably
[09:52:00] s5 is having issues
[09:52:24] now?
[09:52:32] for a few days
[09:52:47] didn't you see the conversation/tickets I created?
[09:53:14] The ApiQueryRecentChanges ones?
[09:53:21] yes
[09:53:49] Yes I saw that one
[09:54:19] Sorry I got confused now as we were talking about S4
[09:54:24] But yes, yes
[09:54:50] well, it is not like api servers are different
[09:55:00] they are just queried or not
[09:56:41] You think a rewrite is doable?
[09:57:16] why not?
[09:57:38] Don't know how willing people are normally to rewrite code to fix bad queries
[09:57:52] well, the bug is reported
[09:58:49] if people do not want to do their job, that is a different problem...
[10:00:50] jynus: As db1068 will handle the API traffic itself, I have removed it
[10:00:53] from the main traffic service until db1059 is back.
[10:01:03] Basically I wanted it to handle less general traffic meanwhile so it can handle all the API one
[10:01:18] less than 1 per thousand?
[10:01:30] well, that is better than nothing
[10:01:30] while leaving it out of the lag checks?
[10:01:46] Ah, I see I see
[10:01:48] Good point
[10:02:02] Only thought about not stressing it
[10:02:11] But didn't think about the lag checks
[10:02:20] it is not obvious
[10:02:27] the logic has changed
[10:02:35] and I think it will change soon
[10:02:43] to "wait only for the majority"
[10:03:12] but for now, all servers that we wait for should have >= 1
[10:03:21] I see
[10:03:26] I have amended it
[10:03:29] Thanks for the explanation
[10:58:51] I just realised that db1059 does have partitioning too, damn it! only the new servers don't have it (db1081,84,91 and the master .40)
[11:06:51] DBA, Cognate, Wikidata, User-Addshore: Initial Cognate DB review - https://phabricator.wikimedia.org/T148988#2741378 (Addshore)
[11:07:58] jynus: *poke* just reminding you of the above ticket :)
[11:09:11] I do not understand the columns
[11:09:32] why title, key?
[11:14:34] addshore^
[11:19:22] It is also not clear to me how many of those you intend to create
[11:19:31] if one per wiki, or only one
[11:21:18] please chat to me later, I think there are some normalization issues there
[11:22:46] jynus: how many as in how many tables? rows?
[11:23:45] DBA, Cognate, Wikidata, User-Addshore: Initial Cognate DB review - https://phabricator.wikimedia.org/T148988#2741414 (Addshore)
[11:27:02] DBA, Cognate, Wikidata, User-Addshore: Initial Cognate DB review - https://phabricator.wikimedia.org/T148988#2741417 (Addshore)
[11:27:11] added some more notes
[12:44:22] addshore, I have still some questions
[12:53:04] jynus: okay!
[14:43:45] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2741807 (Marostegui) I have been testing `gtid_domain_id` and I have seen good and a mix of bad news. Good news: Changing it dynamically doesn't affect the connected replication threads. Bad news: So far I...
[15:07:21] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2741841 (Marostegui) And while still testing it inserting data in both masters and stopping/starting the slaves replication got broken: ``` Oct 25 15:00:53 ubuntu-512mb-lon1-01 mysqld: 161025 15:00:53 [ERROR...
[15:23:35] DBA, MediaWiki-API, Performance: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077#2741216 (Anomie) Hmm. Yeah, theoretically it could realize that it could do this by combining multiple ranges from an appropriate index like we'd...
[15:46:56] DBA, MediaWiki-API, Performance: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077#2741927 (jcrespo) Well, I would wait a bit before working on this, to check if we really want to support this kind of complex queries in the first...
[16:16:11] DBA, Cognate, Wikidata, User-Addshore: Initial Cognate DB review - https://phabricator.wikimedia.org/T148988#2742025 (jcrespo) @Addshore I have some questions, they are not long, but they depend on each other, so I would love to chat with you when you find the time, as that will simplify the inte...
[16:17:32] DBA, Cognate, Wikidata, User-Addshore: Initial Cognate DB review - https://phabricator.wikimedia.org/T148988#2742029 (Addshore) >>! In T148988#2742025, @jcrespo wrote: > The main blocker is: is everything that this table has 100% public, or will it contain some private information. If it is fully...
[16:17:44] jynus: I am also EU and free to talk whenever you would like! :)
[16:19:45] I will disconnect soon, but if you have some time tomorrow, I have some questions
[16:20:27] I need to understand the meaning of some of those columns, I do not know if they reference pages or titles
[16:29:37] DBA, MediaWiki-API, Performance: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077#2742038 (Anomie) >>! In T149077#2741927, @jcrespo wrote: > Well, I would wait a bit before working on this, to check if we really want to support...
[16:31:40] jynus: okay! speak tomorrow :)
[16:33:51] DBA, MediaWiki-API, Performance: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077#2742048 (jcrespo) > I'd think the (rc_namespace, rc_type, rc_timestamp) index would be useful here as a better target for unionizing than the exis...
[16:43:46] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2742063 (Marostegui) `logging` (50G->19G) and `categorylinks` (54G->19G) have been compressed.
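Note on the compression figures reported so far for dbstore2001: the sketch below just turns the before/after sizes from the task updates into the share of disk space freed. The helper names (`space_saved`, `report`) are made up for illustration, and the final size for `categorylinks` is assumed to be 19G (the update only says "19"). Since `templatelinks` was not optimized before compressing, its figure probably includes reclaimed fragmentation and should be read as an upper bound on the compression gain.

```python
# Sizes in GB as reported in the task updates (before, after).
# Assumption: the "19" reported for categorylinks means 19G.
SIZES_GB = {
    "templatelinks": (135, 39),
    "logging": (50, 19),
    "categorylinks": (54, 19),
}


def space_saved(before_gb, after_gb):
    """Return the fraction of on-disk space freed (0.0 to 1.0)."""
    return 1 - after_gb / before_gb


def report(sizes):
    """Build one human-readable summary line per table."""
    return [
        f"{table}: {before}G -> {after}G, "
        f"{space_saved(before, after):.0%} smaller"
        for table, (before, after) in sizes.items()
    ]


if __name__ == "__main__":
    print("\n".join(report(SIZES_GB)))
```

Running it prints one line per table; the same helper can be reused for other tables once their sizes are known.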
[16:47:42] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742068 (jcrespo)
[16:56:34] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742092 (jcrespo)
[16:59:44] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742093 (jcrespo) @Papaul From the output I would replace disks #4, #7 and #11, which should be the ones with the light on. Disk #1 has some media errors, but I suppose we can live with it for now.
[17:02:36] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2742097 (Marostegui) >>! In T146261#2741841, @Marostegui wrote: > And while still testing it inserting data in both masters and stopping/starting the slaves replication got broken: > > ``` > Oct 25 15:00:53...
[17:05:51] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2742104 (jcrespo) Smells like a bug? Like a race condition for a very specific replication state?
[17:24:37] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: CirrusSearch SQL query for locating pages for reindex performs poorly - https://phabricator.wikimedia.org/T147957#2710120 (debt) Hi @jcrespo and @Marostegui - can we get a status on this issue, please? Thanks!
[17:27:47] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2742174 (Marostegui) Could be a bug indeed I am going to do a fresh install tomorrow to discard any issues as I have been playing around with binlogs, relay logs, the gtid mysql table etc.
[17:48:29] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: CirrusSearch SQL query for locating pages for reindex performs poorly - https://phabricator.wikimedia.org/T147957#2742260 (jcrespo) @debt It is on "next", which means it will be one of the tasks we will work immediately...
[19:05:10] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: CirrusSearch SQL query for locating pages for reindex performs poorly - https://phabricator.wikimedia.org/T147957#2742502 (EBernhardson) I'm not really sure what to do here either though, to the best of my knowledge this...
[19:11:25] DBA, CirrusSearch, Discovery, Discovery-Search (Current work), and 2 others: CirrusSearch SQL query for locating pages for reindex performs poorly - https://phabricator.wikimedia.org/T147957#2742512 (EBernhardson) Same thing for codfw, run from wasat.codfw.wmnet to not have the roundtrip latency:...
[21:00:36] DBA, Operations, ops-codfw: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2742860 (RobH) Open>stalled So these are 300GB SEAGATE ST3300657SS. 3.5" 15K SAS disks, and we don't keep any of these spare. (We've moved on to SSDs in new databases.) I'll create a sub-task i...
[21:37:46] DBA, MediaWiki-API, Performance: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki - https://phabricator.wikimedia.org/T149077#2743035 (Tgr) This query asks for the recent content changes. That (separating content changes from changes in supplemental namespaces) is central...