[07:21:34] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820100 (Marostegui) The following tables have been renamed as they are in the list of ignored t...
[07:37:14] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2820103 (Marostegui) Papaul has swapped the PSUs with each other, so I am trying to crash it again. If this is not successful we will try to replace both PSUs with other ones.
[08:10:50] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820152 (Marostegui) I just tested what happens with a master with the old timestamp format and...
[08:14:51] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2820171 (Marostegui) And the server died again after 37 minutes (yesterday when it was plugged into the other PDU it took a lot longer to crash and it only crashed on the second attempt after 1:20h...
[08:34:35] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2820195 (jcrespo) > How is the haproxy layer failed over (between nodes) in prod atm? LVS or ucarp/VRRP or ? There is no redundancy at the mome...
[08:42:04] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820213 (Marostegui) I have audited all the tables and columns appearing on the triggers below a...
[08:52:07] jynus: Can I run an alter table on db1092? Are you done with it?
(it is depooled -S5)
[08:52:19] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2820253 (jcrespo) I am ok with that, but on my version, it doesn't create a temporary table (maybe you were mixing 2 different executions?)...
[08:57:31] marostegui, yes, I am finished
[08:57:39] thanks!
[09:03:10] DBA: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2820273 (Marostegui) Running ALTER table on db1092 which is currently depooled due to: T151272
[09:06:17] what is the plan with db1095?
[09:06:49] jynus: I wanted to let it replicate for a few more hours and after that, I would say we are ready to copy the content to a labs box
[09:06:50] (aside from leaving it there for a while)
[09:07:10] so you want to copy that already, without compression?
[09:07:16] or does it have compression?
[09:07:23] no, it doesn't
[09:07:40] at this point we need to decide if we want more than 1 shard
[09:07:43] I was asking because I do not know if it is better to compress
[09:07:44] or stick to 1 shard
[09:08:02] Let me check one thing
[09:08:05] or to load the things from dbstore
[09:08:29] we have to do only one shard, but we need to do all shards eventually
[09:08:57] also, do not move the tables on sanitarium2
[09:09:07] just delete them
[09:09:16] no need to be careful there
[09:09:24] sure
[09:10:03] if something breaks it will break equally deleting and moving, and we can reimport individual tables from dbstore1001
[09:10:13] ok, I left a file logging the lag of dbstore2001 (and 2002) every 2 minutes for the last hours and I see no big deals (maybe 40 seconds of lag or stuff like that)
[09:10:16] so compression looks good
[09:10:32] the problem is importing s3
[09:10:43] lots of work there
[09:10:48] well, what we can do is
[09:10:55] 1) import s3 with a normal netcat
[09:11:01] 2) run the script (will take ages)
[09:11:11] 3) import the rest of shards moving the tablespaces, less work
[09:11:18] no?
[09:11:39] are those xors?
[09:11:49] haha
[09:11:49] or steps
[09:12:04] I do not understand
[09:12:06] I think it is easier to import s3 in a normal way
[09:12:12] (netcat)
[09:12:19] rather than importing the ibd files
[09:12:23] so we discard what we have done with s1?
[09:12:35] that is why I was asking
[09:12:53] ah, well, setting all this up doesn't take long now that we know how to do it
[09:12:58] I can have S3 and S1 done in one day
[09:13:25] well, but not populating labs if it is going to be reimported
[09:14:22] yes, that might take another day
[09:14:30] to copy the content to the labs servers
[09:14:50] maybe labs can be done in a single copy?
[09:15:04] to avoid overhead and errors?
[09:15:14] yes, once we have db1095 with both shards, we can just use netcat
[09:15:17] and copy all the stuff
[09:15:28] that is why I was asking, coming up with a plan
[09:15:37] to not work more than necessary
[09:15:42] ok, let me write one in the ticket and we can discuss it there?
[09:15:48] ok
[09:15:54] going to do it now
[09:15:55] then there is the rest of the shards
[09:15:58] in the meta ticket?
[09:16:00] that can wait
[09:16:03] sure
[09:16:09] so we want S3 and S1 at this point
[09:16:23] but we may have to use sanitarium
[09:16:33] I am thinking of all shards, after december
[09:16:39] not to be done now
[09:16:47] but they have to be eventually done
[09:16:51] yeah, that can be done without problems as we can import them easily
[09:16:58] the problem is just S3 :)
[09:17:03] so not to create ourselves problems
[09:17:08] after the goal
[09:17:28] yes agreed
[09:17:35] I am thinking of that
[09:17:44] not because it has to be done in a hurry
[09:17:58] but to make our life easier
[09:18:21] well, the goal says 1 shard, and that shard would need to be s3 to avoid problems in the future, but we can include S1 easily
[09:18:25] at this point I mean
[09:18:34] for example
[09:18:43] if we import, import already compressed
[09:18:49] yeah
[09:18:55] so we do not have to work more later
[09:19:01] agreed
[09:19:16] note we have until 31 december
[09:19:31] so we do not have to have it next week
[09:19:35] sure
[09:19:44] But I would like to have S3 and S1 by next week up and running
[09:19:50] actually by the first days of the week
[09:19:58] on db1095 at least
[09:20:05] db1095 maybe
[09:20:14] labsdb may be too optimistic
[09:20:34] ok, let me write the plan in the meta ticket
[09:20:36] so we can discuss
[09:20:39] and leaving some days to check for problems, as you intended
[09:20:42] it was a good idea
[09:20:43] yep
[09:20:49] I never trust replication :p
[09:21:49] especially not when combined with arbitrary sanitization and replication filtering
[09:22:12] exactly
[09:26:54] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2820295 (Marostegui) And it crashed again after 57 minutes. @Papaul let's try what we agreed on yesterday - swap both PSUs with different ones.
[09:32:53] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820298 (jcrespo) db1095 is ignoring 'heartbeat.%', and we need that for replication lag control...
[09:34:42] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820311 (Marostegui) Yes, you are right. It was ignored on the previous tests because ROW based...
[09:36:44] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2820313 (Marostegui) The tests on T150960 are looking good so we'd need to discuss the next...
[09:41:42] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2820324 (jcrespo) Looks good. s3 will need lots of checking. Maybe I could work on an heuris...
[09:44:05] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2820338 (Marostegui) >>! In T150802#2820324, @jcrespo wrote: > Looks good. s3 will need lots...
[09:48:03] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820365 (Marostegui) heartbeat is no longer being ignored.
[09:55:20] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820400 (jcrespo) Why was it being ignored, puppet? If yes, we need to change it there.
[09:55:50] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2820401 (Marostegui) No, it was me, manually, after importing S1.
[10:30:53] DBA: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967#2820496 (Marostegui) db1092 is done: ``` MariaDB PRODUCTION s5 localhost dewiki > show create table revision\G *************************** 1. row *************************** Table: revision Create Table: CREATE TABLE `revis...
[10:31:25] DBA, Labs, Labs-Infrastructure: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2820497 (jcrespo) I have been testing the roles, they work as advertised: ``` mysql -u u2...
[10:46:21] DBA, Labs, Labs-Infrastructure: Migrate existing labs users from the old servers, if possible using roles and start maintaining users on the new database servers, too - https://phabricator.wikimedia.org/T149933#2820524 (jcrespo) @chasemp @yuvipanda @Marostegui With the above, the changes to permissio...
[10:47:01] ^roles
[10:49:35] DBA, Operations: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820534 (Peachey88)
[10:52:37] DBA, Operations: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2819072 (jcrespo) There is no /srv partition on silver. Probably it should check /a instead of /?
[10:56:49] DBA, Operations: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820552 (jcrespo) The MariaDB disk space check is a legacy of the past; there should be only one disk check, and the critical (and warning) level should be higher for database...
[11:11:46] DBA, Operations: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2820601 (Volans) @jcrespo see T151489, there is a `/srv` mount point as well as `/a` mount point, they both mount the same partition!
[11:21:47] DBA: Meta ticket: Deploy InnoDB compression where possible - https://phabricator.wikimedia.org/T150438#2820626 (Marostegui) p:Triage>Low
[11:43:10] DBA, Operations, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2820680 (jcrespo) p:High>Normal
[11:43:51] DBA, Availability: Look into Maria 10 parallel-replication - https://phabricator.wikimedia.org/T85266#2820681 (jcrespo) p:Triage>Low
[11:56:05] DBA, Operations, Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#2820707 (jcrespo) p:Triage>Normal
[12:02:09] jynus: slave_exec_mode | STRICT on db1095 so ROW based is actually working otherwise we'd have seen it
[12:02:56] lunch
[13:12:49] DBA, Labs, Labs-Infrastructure: Provision with data the new labsdb servers and provide replica service with at least 1 shard from a sanitized copy from production - https://phabricator.wikimedia.org/T147052#2820860 (Marostegui)
[13:12:51] DBA: Fix dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T130128#2820861 (Marostegui)
[13:12:54] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2820858 (Marostegui) Open>Resolved This is done - I will create a different task to import all the shards to dbstore2001 and dbstore2002.
[13:21:42] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2820867 (Marostegui) Candidate masters for db1095: S3: Depooled servers: db1035 db1044 S1...
[13:31:18] DBA: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#2820882 (Marostegui)
[13:31:51] DBA: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#2820882 (Marostegui)
[14:02:15] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2821001 (Marostegui) So far so good, replication is going fine and I can see new records being i...
[14:09:20] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2821109 (jcrespo) I believe you, I am just unsure //why// it works. BTW, I've seen replication w...
[14:19:30] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2821160 (Marostegui) >>! In T150960#2821109, @jcrespo wrote: > I believe you, I am just unsure /...
[14:26:54] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2821176 (jcrespo) >>! In T150960#2821160, @Marostegui wrote: > We can do that, but how can we fi...
[14:29:18] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2821195 (daniel) I think phab behavior is misleading at best. I commented to that effect on the upstream ticket.
[14:29:50] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2821197 (daniel) @Marostegui any chance of recovering the lost titles and descriptions?
[14:29:55] jynus: you have the pt-table-checksum handy?
[14:30:14] like, the options
[14:30:35] yes
[14:30:43] but I can do that, do not worry
[14:30:51] ah good :)
[14:31:09] remember that we need to patch pt- to work for our environment
[14:31:17] true
[14:31:25] I think the patch went through
[14:31:38] but I doubt it is already on debian, as it was not a security issue
[14:32:46] https://phabricator.wikimedia.org/T151228#2821197 -> you have any advice on that? not now, whenever you have time to read and teach me :)
[14:33:12] how busy are you today?
[14:33:24] I am not much because of the freeze
[14:33:28] same here
[14:33:34] maybe we can discuss it now?
[14:33:40] maybe we can talk about this and the mariadb package
[14:33:47] should take little time
[14:33:53] sounds good
[17:52:40] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2821602 (jcrespo) I was working with Manuel on this, and he recovered a copy of the database from 2016-10-05 01:00:02, previous one exist. In order to help with this issue, I...
[18:24:29] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2811476 (epriestley) If it's easier, this data can also be recovered completely from the live `phabricator_calendar.calendar_eventtransaction` table, which stores the old and...
[18:50:38] DBA: Use tls for dump backup generation - https://phabricator.wikimedia.org/T151583#2821866 (jcrespo)
[22:50:12] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2822184 (mmodell) @jcrespo & @Marostegui: The calendar event phid for E66 is `PHID-CEVT-sggwinpwrtfsppo7pbqd` and the timestamp for the edit is `1479751874`. All of the aff...
[23:14:12] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2822198 (mmodell) This query selects the phid of the event instance, the oldValue and the newValue for the events in question: ```lang=mysql SELECT ev.phid, tr.old...
[23:23:38] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2822216 (mmodell) this is the update query which I am not confident enough to run on the live database: ```lang=mysql UPDATE calendar_event ev INNER JOIN calendar_eventtrans...