[02:11:17] 10DBA: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (10Krinkle)
[06:25:14] 10DBA, 10Data-Services: Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 (10Marostegui) a:03Marostegui
[06:25:59] 10DBA, 10Data-Services: Prepare and check storage layer for wawikisource - https://phabricator.wikimedia.org/T269432 (10Marostegui) a:03Marostegui
[06:26:20] 10DBA, 10Data-Services: Prepare and check storage layer for skrwiktionary - https://phabricator.wikimedia.org/T268458 (10Marostegui) a:03Marostegui
[06:28:54] 10Blocked-on-schema-change, 10DBA: Schema change for timestamp fields of jobs table - https://phabricator.wikimedia.org/T268391 (10Marostegui)
[06:33:12] 10Blocked-on-schema-change, 10DBA: Schema change for timestamp fields of jobs table - https://phabricator.wikimedia.org/T268391 (10Marostegui) s3 progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004 [] db1124 [] db1123 [] db1112 [] db1095 [] db1078 [] db1075 [] clouddb1017 [] cloud...
[06:35:45] 10DBA, 10Data-Services: Prepare and check storage layer for skrwiktionary - https://phabricator.wikimedia.org/T268458 (10Marostegui) This has been sanitized. I have tested the triggers by creating my user. I am now running a data check on labsdb1009, 1010, 1011, 1012 as well as clouddb1020:3315 and clouddb1016:33...
[06:35:49] 10DBA, 10Data-Services: Prepare and check storage layer for wawikisource - https://phabricator.wikimedia.org/T269432 (10Marostegui) This has been sanitized. I have tested the triggers by creating my user. I am now running a data check on labsdb1009, 1010, 1011, 1012 as well as clouddb1020:3315 and clouddb1016:331...
[06:35:53] 10DBA, 10Data-Services: Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 (10Marostegui) This has been sanitized. I have tested the triggers by creating my user. I am now running a data check on labsdb1009, 1010, 1011, 1012 as well as clouddb1020:3315 and clouddb1016:331...
[06:35:56] 10DBA, 10Data-Services: Prepare and check storage layer for skrwiki - https://phabricator.wikimedia.org/T268412 (10Marostegui) This has been sanitized. I have tested the triggers by creating my user. I am now running a data check on labsdb1009, 1010, 1011, 1012 as well as clouddb1020:3315 and clouddb1016:3315 Af...
[07:11:22] 10Blocked-on-schema-change, 10DBA: Schema change for timestamp fields of jobs table - https://phabricator.wikimedia.org/T268391 (10Marostegui)
[07:11:31] 10Blocked-on-schema-change, 10DBA: Schema change for timestamp fields of jobs table - https://phabricator.wikimedia.org/T268391 (10Marostegui) 05Open→03Resolved All done
[07:58:38] 10DBA, 10Patch-For-Review: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (10Marostegui)
[08:06:16] 10DBA, 10Patch-For-Review: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (10Marostegui)
[08:40:03] 10DBA, 10Data-Services: Prepare and check storage layer for skrwiktionary - https://phabricator.wikimedia.org/T268458 (10Marostegui) - Private data check was clean - `_p` database created - Grants added ` *****labsdb1009***** Database (skrwiktionary%) skrwiktionary skrwiktionary_p GRANT SELECT, SHOW VIEW ON `...
[08:40:10] 10DBA, 10Data-Services: Prepare and check storage layer for wawikisource - https://phabricator.wikimedia.org/T269432 (10Marostegui) - Private data check was clean - `_p` database created - Grants added ` *****labsdb1009***** Database (wawikisource%) wawikisource wawikisource_p GRANT SELECT, SHOW VIEW ON `wawi...
[08:40:16] 10DBA, 10Data-Services: Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 (10Marostegui) - Private data check was clean - `_p` database created - Grants added ` *****labsdb1009***** Database (eowikivoyage%) eowikivoyage eowikivoyage_p GRANT SELECT, SHOW VIEW ON `eowi...
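The per-host checklist for T268391 earlier in the log matches the usual direct-alter rollout, where the DDL is applied host by host without writing it to the binlog. A minimal sketch of that pattern follows; the table and column definitions here are illustrative assumptions, not taken from the task:

```sql
-- Run on each replica in the checklist first, then on the master last.
-- Keeping the ALTER out of the binlog prevents it from replicating
-- downstream before the replicas further down the chain are ready.
SET SESSION sql_log_bin = 0;
ALTER TABLE job
  MODIFY job_timestamp BINARY(14) NOT NULL;  -- hypothetical target type
SET SESSION sql_log_bin = 1;
```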
[08:40:21] 10DBA, 10Data-Services: Prepare and check storage layer for skrwiki - https://phabricator.wikimedia.org/T268412 (10Marostegui) - Private data check was clean - `_p` database created - Grants added ` *****labsdb1009***** Database (skrwiki%) skrwiki skrwiki_p GRANT SELECT, SHOW VIEW ON `skrwiki\\_p`.* TO 'labsd...
[08:40:24] 10DBA, 10Data-Services: Prepare and check storage layer for madwiki - https://phabricator.wikimedia.org/T269440 (10Marostegui) - Private data check was clean - `_p` database created - Grants added ` *****labsdb1009***** Database (madwiki%) madwiki madwiki_p GRANT SELECT, SHOW VIEW ON `madwiki\\_p`.* TO 'labsd...
[08:40:42] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for skrwiktionary - https://phabricator.wikimedia.org/T268458 (10Marostegui) a:05Marostegui→03None
[08:40:56] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for wawikisource - https://phabricator.wikimedia.org/T269432 (10Marostegui) a:05Marostegui→03None
[08:41:15] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 (10Marostegui) a:05Marostegui→03None
[08:41:29] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for skrwiki - https://phabricator.wikimedia.org/T268412 (10Marostegui) a:05Marostegui→03None
[08:41:42] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for madwiki - https://phabricator.wikimedia.org/T269440 (10Marostegui) a:05Marostegui→03None
[08:42:04] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for skrwiktionary - https://phabricator.wikimedia.org/T268458 (10Marostegui) Reminder, the views need to be created on the new hosts too, so this would be: labsdb1009 labsdb1010 labsdb1011 labsdb1012 clouddb1016:3315 clo...
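The grant statements quoted by the bot are truncated in the log; a hedged reconstruction of the general per-wiki shape follows (the grantee name is a placeholder, not the real account):

```sql
-- Wiki-replica pattern: a <wiki>_p database holds the sanitized views,
-- and read-only grants are added on it for the replica user.
CREATE DATABASE skrwiki_p;
GRANT SELECT, SHOW VIEW ON `skrwiki\_p`.* TO 'replica_user'@'%';  -- placeholder user
```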
[08:42:07] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for wawikisource - https://phabricator.wikimedia.org/T269432 (10Marostegui) Reminder, the views need to be created on the new hosts too, so this would be: labsdb1009 labsdb1010 labsdb1011 labsdb1012 clouddb1016:3315 clou...
[08:42:13] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 (10Marostegui) Reminder, the views need to be created on the new hosts too, so this would be: labsdb1009 labsdb1010 labsdb1011 labsdb1012 clouddb1016:3315 clou...
[08:42:21] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for skrwiki - https://phabricator.wikimedia.org/T268412 (10Marostegui) Reminder, the views need to be created on the new hosts too, so this would be: labsdb1009 labsdb1010 labsdb1011 labsdb1012 clouddb1016:3315 clouddb10...
[08:42:26] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for madwiki - https://phabricator.wikimedia.org/T269440 (10Marostegui) Reminder, the views need to be created on the new hosts too, so this would be: labsdb1009 labsdb1010 labsdb1011 labsdb1012 clouddb1016:3315 clouddb10...
[08:43:14] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (10Marostegui)
[08:55:10] jynus: regarding x2, I am still thinking about STATEMENT vs ROW. ROW would be easier to set up from scratch; on the other hand, I am thinking about future schema changes, where ROW is a pain in the ass to handle
[08:55:27] Yesterday I was OK with ROW, but today I am more: STATEMENT is easier to handle in the future
[08:55:38] Cause being realistic, we are not going to migrate to ROW anytime soon
[08:56:02] And every time we have to handle schema changes on x1, it is a damn pain
[08:56:14] especially those that change a data type, which need to be deployed directly on the master, which is a no-go
[08:56:55] maybe the data structure is fixed and no schema changes are needed?
[08:57:15] that's what I was told
[08:57:20] But you know how things change over time XD
[08:57:52] I will check again with them before I give them the green light to start creating tables
[09:01:37] may I ask why x1 schema changes are a pain? (I don't have the context). My guess would be due to per-host changes and potential replication breakage, but in the case of low-throughput x1 tables, why not just use pt-online-schema-change and do them all at once?
[09:02:26] cause with ROW, if you alter the replica first and change a data type, then replication will break :(
[09:02:40] see my counter question :-)
[09:02:51] I don't trust pt-online-schema-change
[09:03:14] Too many bad experiences with 1) triggers 2) metadata locking for the echo tables on enwiki
[09:03:30] ok, I wasn't aware of #2
[09:03:41] I was aware of it for enwiki and commonswiki changes
[09:03:45] but not for x1
[09:03:54] yeah, for the rest of the wikis it is ok-ish, but enwiki, wikidata and commons... on x1 it is a pain
[09:04:08] it is a lottery
[09:04:28] so I don't have a super-confident answer to your question
[09:04:48] that's why I am in doubt about ROW vs STATEMENT
[09:04:49] but I would be ok with deciding to go for STATEMENT for primary instances
[09:05:08] the thing that ROW fixed is 3-chain replication (e.g. for labs)
[09:05:22] what do you mean?
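The breakage described at [09:02:26] can be sketched as follows, assuming default settings: with row-based events, a data type mismatch between master and replica stops the replica SQL thread unless slave_type_conversions allows the conversion. Table and column names here are illustrative, not from the log:

```sql
-- On the replica ONLY (the master still has the column as INT):
ALTER TABLE jobs MODIFY job_attempts BIGINT UNSIGNED;

-- The next row-based event from the master that touches this table
-- still encodes the column as INT; with slave_type_conversions at its
-- empty default, the replica stops with an error along the lines of
-- "Column ... cannot be converted from type 'int' to type 'bigint'"
-- instead of applying it. With STATEMENT, the same ALTER-replica-first
-- rollout would usually keep replicating.
```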
[09:05:26] so I would prefer ROW for secondary replication
[09:05:52] as in, P (stmt) -> S (row) -> T
[09:06:07] where P is primary, S secondary, T tertiary
[09:06:15] as in 3rd in the chain
[09:06:19] on replication
[09:06:22] Yeah, I meant STATEMENT only for the primary master
[09:06:35] (and candidate)
[09:06:59] but that is also a counter-argument for using STATEMENT, as we'd need primary and candidate with STATEMENT and just one host with ROW
[09:07:07] so my only warning is: you are seeing the pain of schema changes, but maybe you don't remember the pain of replication breaking due to STATEMENT :-)
[09:07:48] yeah, candidate is no problem as long as it doesn't replicate further
[09:08:44] I will double check with them again (I checked months ago) how likely the need to change the schemas in the future is
[09:08:44] for example, I think the latest replica breakages were due to STATEMENT + tables with no PKs
[09:08:53] yeah, for sure
[09:09:21] just to be clear, I am not saying we should go A or B, I am being a bit of a devil's advocate
[09:09:32] I think there is no right answer, sadly
[09:09:52] just "choose the lesser pain" :-)
[09:09:59] yes, there is no right answer
[09:10:06] And I change my mind every day
[09:10:09] I wonder, however, what could we do to reduce schema change impact?
[09:10:28] as in, what could we add, e.g. automation, to reduce schema change pain?
[09:10:28] we can use gh-ost
[09:10:36] but we still have metadata locking issues anyway
[09:10:37] but that doesn't fix the metadata issue
[09:10:39] yeah
[09:10:44] if altering one of those tables
[09:10:52] but at least we get rid of my fear of triggers
[09:11:08] BTW, doesn't gh-ost have orchestrator awareness?
[09:11:15] the good thing is that x1 doesn't need many schema changes, but given we are in 2020, maybe tomorrow we'll receive a massive task for it
[09:11:45] ok, so another counter question - we saw that we could change it online, right?
[09:11:58] binlog?
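The P (stmt) -> S (row) -> T chain discussed above can be set up per instance, since binlog_format is a server-level (and session-level) setting. A hedged sketch, with generic role names rather than real hosts:

```sql
-- On the primary P: keep statement-based logging, which makes
-- replica-first schema changes easier to roll out.
SET GLOBAL binlog_format = 'STATEMENT';

-- On the intermediate S (replica of P, master of T): re-log as ROW.
-- S executes P's statement events and, with log_slave_updates=ON,
-- writes the resulting changes to its own binlog as row events, so T
-- only ever receives ROW and is shielded from unsafe statements.
SET GLOBAL binlog_format = 'ROW';
```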
[09:11:59] yes
[09:12:02] if that is right, we could change the decision easily later
[09:12:02] but we never trusted it
[09:12:15] I recall we ran into something years ago, didn't we?
[09:12:17] I think it just needed a flush logs or something, didn't it?
[09:12:29] yeah, that for sure
[09:12:31] or finishing ongoing queries, etc.
[09:12:39] But I think we saw something years ago
[09:12:41] I cannot remember it
[09:12:47] so what I am going to say is "just do whatever you think is wiser now"
[09:12:54] that's what I don't know!
[09:12:55] :)
[09:12:56] as we can correct it later
[09:13:12] I will double check with them again to see how confident they are about not needing many changes
[09:13:30] thanks for the chat, it was useful
[09:13:31] my only "blocker" is wikireplicas, but I think that's not an issue here
[09:13:37] wikireplicas?
[09:13:49] It is not being replicated to wikireplicas
[09:13:50] they needed ROW to avoid breaking every single time :-(
[09:14:04] but I am guessing this will not be replicated there
[09:14:12] so not a blocker/consideration?
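On the "change it online" point at [09:12:02]: changing the global binlog format only affects sessions started afterwards, which is probably the "flush logs or finish ongoing queries" caveat half-remembered at [09:12:17]. A hedged sketch:

```sql
-- Switch the server default; no restart needed.
SET GLOBAL binlog_format = 'ROW';

-- Existing connections keep their session-level format until they
-- reconnect, so long-lived connections must be cycled for the change
-- to fully take effect.
FLUSH LOGS;  -- rotate to a fresh binlog file at the format boundary
```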
[09:14:15] nope
[09:14:19] not for now
[09:14:19] so there it is
[09:14:50] sometimes it is better to decide something, and we can reevaluate later
[09:15:02] rather than staying in analysis-paralysis
[09:15:05] I'll have another chat with them
[09:15:06] :-)
[09:15:29] btw, different topic
[09:15:35] re: backups of x2
[09:16:21] a new backup is not a huge overhead, so even if they say it is not needed, don't put the bar too high to set them up
[09:16:27] like we did for misc dbs
[09:16:45] No, I just asked them, I didn't want you to do any work in case they finally say no
[09:16:57] I am waiting for their answer, I was planning to create a task for you if they wanted them in the end
[09:17:10] yeah, that is what I mean, I will be waiting for what you say
[09:17:20] cool
[09:17:39] but if they say it could help speed up recovery, even if it is not canonical data and the dataset is not that big, we can still do them
[09:17:49] sounds good
[09:17:51] as this is both backups AND provisioning
[09:17:52] I will let you know
[09:18:08] e.g. to speed up new setup of hosts in the future
[09:19:01] for context, I found "backup all the things" is sometimes easier than cherry-picking
[09:19:05] :-)
[09:19:15] haha
[09:19:45] small performance penalty but less sanity impact
[09:29:33] 10DBA, 10Data-Services, 10User-Kormat, 10cloud-services-team (Kanban): Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511 (10Marostegui) @Bstorm keep in mind that while this is important to have it running (puppet keeps reporting failures as the daemo...
[09:29:50] 10DBA, 10Data-Services, 10User-Kormat, 10cloud-services-team (Kanban): Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511 (10Marostegui)
[10:20:14] 10Blocked-on-schema-change, 10DBA: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187 (10Marostegui) p:05Triage→03Medium a:03Marostegui I haven't found anything on codesearch that could have the `user_properties_property` index forced
[10:27:50] 10Blocked-on-schema-change, 10DBA: Increase size of content_models.model_id - https://phabricator.wikimedia.org/T270053 (10Marostegui) a:03Marostegui
[10:29:40] 10DBA: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) a:03Marostegui
[10:55:04] https://jira.mariadb.org/browse/MDEV-24416 - it says 10.5 in it, but just in case, you might want to subscribe to it
[11:02:31] jynus: hi! :)
[11:02:32] yeah, I knew that
[11:02:42] marostegui: that is why I don't trust that log at all
[11:03:19] we do more heuristics to check the backup completes ok (process code, files are in place, final metadata, etc.)
[11:04:02] makes sense yeah
[11:04:03] elukey: do we need to announce the restart?
[11:04:50] nono, I see that no stat100x tcp conns are currently present on the host, and people use them once in a while for querying, nothing that keeps pulling data afaik
[11:05:16] ok, then I will log it and do a restart there, just be around in case we need input
[11:05:22] super
[11:05:25] thanks a lot
[11:06:23] I will take the time to do a full host restart for updates
[11:08:11] (it will take a bit to downtime everything)
[11:13:01] I had the same issue as kormat, blocked on sda2, which is the swap?
[11:13:28] oh, this could actually be a real issue - if swapping was happening due to high memory pressure
[11:13:37] yeah, that is what was found
[11:13:46] and also swapoff as a workaround
[11:13:48] so it is good that we restart now and not later
[11:14:23] so the 90 is a good alerting limit
[11:20:32] elukey: all back to green, if you want to check that https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=4&orgId=1&refresh=5m&var-server=dbstore1004&var-datasource=thanos&var-cluster=mysql&from=1608095957761&to=1608117557762 doesn't go back to 90% because of T270112? Otherwise, we should be ok for a couple of months
[11:20:32] T270112: mariadb on dbstore hosts, and specifically dbstore1004, possible memory leaking - https://phabricator.wikimedia.org/T270112
[11:20:45] we have monitoring anyway
[11:21:50] we could also do dbstore1005, which is at 86.40%?
[11:28:02] jynus: I have to run an errand in a bit, but if you have time we can do this afternoon (or during the next days)
[11:28:15] if it is a simple restart I can take care of it (if you want)
[11:28:28] sure, we can talk later
[11:32:10] ack, I'll ping you!
[13:06:31] I am generating high read db traffic (the equivalent of multiple table dumps) on db1150 - don't worry, this is me doing backup testing
[13:06:46] ok, ta
[13:42:18] cdanis: for the x2 dbctl new section, do you prefer me to create a task for you or are you ok using T269324?
[13:42:18] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[13:43:03] marostegui: using that task is fine
[13:43:34] cdanis: cool, I will probably ping you later this week to get the x2 section up, although I am not going to include the hosts just yet
[13:43:36] I can nitpick the review if needed :D
[14:08:49] 10DBA, 10Patch-For-Review: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (10Marostegui)
[15:19:14] deleted images are in a bad place. There are differences between metadata and backend storage (in some wikis) of almost 10%
[15:19:51] that is, 10% of images are missing either from the backend storage or from metadata
[15:21:44] jynus: remember that the oldimage table doesn't even have a PK, so that's a can of worms
[15:22:00] yeah
[15:22:35] it could be that things are not that bad, but the same file has been registered several times
[15:22:46] (on metadata)
[15:22:50] could be the PK issue definitely
[15:23:35] I may ask for your help and kormat's next Q for some cleanup
[15:23:53] assuming I can find the errors systematically
[15:24:08] sure
[15:24:14] let me know how I can help
[15:24:59] I think the easiest one would be to delete empty records, and I will coordinate with you on that
[15:25:09] sounds good
[15:25:19] but not this year :-)
[15:30:57] cdanis: I added https://wikitech.wikimedia.org/w/index.php?title=Dbctl&type=revision&diff=1891693&oldid=1867596 feel free to change anything
[15:31:52] thanks!
[15:31:59] I just merged the patch, you can see some stuff like
[15:32:01] {"x2": {"master": "PLACEHOLDER", "min_replicas": 0, "readonly": false, "ro_reason": "PLACEHOLDER", "flavor": "regular"}, "tags": "datacenter=codfw"}
[15:32:13] excellent, I will take care of all that
[15:32:19] you can also see the placeholder instances
[15:32:23] {"db1151": {"host_ip": "0.0.0.0", "port": 3306, "sections": {}, "note": ""}, "tags": "datacenter=eqiad"}
[15:32:26] yep
[15:32:27] cool
[15:32:30] thanks a lot
[15:32:35] but since they don't have any sections associated with them, there are no changes in the generated config
[15:32:50] makes sense, I will take care of all that
[15:32:51] thank you
[15:33:06] thank you, and let me know if you need help
[15:33:11] thanks :*
[15:33:32] I can scrounge some dbctl parts and ship them to you
[15:33:52] haha
[15:34:28] don't worry, I will take care of all that config, which won't happen before January anyway
[15:35:01] thanks for the help
[15:35:16] Hopefully for future new sections I won't bother you with that doc! :)
[15:35:54] 10DBA, 10Patch-For-Review: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (10Marostegui)
[15:52:07] np!
[15:52:27] marostegui: as for the doc, I think you probably do not need a dbctl config commit after editing the sections - only the instances
[15:52:42] it's associating instances with a section that affects anything in the output config
[15:52:46] the rest looks good though!
[17:00:12] cdanis: thanks - I will change it!
[17:43:17] 10DBA: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) I built db1154 with s1, s3, s5 and s8, and as soon as I started s1, there were InnoDB errors - which I was kind of expecting after seeing T267090#6629364. So s1 is not to be trusted and needs to be bui...
[17:50:06] 10DBA: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10jcrespo) > and needs to be built from the sanitarium masters - Not saying it has to happen this other way, but in case it could help, to provide an alternative to save doing manual work: could it be rebuilt from...
[17:55:25] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 (10Bstorm) a:03Bstorm This should all be taken care of in the cookbook... however, I need to make sure the meta_p script runs in a test first. Once I've done t...
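For the oldimage cleanup idea at [15:23:53] ("assuming I can find the errors systematically"), one hedged starting point, given the table has no primary key, is to group on the columns that should be unique per file revision. The column names are real MediaWiki schema, but treating that pair as unique is an assumption of this sketch:

```sql
-- Find files registered more than once in metadata: candidates for
-- the duplicate-registration theory at [15:22:35].
SELECT oi_name, oi_timestamp, COUNT(*) AS copies
FROM oldimage
GROUP BY oi_name, oi_timestamp
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```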
[18:04:40] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for skrwiki - https://phabricator.wikimedia.org/T268412 (10Bstorm) To clinic duty: please wait until I finish a manual run of maintain-meta_p T269427#6696324
[18:07:34] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for madwiki - https://phabricator.wikimedia.org/T269440 (10Bstorm) To clinic duty: please do this after I've tested maintain-meta_p at T269427#6696324
[18:19:56] 10DBA: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) Probably the codfw hosts will have the same problem. Also, I prefer not to mix data between DCs, especially because the existing clouddb hosts were cloned from sanitarium's masters in eqiad
[23:58:26] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for eowikivoyage - https://phabricator.wikimedia.org/T269427 (10Bstorm) a:05Bstorm→03nskaggs @nskaggs I believe you are clear to use the cookbook method to deploy these wiki databases to the replica views.