[06:54:11] so yesterday I ran pt-table-checksum on 3 m* hosts
[06:54:39] and how was it...
[06:54:45] I got errors on every single table
[06:54:50] lovely
[06:55:08] but now I realize I used the one on debian with the binary bug
[06:55:23] aaah
[06:55:26] so I have to run it all again now
[06:55:35] at least you have all the commands formed already :)
[06:55:51] yes, but very little room for fixes
[06:56:43] what do you mean?
[06:56:53] at least it doesn't give out an error for nonsensical things like the heartbeat table
[06:57:00] haha
[07:06:06] the good news is that the new checksum only finds differences on the mysql table - which is to be expected due to version differences
[07:06:13] \o/
[07:06:51] so far
[07:30:41] I am around btw. Ready for the failover :-)
[07:30:50] good morning
[07:30:51] :-)
[07:31:00] hi
[07:31:08] :-)
[07:31:57] so we can do it earlier \o/
[07:34:07] fine by me :-)
[07:39:03] I would wait a little, I still have some homework to do
[08:12:25] I think I will merge the patches now and disable puppet on old and new master
[08:12:33] \o/
[08:13:59] do you want to take the proxies and the killing of bad connections?
[08:14:14] as you seemed worried about that?
[08:14:17] sure
[08:14:46] e.g. prepare a pt-kill execution to kill some or all connections on db1020?
[08:15:11] yeah :)
[08:15:11] and the actual switchover of the proxies (reload)
[08:15:25] I will take care of the replication steps and read only
[08:15:40] cool
[08:17:34] I am ready
[08:18:29] on the proxies no need to disable puppet
[08:18:37] it only takes effect when reloaded
[08:18:41] yep
[08:20:22] I am going to merge the patches, can you run puppet on proxies and check that the new config is on file?
[08:20:30] yep
[08:25:16] the last pt-table-checksum came out very clean, I just didn't have the time to fully check the large otrs table
[08:25:24] nice!
[08:25:43] I will have to repeat it for the other hosts without the bug
[10:08:05] DBA, ContentTranslation, Language-2018-Jan-Mar, Patch-For-Review, Schema-change: CX2: Register the version used to start a translation - https://phabricator.wikimedia.org/T187986#4052644 (jcrespo) Please don't just add individual people to reviews- workflow is well documented at https://www.m...
[10:10:38] DBA, ContentTranslation, Language-2018-Jan-Mar, Patch-For-Review, Schema-change: CX2: Register the version used to start a translation - https://phabricator.wikimedia.org/T187986#4052653 (jcrespo) Another tip to speed up reviews is to add Manuel and me to the reviews as "reviewers".
[11:13:23] DBA, ContentTranslation, Language-2018-Jan-Mar, Patch-For-Review, Schema-change: CX2: Register the version used to start a translation - https://phabricator.wikimedia.org/T187986#4052729 (Nikerabbit) Thanks for the tips. I was reading https://wikitech.wikimedia.org/wiki/Schema_changes#Workflo...
[11:20:15] DBA, ContentTranslation, Language-2018-Jan-Mar, Patch-For-Review, Schema-change: CX2: Register the version used to start a translation - https://phabricator.wikimedia.org/T187986#4052756 (jcrespo) Yes, beta is fully "yours", you just merge and it gets deployed automatically, I think- I am not...
[11:38:48] DBA, ContentTranslation, Language-2018-Jan-Mar, Patch-For-Review, Schema-change: CX2: Register the version used to start a translation - https://phabricator.wikimedia.org/T187986#4052786 (Marostegui) In addition to what Jaime said, once you add the #blocked-on-schema-change if you want to reu...
[12:31:25] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052908 (Marostegui) p:Triage>Normal
[12:32:06] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052922 (Marostegui)
[12:32:08] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052908 (Marostegui)
[12:33:50] DBA, Operations, Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4052935 (jcrespo)
[12:33:52] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052934 (jcrespo)
[12:33:54] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052936 (jcrespo)
[12:33:57] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4052937 (jcrespo)
[12:35:11] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052908 (jcrespo) Sorry, I did this wrong.
[12:36:02] DBA, Operations, Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4052954 (jcrespo)
[12:36:04] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052953 (jcrespo)
[12:36:06] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052955 (jcrespo)
[12:36:31] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052908 (jcrespo)
[12:36:33] DBA, Operations, Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4049095 (jcrespo)
[12:36:55] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052908 (jcrespo) This is blocked on closing fully T189656, not the other way around.
[12:38:17] DBA: Decommission 1020 - https://phabricator.wikimedia.org/T189773#4052962 (jcrespo)
[12:38:19] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4052963 (jcrespo)
[12:38:45] DBA: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4052908 (jcrespo)
[12:39:55] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052967 (jcrespo) a:Marostegui>jcrespo
[12:40:43] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4052968 (Marostegui)
[12:51:38] jynus: marostegui Hey, quick question. If we (gradually) replace the value of one column with zero in a very large table (wb_terms), is it going to free up space, or do you still need to optimize it?
[12:52:13] (zero in case of int, empty string in case of varchar, etc.)
[12:54:32] could it be a NULL?
it is more compact (best case scenario, 0 bytes; and it will be at least 4)
[12:54:41] *an int
[12:55:56] that requires a schema change to make the column nullable
[12:56:25] then writing to it will not make much difference (just generate io)
[12:57:23] an empty string would be better than any other string, but it depends on the details
[12:59:49] What about storage, is it going to have any impact on that? I'm sorry if you answered it and I didn't understand
[13:00:58] sorry, I lack context- having to optimize a table is a good thing
[13:01:14] it meast it shrunk
[13:01:17] *means
[13:01:29] it is not something to avoid
[13:02:37] given the growth of that large table, we may not need to optimize it, as it will just stop growing for some time
[13:02:57] we need to talk for longer, but I rarely have long enough time
[13:04:00] focus on the ideal normalization first "the right thing to do", we can go over physical optimization much later down the road
[13:05:23] as in, normalization and maybe logical partition should already be a large win, later we can see how to make those changes properly
[13:06:01] with logical partition I mean something like putting labels and descriptions for properties on a separate table (not suggesting that, it is just an example)
[13:06:28] not sure if that helps^ if not, let's schedule a meeting
[13:06:34] Amir1: ^
[13:09:11] jynus_: I would be very happy to have the meeting
[13:09:32] when does it work for you?
[13:09:39] I need to read the document first
[13:10:32] jynus_: the first thing that we are doing now is to replace term_search_key with empty string, it will probably reduce the size to 70% of what it is now, or probably way more
[13:11:10] yes, that could help
[13:11:30] I assume that is a deleted column
[13:11:57] it could* help later to make alters faster
[13:12:45] it's going to be unused because we are using elastic as search backend but dropping the column is not good for third parties, that's the reason the simplest solution was to have config and switch it off (empty string)
[13:13:12] by third parties I mean other installations of the wikibase extension.
[13:16:51] jynus_: ^
[13:17:03] I hope that makes sense to you
[13:18:10] DBA, Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#4053049 (Paladox)
[13:18:15] well, I would make it on design
[13:18:15] DBA, Gerrit, Operations, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4053048 (Paladox)
[13:18:19] nullable
[13:20:22] and while we do that alter table, you can remove the content there
[13:20:44] I don't know, it is difficult to say something without knowing the details
[13:21:16] these weeks are also complicated and there are holidays coming, can you propose a meeting for the beginning of April?
[13:28:01] jynus_: I don't have access to your calendars, I share mine with you
[13:28:30] mmm
[13:28:32] let me see
[13:28:38] give me your email on pm
[14:19:26] DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4053182 (jcrespo) Hi, after Mark's comments and Kaldari response, this seems reasonable-- we will deploy as is, but...
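(annotation) The "gradually replace term_search_key with empty string" plan from the wb_terms discussion above could look roughly like the sketch below. This is an illustration under assumptions, not the code that was actually deployed: an in-memory SQLite database stands in for the production server, the table is cut down to three columns, term_row_id is assumed to be the primary key, and the helper names and batch size are made up. It only shows the batching pattern (small primary-key ranges, one short transaction each); it says nothing about how much space is freed, which is the OPTIMIZE question raised in the chat.

```python
import sqlite3


def make_demo_db(n_rows=2500):
    """Build a tiny stand-in for wb_terms (hypothetical, reduced schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wb_terms ("
        " term_row_id INTEGER PRIMARY KEY,"
        " term_text TEXT NOT NULL,"
        " term_search_key TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO wb_terms (term_text, term_search_key) VALUES (?, ?)",
        [(f"Label {i}", f"label {i}") for i in range(n_rows)],
    )
    conn.commit()
    return conn


def blank_column_in_batches(conn, batch_size=1000):
    """Set term_search_key to '' in primary-key order, one commit per batch.

    Returns the number of rows that were actually changed.
    """
    last_id = 0
    changed = 0
    while True:
        # Pick the next slice of primary keys after the last one processed.
        rows = conn.execute(
            "SELECT term_row_id FROM wb_terms WHERE term_row_id > ? "
            "ORDER BY term_row_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return changed
        first_id, last_id = rows[0][0], rows[-1][0]
        # Only touch rows that still carry a value, so reruns are cheap.
        cur = conn.execute(
            "UPDATE wb_terms SET term_search_key = '' "
            "WHERE term_row_id BETWEEN ? AND ? AND term_search_key <> ''",
            (first_id, last_id),
        )
        conn.commit()  # keep each transaction short
        changed += cur.rowcount
```

Calling `blank_column_in_batches(make_demo_db())` walks the demo table in three batches; running it a second time on the same connection changes nothing, because already-blank rows are skipped.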
[15:02:07] jynus: I would like to do some maintenance on db2039 (codfw master), any objections?
[15:07:35] which section?
[15:07:40] s6
[15:07:59] all ok for me
[15:08:06] I think backups finished
[15:08:23] yeah, I checked them
[15:08:29] cool then
[15:08:38] they took longer this time, because of the archiving
[15:08:49] like 14 hours
[15:08:52] *24
[15:09:57] I like the new backups, it is easy to see what's going on and what's finished and all that :)
[15:10:36] no errors this time
[15:10:44] but if ERROR string is seen
[15:11:04] it aborts the section process and goes to the next one without rotating the latest
[15:11:26] next is to monitor ERRORs / notify on error, etc
[15:11:39] yeah
[15:11:45] that's pretty cool :)
[15:12:15] archiving seems to have worked nicely
[15:13:10] with tar -tvf zuwiktionary.gz.tar the contents seem right
[15:13:39] and only 911 files rather than 120K
[15:14:50] yeah, that's such a big win
[15:15:34] 13G Mar 14 20:58 enwikivoyage.gz.tar
[15:15:43] that definitely has to go to s5
[15:16:01] and cebwiki: 9.8GB
[15:16:54] wow enwikivoyage has increased quite a lot
[15:17:57] it was like that back in January
[15:18:03] T184805
[15:18:03] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[15:18:29] I think debwiki grew more
[15:18:31] *ceb
[15:18:37] that huge table only probably
[15:20:13] did you depool something on s5?
[15:20:20] in s5?
[15:20:22] no
[15:20:31] why?
[15:20:35] there is one server with 40K QPS
[15:20:54] db1100
[15:20:59] I haven't touched s5
[15:21:08] db1100 has weight 500, but that has not been touched
[15:22:24] I wasn't assuming it, it looked weird to me so I just asked if there was maintenance ongoing
[15:22:33] no, nothing there
[15:22:38] it is api + traffic
[15:22:42] so I can investigate
[15:23:02] How's db1082 (the other api)?
[15:23:23] strange, that host says it is only doing 2K ops
[15:23:33] maybe a monitoring glitch?
[15:23:52] db1082 has weight 1 in api and db1100 has 3 in api
[15:24:01] 40k is quite insane
[15:24:07] yeah
[15:24:07] maybe it is a monitoring glitch indeed
[15:24:48] I swear I saw 40K on https://grafana.wikimedia.org/dashboard/db/mysql-aggregated
[15:24:53] on the "top QPS hosts"
[15:25:37] look with these parameters : https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=now-2d&to=now
[15:26:07] do you see also db1100 41.12K WPS at the bottom ?
[15:26:16] *QPS
[15:26:42] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=3&fullscreen&orgId=1&from=1520954795086&to=1521127595086&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
[15:28:37] madness
[15:28:40] is that even possible? XD
[15:28:55] it has to be a bad formula or something
[15:29:01] yeah, it is impossible
[15:29:15] well , actually I saw 100K on a server once
[15:29:19] before SSDs
[15:29:28] (before the empire)
[15:29:51] to be fair, they were 100K "SET variable..."
[15:29:54] haha the empire XD
[15:30:16] https://www.youtube.com/watch?v=SB6brOn_RI4
[15:30:42] hahahahaha
[15:30:48] sorry to ping you, 90% of the cases it is "yeah, I am doing maintenance" because you do so much maintenance
[15:31:00] no no, don't worry at all, I prefer that
[15:31:00] thank you for all the work you do
[15:31:10] Because I can forget/mess up things with so many changes
[15:31:19] so _please_ ping me when you see weird things
[15:31:22] I will try to look at !log more first
[15:31:31] so I don't do it unnecessarily
[15:31:32] no, no, ping me
[15:31:41] yes, but I can still look at the logs first
[15:54:23] going to do the same thing on s5 codfw master (db2052)
[15:55:09] cool
[15:55:49] I will check and fix the backup rotation script and call it a day
[15:56:49] cool! :)
[15:59:44] should we resolve the switchover ticket and report any further issues separately?
[15:59:49] +1!
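(annotation) The backup behaviour described at 15:10-15:11 (if an ERROR string is seen, abort that section, move on to the next one, and do not rotate the latest copy) can be sketched as below. This is a guess at the control flow only, not the real backup script: `run_backups`, `dump_section`, `rotate_latest` and `fake_dump` are hypothetical names, and the dump is modelled as a simple stream of log lines.

```python
def run_backups(sections, dump_section, rotate_latest):
    """Dump each section in turn and rotate only the clean ones.

    dump_section(section) is assumed to yield the dump's log lines;
    rotate_latest(section) is assumed to promote the fresh dump to "latest".
    Both are placeholders for whatever the real script does.
    """
    status = {}
    for section in sections:
        failed = False
        for line in dump_section(section):
            if "ERROR" in line:
                failed = True
                break  # abort this section as soon as an error shows up
        if failed:
            status[section] = "aborted"  # previous "latest" stays untouched
            continue  # carry on with the next section
        rotate_latest(section)
        status[section] = "rotated"
    return status


def fake_dump(section):
    """Stand-in dump for the demo: s5 fails, everything else succeeds."""
    yield f"dumping {section}"
    if section == "s5":
        yield "ERROR: lost connection during dump"
    yield f"finished {section}"
```

For example, `run_backups(["s1", "s5", "s6"], fake_dump, print)` rotates s1 and s6 and reports s5 as aborted. The "monitor ERRORs / notify on error" follow-up mentioned in the chat would hook into the `failed` branch.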
[16:05:37] DBA: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4052908 (jcrespo)
[16:05:41] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4053461 (jcrespo)
[16:05:43] DBA, Operations, Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4053455 (jcrespo) Open>Resolved a:jcrespo This is technically done, not without issues, but not a lot of real actionable once those are fixed. We can prepare an incide...
[16:17:56] I think I don't have to change the rotation script, it should work with the new format
[16:18:09] we will know for sure on the next iteration
[16:18:39] there is 3.8T free, so that should be enough even if nothing is deleted
[17:17:16] DBA: Grant access to pmiazga to db1112 and db1112 (MCR tests hosts) - https://phabricator.wikimedia.org/T189799#4053837 (Marostegui)
[17:17:49] DBA: Grant access to pmiazga to db1112 and db1112 (MCR tests hosts) - https://phabricator.wikimedia.org/T189799#4053850 (Marostegui) Open>Resolved p:Triage>Normal
[17:45:14] DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#3891821 (Niharika) Open>Resolved a:Niharika Thanks @jcrespo, @mark and @kaldari for getting this to a r...
[17:46:18] DBA: Grant access to pmiazga to db1112 and db1112 (MCR tests hosts) - https://phabricator.wikimedia.org/T189799#4054020 (pmiazga) @Marostegui thanks for creating this task. From now on I'll create phab tickets when requesting production access.
[17:56:54] DBA: Grant access to pmiazga to db1112 and db1112 (MCR tests hosts) - https://phabricator.wikimedia.org/T189799#4054045 (Marostegui) The user you will be using has the following grants: ``` SELECT, INSERT, UPDATE, DELETE ```