[06:16:43] 10DBA: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) Users looking good: ` mysql.py -hdb1111 mysql -e "select user,host from user where user like 'wik%';" +-----------+-----------+ | User | Host | +-----------+-----------+ | wikiadmin | 10.192.% | | wikiuser...
[06:19:28] 10DBA, 10Patch-For-Review: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui) 05Open→03Resolved 10.4 has been tested in production on s1 for 3 weeks, there are pending things that already have their tracking task. The task for placing 1 10.4 host per section will be...
[06:20:30] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui)
[06:20:42] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui) p:05Triage→03Medium
[06:27:35] db1111 is now in s8
[06:27:39] addshore Amir1 ^
[06:27:45] it will take a few hours for me to fully pool it
[06:28:01] It is running buster and mariadb 10.4 (the others run stretch+10.1)
[06:28:15] We have tested 10.4 in s1 for 3 weeks, and it is looking good
[06:28:27] I will slowly give more weight to db1111 throughout the day today
[06:29:37] 10DBA, 10Patch-For-Review: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) db1111 placed in s8 with weight 1 just to check grants and other possible errors. db1111 runs buster + 10.4 whilst the others run stretch+10.1 (T246604)
[06:32:16] 10DBA, 10Cleanup: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (10Marostegui) As this wiki has been removed everywhere, this triggered the check_private data alert to let us know there's "private" data on sanitarium hosts (and labs hosts) as this w...
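[Editor's note: the grant check quoted in T246447 above can be sketched as a small script. This is a hypothetical illustration, not the actual tooling; the expected (user, host) pairs below are assumed values, and in production the rows would come from the `mysql.py ... "select user,host from user where user like 'wik%'"` query shown in the ticket comment.]

```python
# Hypothetical sketch of the grant sanity check run after pooling a new
# replica with weight 1: compare the wiki* users present on the host
# against the set we expect. EXPECTED_USERS is an assumption for
# illustration, not the real production grant list.
EXPECTED_USERS = {("wikiadmin", "10.192.%"), ("wikiuser", "10.192.%")}

def missing_grants(rows):
    """Return the expected (user, host) pairs absent from the queried rows."""
    return EXPECTED_USERS - set(rows)

# Simulated output of the mysql.user query on db1111:
rows_from_db1111 = [("wikiadmin", "10.192.%"), ("wikiuser", "10.192.%")]
print(sorted(missing_grants(rows_from_db1111)))  # an empty list means users look good
```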
[06:42:30] 10DBA, 10Patch-For-Review: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) Events enabled ` root@db1111.eqiad.wmnet[ops]> show events; +-----+--------------------------+------------------+-----------+-----------+------------+----------------+----------------+----------...
[06:48:45] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui)
[08:00:06] Morning
[08:00:26] marostegui: yay!
[08:00:50] marostegui: one way to bring cpu down might be to uncompress just 1 of the tables
[08:01:09] Not sure if that's possible / what extra disk usage it would take
[08:02:33] not worth the time I think, that will take quite long too
[08:02:51] with this new server I think we'll be in a much better shape
[08:04:21] there used to be issues in the past when uncompressing tables too
[08:04:25] like weird behaviours
[08:06:56] okay!
[08:07:57] The reason I had that thought is, the wbt_item_terms table ends up with some number of rows being selected for one of the query patterns, and I guess that's where the biggest cpu hit ends up happening
[08:08:34] yeah, but that table, compressed, is 150GB already...
[08:08:47] so let's leave it like that XD
[08:09:08] ack! :) yes, it is also the biggest table :p
[08:09:08] hehe
[08:14:03] * addshore is going to restart the rebuild script for the tables now :)
[08:14:50] ok
[08:15:03] running now :)
[08:22:21] marostegui: give me a ping when you think I'll be able to start increasing the # of items again :)
[08:22:39] addshore: will do, let me get the new server with a bit more traffic first
[08:23:21] yupp :)
[08:23:32] Will this server end up taking a bulk of the load too?
[08:23:39] If so, should I also do the cache warming on it?
[08:23:52] so far I have only been cache warming db1126
[08:26:48] addshore: yeah, it will take a bunch of load
[08:26:57] addshore: if you can warm it up, that'd help yeah
[08:29:46] will do! :)
[08:30:16] thanks
[08:30:19] Do you want me to warm it up now for what it would already be reading?
[08:30:20] how long does it usually take?
[08:30:38] addshore: yeah, it has some weight already, but warm it up for that yeah
[08:30:43] Once you are done I will increase its weight a bit
[08:30:57] ack!
[08:33:51] cache is warming, it'll probably take a couple of hours to get all the way to Q6 million I believe
[08:34:06] cool
[08:34:10] thanks
[08:34:20] 1.5 mins per batch of 100k ish
[08:34:23] I will probably increase its weight a bit more
[08:34:27] In like 10 mins
[08:34:33] So it will get more live reads
[08:40:18] Thanks for sorting this extra db server btw :)
[08:40:39] no problem!
[08:49:15] marostegui: thanks!
[08:54:27] Amir1: <3
[08:54:33] addshore: gave it a bit more weight on the LB
[09:02:20] woo!
[09:03:17] marostegui: cache warm is at 4.4 million for db1111
[09:03:20] (first pass)
[09:03:28] i can do a second if you want :)
[09:03:35] sure, it wouldn't hurt
[09:12:44] starting pass 2 now
[09:12:49] excellent, thank you
[09:27:04] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui)
[09:33:42] second warm at 3 million :)
[09:33:49] finished?
[09:33:57] 3 more million to go
[09:34:00] ah cool
[09:34:07] let me know when done, and I will give it more traffic
[09:34:07] I think you'd be fine to increase the load a bit more though
[09:34:27] looking at the numbers everything that I read is already in the cache now :)
[09:34:34] cool
[09:34:36] increasing a bit
[09:37:33] are 1126 and 1111 roughly the same size?
[09:37:44] they are the same HW yes
[09:38:31] coool!
[09:48:42] cache warm second pass all done for db1111
[09:49:19] I'm going to start warming 6-8 million on both db1111 and db1126
[09:49:29] cool!
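[Editor's note: the batched cache warming discussed above (reading item ranges in batches of roughly 100k so the replica's buffer pool is populated before it takes live traffic) could look roughly like this. The batching logic is a sketch; the actual maintenance script, table, and query are assumptions and are only hinted at in a comment.]

```python
BATCH_SIZE = 100_000  # "1.5 mins per batch of 100k ish", per the log

def warm_range(run_query, start_id, end_id, batch_size=BATCH_SIZE):
    """Touch rows for item ids in [start_id, end_id) in fixed-size batches,
    so repeated reads land in the replica's buffer pool before live traffic
    arrives. `run_query` is a caller-supplied function that performs one
    range read, e.g. a SELECT over wbt_item_terms for ids lo..hi-1
    (illustrative query, not the real script's)."""
    batches = 0
    for lo in range(start_id, end_id, batch_size):
        hi = min(lo + batch_size, end_id)
        run_query(lo, hi)
        batches += 1
    return batches

# Usage sketch: warm the Q6,000,000..Q8,000,000 range mentioned in the log.
seen = []
print(warm_range(lambda lo, hi: seen.append((lo, hi)), 6_000_000, 8_000_000))  # -> 20
```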
[09:49:37] I will increase the weight a bit more in a bit
[10:09:59] I assume we might get a new DB for codfw too at some point then? :P
[10:10:18] yep, that's the plan. To keep codfw at the same level
[10:10:31] We are trying to see if we can get another extra host in eqiad (and two in codfw)
[10:10:36] We'll see
[10:11:56] I wrote a ticket last week too to see if we can add "group" things to the queries run by client and repo for wikidata stuff, so that we can split the query patterns out a bit more between dbs, and also that can mean that a slow wikidata.org repo replica / being overloaded by connections or whatever, wouldn't make page rendering on enwiki for example slow
[10:12:09] IMO that would be a very valuable thing for us to do
[10:12:42] yeah
[10:12:51] that can probably help indeed
[10:13:14] we normally leave the vslow replica without traffic
[10:13:27] but in this case, on friday we gave it traffic, otherwise we'd struggle with the other servers
[10:13:44] and of course, it was the end of the month and dumps start at the start of the next...
[10:13:47] so we ran into that
[10:13:58] yupp :P
[10:14:05] I reduced its weight yesterday and it went back fine
[10:14:07] dumps starting on a sunday, woo! :P
[10:14:20] They just start at the start of the month
[10:14:26] Bad timing in this case! :)
[10:14:38] Anyways, decreasing weight fixed it
[10:32:17] marostegui: okay for me to make reads go from 6 to 8 million once I have cache warmed?
[10:32:38] let me increase its weight first
[10:34:23] okay! :)
[10:36:55] So I've seen this alert "Uncommitted dbctl configuration changes" firing falsely a couple of times (I am guessing like now, too)
[10:37:06] yep
[10:37:07] only the one on cumin2001
[10:37:09] I just committed a change
[10:37:21] I don't know if there is lag or something
[10:37:50] in my case it stayed up for quite some time, even after I forced a recheck
[10:38:18] we may want to ask if there could be etcd delay between dcs or something
[10:38:40] no issue right now, just telling you it happened to me once or twice last week
[10:39:29] he: https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?orgId=1&from=1583043822897&to=1583118597421&panelId=3&fullscreen
[10:41:37] is db1111 now part of s8?
[10:41:49] should I add it to the panels (I didn't automate it)?
[10:42:01] add it when you can please
[10:42:07] doing
[10:42:12] thanks
[10:44:15] Done: https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation
[10:45:07] ooooooh
[10:46:20] my 6-8 million cache warming is done now, and starting to read from the new store for that range shouldn't result in many more QPS, but will wait for a go ahead from you marostegui :)
[10:46:57] yes, give me a min
[10:51:42] addshore: go for it
[10:51:50] marostegui: ack! :)
[10:55:58] on 8 million now
[10:56:08] cool
[10:56:11] We are going into a meeting now
[10:57:01] ack!
[12:31:39] will go to 10 million at some point in the next 45 mins
[13:09:43] @ 10 million now
[13:10:26] seeing basically no impact with each deployment now
[13:10:34] will warm the next bit of the cache 10 -> 12 now
[13:17:31] cache warming db1126 for 10->12 had quite the initial cpu and process list spike, but it recovered quickly
[13:20:47] * addshore will hold off reading from the next section for quite a bit
[14:16:15] I am going to give db1111 more weight
[14:21:00] awesome!
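[Editor's note: the slow weight ramp described throughout the log (starting db1111 at weight 1, then repeatedly giving it "a bit more weight" until it matches db1126) can be sketched as below. The step and target values are assumptions for illustration; the log never states db1126's actual weight, and the real changes were made with dbctl rather than this helper.]

```python
def weight_steps(current, target, step):
    """Return the successive pooled weights for a gradual ramp from
    `current` up to `target`, increasing by `step` each round and never
    overshooting the target. Each returned value would correspond to one
    committed load-balancer weight change."""
    steps = []
    while current < target:
        current = min(current + step, target)
        steps.append(current)
    return steps

# Ramp from the initial sanity-check weight of 1 up to db1126's weight.
# Target 300 and step 75 are assumed values, not taken from the log.
print(weight_steps(1, 300, 75))  # -> [76, 151, 226, 300]
```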
[14:21:42] just warming the caches for 10-12 million on the 2 servers for a third time / pass, and then will try the next config change reading up to 12 mill
[14:22:03] if that sounds good?
[14:22:14] Will probably leave it at 12 for some hours then
[14:23:01] yeah, sounds good
[14:23:22] I will slowly keep increasing db1111's weight till it reaches the same as db1126
[14:23:33] lovely!
[14:39:08] at Q 12 mill :)
[14:39:23] sweet
[14:43:02] might go to 15 mill as the next step
[14:43:05] but not until later
[14:43:16] * addshore is off, will be driving for the next 2 hours
[14:43:42] let's leave it at 12M for today
[14:44:27] ack!
[14:44:40] thanks and drive safe!
[21:07:14] 10DBA, 10Product-Infrastructure-Team-Backlog: DBA review for the PushNotifications extension - https://phabricator.wikimedia.org/T246716 (10Mholloway)
[21:18:33] 10DBA, 10Product-Infrastructure-Team-Backlog: DBA review for the PushNotifications extension - https://phabricator.wikimedia.org/T246716 (10Mholloway)