[01:37:29] 10DBA, 10wikitech.wikimedia.org: Move databases for wikitech (labswiki) and labstestwiki to a main cluster section (s5?) - https://phabricator.wikimedia.org/T167973 (10Jdforrester-WMF) [01:37:41] 10DBA, 10wikitech.wikimedia.org: Move databases for wikitech (labswiki) and labstestwiki to a main cluster section (s5?) - https://phabricator.wikimedia.org/T167973 (10Jdforrester-WMF) [06:27:14] 10DBA: Possibly replace db1087 (s8) with db1127 (x1) due to disk space constrains - https://phabricator.wikimedia.org/T245107 (10Marostegui) 05Open→03Declined We are good for now after compressing all the tables: ` root@db1087:~# df -hT /srv Filesystem Type Size Used Avail Use% Mounted on /dev/m... [06:27:16] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui) [06:41:11] 10DBA, 10Operations: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui) 05Open→03Resolved Host fully repooled Thanks everyone! [06:50:27] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui) [06:51:50] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui) @jcrespo what do believe could be broken within the prometheus exporter? [07:09:23] good morning [07:09:29] o/ [07:09:49] did you get to read backlog here? [07:10:22] read yesterday's one, it is interesting [07:10:56] i was on it [07:12:17] let me summarize it in one image :-D https://usercontent.irccloud-cdn.com/file/7LwsGaAG/image.png [07:29:17] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10jcrespo) >>! In T242702#5925682, @Marostegui wrote: > @jcrespo what do believe could be broken within the prometheus exporter? I don't know exactly what, but I can see that Graphana (prometheus, really) reports mysql exporter... [07:32:14] may I suggest to switch labsdb hosts now to get that out of the way before peak times? (assuming no other blockers like sanitarium compression, or other work) [07:32:36] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui) >>! In T242702#5925711, @jcrespo wrote: >>>! In T242702#5925682, @Marostegui wrote: >> @jcrespo what do believe could be broken within the prometheus exporter? > > I don't know exactly what, but I can see that Grap... [07:33:25] 10DBA: Remove deprecated status options from grafana in mariadb 10.4 - https://phabricator.wikimedia.org/T244696 (10jcrespo) [07:33:28] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10jcrespo) [07:35:21] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10jcrespo) While I agree, note after Grafana fixes to be done, the reported error is from prometheus collection, not just missing data points. [07:36:28] o/ morning [07:37:01] * addshore is going to start his cache warming etc again [07:57:15] jynus: Looking at SAL I still only have 1 host to warm currently? :) [07:58:06] I am doing db1087 for you, but I was wating if we were going to depool it first for some final maintenance [07:58:18] ack! 
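(The table compression referenced above in T245107 and T232446, and the pending "final maintenance" on db1087, is InnoDB page compression of the large Wikibase tables. A minimal sketch of what one such step looks like on an already-depooled replica; the table name and KEY_BLOCK_SIZE are illustrative assumptions, not the exact statements used for those tasks:

    # Illustrative only: table name and block size are assumptions.
    mysql wikidatawiki -e "ALTER TABLE wbt_item_terms ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"
    # Confirm the space came back, as shown for db1087 above:
    df -hT /srv
)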
[07:58:48] I'm going to cache warm db1126 and try going to 8 million this morning (this is the highest point we got to so far) [07:59:12] I would appreciate a ping when db1087 gets pooled just in case I miss it, so that I also start the cache warming process there [07:59:30] jynus: we can do db1087 yeah, give me 15 mins please [08:09:33] no rush :-D [08:10:02] ANother thought unrelated to everything else. All of these cross cutting outages that have happened basically due to wikidata replicas becoming overloaded for one reason or another over the past years, actually smells like having some divide between the replicas that wikidata.org requests and all clients request might make sense [08:11:12] "the replicas that wikidata.org requests and all clients" what do you mean? [08:12:28] so, all wmf sites that are wikidata clients (most of them) also make calls to s8 at some point, strictly read calls [08:13:00] I see, so domain separation [08:13:04] and if a bad code deploy of repo code to wikidatawiki ends up overloading its replicas, all sites end up having issues [08:13:06] yup [08:13:07] that is something that is already supported [08:13:28] with "load groups", set a tag and we can isolate certain queries on certain servers [08:13:37] interesting [08:13:40] infrastructure wise that is very easy [08:13:45] not sure code-wise [08:13:48] any idea about how that is accessible in the mw abstraction? if at all? [08:13:49] ack! [08:13:51] that is your domain [08:13:53] :D [08:14:04] when asking for a connection [08:14:24] is this a mysql thing? do you have a link to some docs? :) [08:14:26] you can do a dbw or a dbr (those are variable names, i cannot remember the function profile) [08:14:38] it is a mw abstraction [08:15:05] and if you do a dbr, you can specify an optional extra parameter called "load group" [08:15:16] Ahh yes so this would be "Query group(s)"# [08:15:29] we can define one specific for wikidata [08:15:36] so in theroy all client connections could specify the group "client" and they could be routed somewhere else [08:15:40] and distribute queries in a way that make sense [08:15:45] interesting....... 
I might file a ticket to investigate this [08:15:46] as long as it is doable in code [08:15:57] > as long as it is doable in code < everything is doable in code :D [08:15:58] the infra is ready for it [08:16:10] what I mean is that it is already in place [08:16:20] some separations we were told were not easy [08:16:39] because they depended on inner abstractions that couldn't easilly be sent down [08:16:42] we were told [08:17:17] this is actually relevant here because the thing that overloads is "main" traffic [08:17:33] which is strange because I don't think there is a lot of main/human traffic [08:17:50] while non-human traffic should be sent in code to 'api' load group [08:18:05] so there may be a lot of partinining there that is not done properly [08:18:44] this helps a lot with performance because certan traffic has distinctive query patterns, and that keeps different cache "hot" [08:19:33] I filed https://phabricator.wikimedia.org/T246415 [08:19:47] > this helps a lot with performance because certan traffic has distinctive query patterns, and that keeps different cache "hot" < yup I just realized that and added that to the ticket too [08:19:50] jynus: I am going to start db1087 in sync and all that process [08:19:55] as repo and client have very different query patterns [08:19:58] I will ping you to review the coordinates [08:20:03] wait wait [08:20:07] I pooled db1087 [08:20:14] needs depool [08:20:15] I know [08:20:17] ok [08:20:18] sorry [08:20:32] I didn't want to be responsible for an outage :-P [08:21:08] staying around ready to help when needed [08:21:43] I have no idea if wikibase code uses the api group much, I'll have to check [08:22:37] so please add that to the questions if you create a ticket- if there is some division that makes sense for both isolation and performance reasons [08:22:59] It seems like the easiest way to inject these group is just to have a facade infront of the DB layer that adds whatever group, and then this can all be done at service construction time [08:23:16] but it is important to talk to us in case of change, so we are ready for traffic patterns shifting [08:23:46] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10wikidata-tech-focus: Investigate a different db load groups for wikidata - https://phabricator.wikimedia.org/T246415 (10Addshore) [08:23:48] yup! 
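(To make the "load groups" idea above concrete: on the MediaWiki side a caller asking for a replica connection can pass an optional query-group name, and on the infrastructure side that group can be routed to specific s8 replicas. A rough sketch of the infrastructure half, assuming a hypothetical group name "wikidataclient" taken from the T246415 discussion and the dbctl workflow referenced later in this log; not a tested procedure:

    # "wikidataclient" is a hypothetical group name; dbctl subcommands as
    # referenced elsewhere in this log.
    dbctl instance db1087 edit     # add the group with a weight under s8
    dbctl config diff              # review the generated MediaWiki db config
    dbctl config commit -m "Route wikidataclient group reads to db1087 T246415"

The MediaWiki half is just the existing "dbr plus load group" parameter described above; no schema or replication change is involved.)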
[08:24:02] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10wikidata-tech-focus: Investigate a different db load groups for wikidata / wikibase - https://phabricator.wikimedia.org/T246415 (10Addshore) [08:24:12] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10wikidata-tech-focus, 10User-Addshore: Investigate a different db load groups for wikidata / wikibase - https://phabricator.wikimedia.org/T246415 (10Addshore) [08:24:26] added dba tag for now, feel free to move it into some distant column :) [08:25:21] so the "it is already done" is true for mw-inherited code (dumps, recentchanges) [08:25:39] but it is my belive not much for wikibase-specific code [08:27:03] I had a quick skim through and dont believe we have any really, probably do for dumps, but that is probably it [08:30:02] ha [08:30:06] jynus: stop slave 's8' ; reset slave 's8' all; change master 's8' to master_host='db1124.eqiad.wmnet', master_user='repl', master_password='x' ,master_port=3318, MASTER_SSL=1,master_log_pos=249431070,master_log_file='db1124-bin.003257'; start slave 's8'; [08:30:16] I don't even think you are doing that, addshore https://phabricator.wikimedia.org/T138208 [08:30:23] (which caused issues in the past) [08:30:26] looking, marostegui [08:32:03] this is to be run on labs* hosts, right? [08:32:06] yep [08:32:33] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10wikidata-tech-focus, 10User-Addshore: Investigate a different db load groups for wikidata / wikibase - https://phabricator.wikimedia.org/T246415 (10Addshore) [08:32:40] thanks, added links between the tickets [08:33:03] 10DBA, 10Wikidata, 10Wikidata-Campsite, 10Wikidata-Trailblazing-Exploration, and 2 others: Investigate a different db load groups for wikidata / wikibase - https://phabricator.wikimedia.org/T246415 (10Addshore) [08:34:26] FYI, I'm rolling out the mariadb-10.3 updates from Buster 10.3, mostly libmariadb, but this will also update mariadb-backup on 1107 and 1114 from 10.3.18 -> 10.3.22, let me know if there's an issue [08:34:50] moritzm: should be fine, neither use mariadb 10.3 :) [08:35:21] is it me or labsdb hosts are not caught up? [08:35:26] ah, ok. I thought you were using mariadb-backup on Buster [08:35:58] oh, no sorry, I was looking at the wrong direction, ignore my last line [08:36:12] hehe, I was rechecking, I checked and they were up to date [08:36:35] I was thinking on the reverse change, one sec [08:36:56] labsdb advance because of pt-heartbeat anyway [08:36:57] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui) Should be fixed now: PEBKAC :) [08:45:11] last commit I can see is UPDATE `wikidatawiki`.`wb_id_counters` [08:46:47] which maches the other server, too [08:46:57] let me double check the command and we are good to go [08:47:41] thanks! [08:50:54] ok with it [08:50:59] sorry for taking me more time [08:51:28] no, no, please take your time, that's the point on the reviewing! [08:51:33] it is a sensible operation [08:53:37] there should be no pt-heartbeat error this time [08:57:57] how did you fix db1107? [08:58:48] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10jcrespo) [08:59:24] I was playing with /root/.my.cnf when I deployed the host and broke it [08:59:28] And didn't notice [09:00:14] lol wat? 
but that shouldn't affect prometheus [09:00:35] yes, when starting: FATA[0000] failed reading ini file: open /root/.my.cnf: no such file or directory source="mysqld_exporter.go:213" [09:00:46] oh, I didn't know it read it [09:00:54] will need to check [09:00:57] yep [09:01:07] so, ok with labs? [09:01:25] yes, move one first [09:01:33] yes [09:01:47] stop replication on the others and restart it everywhere else [09:01:55] I know [09:02:05] * addshore steps out for ~20 mins [09:02:07] :-) [09:02:16] I was asking if you were ok with the command [09:02:22] yep [09:04:06] labsdb1009 catching up nicely [09:04:15] cool [09:04:45] waiting for it to be fully in sync [09:04:48] before moving others [09:07:58] ugh, there is 2 labsdb1009 at https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&fullscreen&panelId=12&from=1582870060280&to=1582880860280&var-dc=eqiad%20prometheus%2Fops [09:08:09] I must have broken something [09:08:34] uh? [09:08:37] what's that? [09:09:22] maybe cause it had to s8 channels or something cached? [09:09:32] like the one from codfw master and the one from eqiad? [09:09:51] no [09:10:00] I've seen there is 2 files on prometheus [09:10:03] Do you want me to proceed with labsdb1010? [09:10:32] yes [09:10:35] mysql-labsdb_eqiad.yaml [09:10:39] aaaah [09:10:42] and mysql-labs_eqiad.yaml [09:10:45] but I don't know why [09:12:24] they have the same content except for: https://phabricator.wikimedia.org/P10552 [09:13:35] I am going to delete them, run the script, see what happens [09:15:51] only labsdb got regenerated [09:16:09] so maybe things got renamed at some point [09:16:50] that's so weird [09:17:25] I may have renamed or the port changed [09:17:37] but an unpuppetized file be kept [09:17:59] and because they are labs hosts, we do not look them on certain graphs a lot, only we noticed now [09:20:06] sorry for the issues, my fault [09:21:06] not a big deal! [09:21:11] 1010 caught up [09:21:16] going to move 1011 and 1012 [09:21:39] great [09:23:12] I think there is also another bug [09:23:39] the file that was "active" is pointing to labsdb1009:19104, but the right port is 9104 [09:23:49] looks like a typo no? [09:23:52] probably [09:23:55] checking now [09:24:16] all labs moved [09:24:23] maybe someon at cloud saw it and correct it manually [09:24:23] going to repool db1087 with the weight you set yesterday [09:24:27] we can discuss its weight later [09:24:34] but the bug may be there still [09:27:44] yep, it belives that multisource hosts are like multiinstance [09:27:53] and adds 10000 to the port [09:37:07] So, I have left db1087 and db1101:3318 with their original weight and groups [09:37:30] We can probably give 50 more to db1101:3318 and db1099:3318 (from 150 to 200) [09:41:13] so know that i understand the issue [09:41:23] I also need to keep compressing two more hosts in s8, db1094 and db1104, which I was hoping to do next week [09:41:26] the key, aside from having one more server [09:41:32] is to "ignore" api servers [09:41:42] what do you mean? [09:42:01] they are barely used, and we are acostumed they have most of the load [09:42:17] so we should just mix api + main load for wikidata only [09:42:34] should we temporarily place db1114 into s8? [09:42:44] 1114? [09:42:49] is that the test host? 
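(For the channel move above, the stop slave 's8' / change master 's8' / start slave 's8' sequence, a generic way to confirm the named connection is running and caught up on each labsdb host, using MariaDB multi-source syntax; a plain check, not the exact verification done here:

    mysql -e "SHOW SLAVE 's8' STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
    # Or all replication connections at once:
    mysql -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|Seconds_Behind_Master'
)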
[09:42:52] yep [09:42:55] would need provisioning [09:43:00] not a bad idea [09:43:04] yeah, but shouldn't take long [09:43:11] it may be buster [09:43:15] so would need reimage too [09:43:32] yeah, I was thinking about buster+10.4 as that's part also of the OKR of having one host per section running 10.4 [09:43:47] and a good test for complex queries [09:43:49] let me deploy this fix for this embarrasing bug [09:43:53] haha [09:44:03] and then I can hear you properly [09:44:08] cool [09:44:11] but everything you say seems good to me [09:44:39] I just want to tell you how I saw the issues yesterday [09:45:29] yeah, my reasoning is: 10.4 on enwiki has gone good, and I was planning to move to another section for testing 10.4, right now s8 has load issues, so we can test 10.4 there and check how it handles complex queries, so maybe placing db1114 there would get us those two things addressed (10.4 testing under another heavy environment + more capacity to split the load) [09:46:09] It might not be done today, but I can leave the host with 10.4 and the transfer from the s8 snapshot running so it can catch up during the weekend [10:00:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/575479 [10:00:12] back [10:00:35] I am ok with that, as long as it is only temporary until more capacity is added [10:00:46] of course :) [10:00:49] I will need the test hosts at some point :-D [10:00:53] yep [10:00:54] definitely [10:01:10] it has percona, you will need to recover it from the latest snapshot [10:01:36] no, I am going to fully reimage it [10:02:08] ok [10:02:16] let me see if there was something of value there [10:02:17] with buster and 10.4 [10:02:22] like a config or something [10:02:35] we should probably save my.cnf just in case [10:03:33] actually I was looking at that [10:03:43] I kept the my.percona.cnf with the ammended config [10:03:52] didn't bother to puppetize it just for a test [10:04:03] of course [10:04:05] but that woudl prevent working again [10:04:20] so you've got it? [10:04:23] will copy that and may comit it ,alongside the one for mysql [10:04:24] one sec [10:04:28] excelent [10:04:33] no worries, I am preparing the reimage patch [10:05:15] I have it now [10:05:38] there is also .debs, but I have those on dbtools [10:06:14] cool! [10:10:29] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/575482/ [10:11:38] I am not sure I want to commit it, but at least it is not only on 1 place: https://gerrit.wikimedia.org/r/c/operations/puppet/+/575483/1 [10:11:56] I would commit it [10:12:22] I am saying because it would need more work on a real scenario [10:12:28] no templating, etc. [10:12:34] yeah, but it is an starting point [10:12:40] how do you add an instance to etcd? [10:12:50] do you have to run a command? [10:13:01] no [10:13:02] just https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/575482/1/conftool-data/dbconfig-instance/instances.yaml [10:13:06] and run puppet there [10:13:15] and then you have to fill it out with dbctl instance dbXXXX [10:13:22] dbctl instance dbXXXX edit [10:13:57] so after deployed there it is not pooled automatically? it is just like added to the list of available hosts, right? 
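(Condensing the dbctl exchange above into one sketch, with db1111 as the example host; adding the instance only makes it known to dbctl, and nothing serves traffic until the config is committed:

    # 1) Add the host to conftool-data/dbconfig-instance/instances.yaml in the
    #    puppet repo (as in the gerrit change linked above) and run puppet.
    # 2) Fill in its section, weight and groups, and pool it when ready;
    #    this alone does not change anything live:
    dbctl instance db1111 edit
    # 3) Review and commit before the new config reaches MediaWiki:
    dbctl config diff
    dbctl config commit -m "Pool db1111 into s8 T246447"
)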
[10:14:05] yeah [10:14:07] exactly [10:14:08] thanks [10:14:12] +1 [10:14:22] cheers [10:14:24] checked notifications will be disabled [10:14:33] :) [10:14:51] you know that way better than me, so not sure how useful I am with my review [10:15:07] but I will put you to the test to recover a snapshot :-D [10:15:09] always helpful, easy to miss stuff or make typos [10:15:10] haha [10:15:57] what I can do for help is update tendril and zarcillo [10:16:11] sounds good :) [10:16:14] although I may wait until actual power off [10:16:19] yeah [10:19:18] I think this is not happening today [10:19:21] Error: Unable to establish IPMI v2 / RMCP+ session [10:19:43] let me try to reseat it locally [10:21:06] ha [10:21:17] I think actually db1114 may not be a good idea [10:21:32] I think we switched it now that I remember because of hw problems [10:21:38] did we? [10:21:41] yeah [10:21:46] :( [10:21:49] I didn't remember until you saw me the error [10:22:08] and it crashed frequently, so we switched it with db1118 or something [10:22:11] let me get the tasks [10:22:18] mmm that sounds familiar [10:22:48] Then, the only hosts we have are the MCR ones [10:23:00] Which are most likely not being used at all, but they are not "ours" [10:24:00] who technical owns the MCR ones? :P [10:24:20] marostegui: T214720 T229452 maybe others [10:24:20] T229452: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 [10:24:21] T214720: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 [10:24:44] addshore: I can check my original email thread, but I think Daniel K was one of them [10:25:04] we can search another [10:25:29] and the idea would be that you mihgt want to use one of them for s8? :P I'm pretty sure if it is daniel K / sdoc people "own" them I could convince them to give them to s8 ;) [10:26:10] addshore: yeah temporarily yes, until we decide a longer term solution (which so far looks like buying one, if we have the capacity to do so) [10:27:00] that was test-s4, right? [10:27:18] yep [10:27:57] let's check the last time that had activity, if it was long time ago, we could have a better argument [10:28:53] or we could even just install s8 alongside it [10:28:54] addshore: From my original emails, looks like Daniel, Katie and Amanda [10:29:16] 1.8T available, it may fit [10:29:23] Ack, Katie isn't around any more :) I guess the last thing those servers were actually used for was the prep of wb_terms stuff :P [10:29:32] addshore: I believe so yeah [10:29:52] If its Daniel and Amanda that need convincing, I would consider them already convinced ;) / I could ask them later [10:29:59] let's ask, there may be data valuuable [10:30:01] etc [10:31:07] the data on db1112 (the slave) is the same as db1111 (The master), so we can rebuild db1112 once we are fully done with it [10:31:12] from db1111 and will have the same data [10:31:46] metrics for labs didn't reappear (but the host is still on the list) checking [10:33:37] puppet also needs update [10:33:46] so requires followup [10:35:02] addshore: do you want me to send them an email? [10:35:15] marostegui: sounds good to me, cc me ? :) [10:35:29] yep, you and amir [10:35:59] just to be clear, I didn't remember the db1114 issue [10:36:05] I mean, someone from the structured data team of wmf sent me a box of boxes of chocolates the other day, so I think I'm in their good books :P [10:36:08] as it happened some time ago [10:36:16] when are you sharing those! 
[10:36:17] but a bulb when on when you said it had ipmi issues [10:36:25] its how it starts though, first chocolates, next db servers [10:42:09] hopefully this should be it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/575487 [10:58:23] 10DBA: Check why compare.py doesn't work with Percona 8.0 - https://phabricator.wikimedia.org/T243265 (10jcrespo) [10:58:25] 10DBA, 10Operations: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (10jcrespo) [10:58:35] 10DBA, 10Wikimedia-Incident: Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (10jcrespo) [10:58:37] 10DBA, 10Patch-For-Review, 10User-Banyek, 10Wikimedia-Incident: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (10jcrespo) [10:58:39] 10DBA, 10Operations, 10observability: Display lag on grafana (prometheus) and dbtree from pt-heartbeat instead (or in addition) of Seconds_Behind_Master - https://phabricator.wikimedia.org/T141968 (10jcrespo) [10:58:41] 10DBA, 10Operations, 10Traffic, 10Patch-For-Review: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (10jcrespo) [10:58:45] 10DBA, 10Release-Engineering-Team-TODO, 10Epic, 10Release-Engineering-Team (Deployment services): Implement a system to automatically deploy schema changes without needing DBA intervention - https://phabricator.wikimedia.org/T121857 (10jcrespo) [10:58:48] 10DBA, 10Operations, 10Privacy Engineering, 10Traffic, and 4 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499 (10jcrespo) [13:28:56] 10DBA, 10Upstream: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 (10Marostegui) >>! In T245489#5906183, @Marostegui wrote: > Created https://jira.mariadb.org/browse/MDEV-21794 Update: Looks like the bug's fix is now pending code review: http://lists.askmonty... [13:58:17] can I try to reset db1114 mgmt? [13:58:29] worst case scenario, db1114 goes down [14:01:14] done, didn't need a reset [14:01:26] I did a reset [14:01:29] From the idrac [14:01:34] (I said so on the task) [14:02:02] I am still waiting for it to come back [14:02:21] It is back! [14:03:48] Even though it is fixed, I am not sure I want to use this host anyways in production [14:04:11] yep, it needed a password change [14:05:59] So from what I am seeing, this host got the mainboard replaced [14:06:03] after all the issues [14:06:11] Maybe it is trustable? 
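(For the management-controller trouble above, the "Unable to establish IPMI v2 / RMCP+ session" error, a generic way to confirm the controller answers again after the iDRAC reset; the .mgmt hostname pattern and prompting for the password are assumptions, not the exact WMF procedure:

    ipmitool -I lanplus -H db1114.mgmt.eqiad.wmnet -U root -a chassis status
    ipmitool -I lanplus -H db1114.mgmt.eqiad.wmnet -U root -a sel elist | tail -n 20   # recent hardware event log
)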
[14:06:51] cannot say, up to you [14:07:25] we've had enough party on s8 already, to place a non 100% trustable host [14:07:33] I am going to wait for Amanda to come back and use db1112 instead I think [14:08:02] We won't have either db1114 or db1112 ready today anyways [14:08:16] what I would do for the weeked [14:08:26] is reduce db1126 load [14:08:43] check cpu graphs, even fridays, which are low load, it reached 87% cpu [14:08:59] db1087 can take more load, and maybe others [14:09:08] yeah, db1087 needs more [14:09:24] we should try to keep all below 50% [14:09:34] independently of weights and groups [14:09:36] yeah, db1126 is peaking 90% [14:10:15] at this moment is when I would like to have clusters by section [14:10:22] so we could compare cpu [14:10:28] maybe we can create a custom one [14:10:42] have a look at weights for s8 [14:10:49] I have increased db1087 right now [14:10:52] while I try to create a custom dashboard [14:11:37] db1101 and db1099 can probably take a bit more too [14:11:52] going to wait a bit to let traffic shift [14:11:55] before changing more [14:12:53] o/ /me is back [14:13:14] * addshore reads up [14:18:35] db1126 liked that change and it is decreasing its CPU usage [14:18:45] And it is moving back to around 40% [14:18:53] which is healthier, although it has some spikes [14:18:56] let's see how it goes [14:19:24] addshore: I don't think we should be moving more Q today, let's leave the traffic settle a bit so we can better observe how the hosts perform during the weekend [14:19:40] addshore: Unfortunately we are not having the MCR host ready by today (waiting for Amanda's green light) to borrow it [14:20:09] Sorry to block you, but let's play safe for the weekend [14:20:52] marostegui: okay with me :) [14:21:09] I hope early next week we can get that temporarily new host [14:21:10] I'll come back on monday and poke you all again! [14:21:14] that'll give us some more room [14:36:31] so I have https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation for now [14:36:41] I will now put the % on the right [14:37:09] nice [14:37:17] that'll help a lot with the traffic shifting [14:38:02] I am not great with design, once I finish the editing will ask for help to make it pretty [14:53:31] I've added everything I wanted, made Y axis the same on all: https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation [14:55:56] oooh nice [14:59:27] I was playing with folders on grafana, should I create one for all db stuff? [14:59:37] e.g. "Databases" [15:00:01] I was actually creating one dashboard about masters QPS, and creating an alert for myself [15:00:14] I created an alert group named DBAs but only added my email (don't want to spam you with my tests for now) [15:00:22] So, yeah, let's create a DAtabase folder indeed [15:00:40] as far as I understood, it is a mere logical classification [15:00:46] shouldn't touch anything else [15:01:31] oh shit I have a meeting [15:11:18] language [15:21:03] 10DBA, 10Patch-For-Review: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui) db1107 has been performing fine during the whole week, so I am not removing it from production and will leave it serving traffic during the weekend. 
[15:21:25] addshore: we got green light for db1112, so hopefully we'll have it ready by early next week [15:23:25] 10DBA: Move db1112 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) [15:24:19] 10DBA: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) [15:24:38] 10DBA: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) p:05Triage→03High [15:25:28] 10DBA: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) ` root@db1077.eqiad.wmnet[(none)]> stop slave; show slave status\G Query OK, 0 rows affected, 1 warning (0.00 sec) *************************** 1. row *************************** Slave_IO_State:... [15:36:37] marostegui: woo! [15:40:48] 10DBA, 10Patch-For-Review: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10Marostegui) Confirming db1111 has nothing before re-imaging: ` root@db1111:~# mysql -e "show processlist ; show slave status\G" +---------+-----------------+-------------------+--------------------+--------... [15:40:59] 10DBA, 10Patch-For-Review: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1111.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202002281540_marosteg... [15:57:56] 10DBA, 10Patch-For-Review: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1111.eqiad.wmnet'] ` and were **ALL** successful. [16:01:12] I am around, but wanted to leave early, last days have been long... [16:01:23] Go! [16:01:26] Have a good weekend :* [16:01:31] I will leave in a bit at as well [16:01:45] Going to start the reimage of db1111 and leave [16:04:03] so will not provision it for now? [16:04:35] I reimaged it with stretch by mistake [16:04:39] so going to do it with buster [16:05:05] go leave for the weekend! [16:12:02] 10DBA: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1111.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202002281611_marostegui_79481.log`. [16:12:57] don't work too much! [16:13:06] see you [16:13:11] Bye!! <3 [16:28:54] 10DBA: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1111.eqiad.wmnet'] ` and were **ALL** successful. [16:30:47] magic reimaging magic
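(A closing note on provisioning db1111 for s8: the "recover it from the latest snapshot" step that was left pending is, in outline, restoring the datadir and then pointing replication at the coordinates recorded with the snapshot. A rough sketch under stated assumptions; the transfer itself uses WMF tooling (T156462) not reproduced here, and the paths and service name are assumptions:

    systemctl stop mariadb
    rm -rf /srv/sqldata && mkdir /srv/sqldata
    # ... transfer and extract the latest s8 snapshot into /srv/sqldata ...
    chown -R mysql:mysql /srv/sqldata
    systemctl start mariadb
    # Then set the s8 coordinates with the same CHANGE MASTER ... / START SLAVE
    # form shown earlier in this log, and let the host catch up over the weekend.
)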