[02:34:53] 10DBA, 10MediaWiki-Page-derived-data, 10Performance-Team (Radar), 10Schema-change: Avoid MySQL's ENUM type, which makes keyset pagination difficult - https://phabricator.wikimedia.org/T119173 (10Krinkle) p:05Triage→03Medium [05:29:33] 10DBA, 10Operations: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (10Marostegui) 05Open→03Resolved This was done. We started a bit later than expected due to some on-going issues with another service. RO started: 05:20:59 RO finished: 05:23:34 [05:29:36] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [05:30:02] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [05:30:22] 10DBA, 10Operations: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (10Marostegui) [05:31:48] 3000% cpu usage [05:33:18] there is a lot of "delete from processlist where server_id = @server_id" [05:33:24] I don't think that is normal [05:33:52] is that coming from the watchdog or what? [05:34:19] I think all hosts are trying to be updated at the same time [05:34:25] which is not great for performance [05:34:34] I would give it a few minutes [05:34:38] and check the status again [05:35:18] hopefully the events don't pile up after the first run [05:35:36] I think it will stop working, actually it is already increasing for all of them [05:36:36] Errr [05:37:11] root@db1115:~# host 10.64.32.13 [05:37:12] 13.32.64.10.in-addr.arpa domain name pointer orespoolcounter1002.eqiad.wmnet. [05:37:52] Why do we have that ip issuing a delete? [05:38:27] Or mwmaint1002 running delete from processlist where server_id = @server_id as root? [05:39:27] I don't think that's right, must be a dns issue [05:41:07] it is also pointing to cumin1001, I think that is just misleading [05:42:31] CREATE PROCEDURE [05:42:40] so looks like 81166 root 10.64.32.25 tendril Connect 1745 Copying to tmp table insert into processlist_query_log\n (server_id, stamp, id, user, host, db, time, info)\n 0.000 is holding everything from running smoothly? [05:43:03] it is like running a lot of setup host processes [05:46:37] cannot say, honestly [05:46:50] lots of inserting and CREATE * things running [05:46:55] Going to stop the event_scheduler [05:47:13] to see what cleans up and what keeps running [05:47:42] activity back to 0% [05:47:52] so it is the scheduler that is causing high load [05:48:03] | 240574 | root | 10.64.32.25 | tendril | Connect | 172 | Copying to tmp table | insert into processlist_query_log [05:48:03] (server_id, stamp, id, user, host, db, time, info) [05:49:04] and q.stamp > now() - interval 7 day [05:49:37] 245 seconds and still didn't finish [05:51:42] this is me right now: https://youtu.be/12LLJFSBnS4?t=18 [05:51:54] hahahahahahahaha [05:52:22] I am trying to see wtf Copying to tmp table | insert into processlist_query_log [05:52:22] (server_id, stamp, id, user, host, db, time, info) that is [05:52:24] and where it is coming from [05:52:34] the fact that it takes more than 5 minutes to finish is concerning [05:52:56] not sure if that might be causing the rest of things to pile up [05:54:18] maybe that table just exploded? [05:54:26] that table is huge, so...
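For reference, the checks being juggled above reduce to a couple of plain MariaDB statements. A minimal sketch only — the 60-second threshold and column list are illustrative, the table and state names come from the log:

```sql
-- Look for long-running tendril statements (the "Copying to tmp table" insert above):
SELECT id, user, host, db, time, state, LEFT(info, 120) AS query
FROM information_schema.processlist
WHERE command <> 'Sleep'
  AND time > 60
ORDER BY time DESC;

-- Tendril's per-host polling runs as scheduled events; pausing the scheduler
-- stops new runs from piling up while the backlog is inspected:
SET GLOBAL event_scheduler = OFF;
```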
[05:54:33] that may be one of the maintenance processes that run from time to time [05:54:44] on a cron [05:55:26] to be honest, I would just truncate that table [05:55:36] (even if it is not the source of this problem) [05:55:39] sure [05:55:46] but you will have to kill the query first [05:55:51] yeah [05:56:38] https://phabricator.wikimedia.org/P11134 that's crazy XD [05:57:09] it is the query log, I think it is normal for it to be big [05:57:22] yeah I know, but we haven't purged it in years [05:57:34] it gets purged [05:57:56] a partition gets created and dropped, see partition time [05:58:28] yeah, but it is insane that it grows 16GB in one day [05:59:14] if you intend to kill, better sooner than later :-D [05:59:56] yeah, the join with processlist_query_log must be just crazy for it to finish [06:00:04] if the table is that huge [06:00:08] going to kill that insert [06:03:17] jynus: let's truncate the table? [06:03:30] ok to m1 [06:03:33] ok to me [06:03:43] ok, doing it and after that I will start the event scheduler [06:04:46] https://phabricator.wikimedia.org/P11135 [06:04:50] going to start event scheduler [06:05:15] ok, done [06:06:29] I am going to do a recap on phabricator [06:06:39] so all this is recorded for posterity [06:06:54] lol [06:10:06] things are looking good so far now [06:10:12] yep [06:10:41] for some reason the processlist got much larger laterly [06:11:20] 3GB vs 15 GB [06:11:22] are you comparing the backupsizes? [06:11:33] we don't backup tendril [06:11:37] only zarcillo [06:11:40] ah [06:11:46] not by choice [06:11:53] but we do backup tendril's schemas? [06:12:00] it doesn't work because of external locking [06:12:23] we have a backup of it on dbprov and the secondary host [06:12:31] yep [06:13:25] going to get some breakfast and will document the tendril's incident [06:13:30] on its phab task [06:13:36] I will go for a walk [06:13:42] enjoy :) [06:13:47] I will ask you about something non-work later [06:13:55] ok! [06:13:56] but related to databases :-D [06:33:59] something went wrong with es5 backup [06:34:03] will check it later [06:40:13] 10DBA, 10Privacy Engineering, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10jcrespo) a:03jcrespo [06:42:02] 10DBA, 10Privacy Engineering, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10jcrespo) Thanks, useful evaluation and notice, @JFishback_WMF, taking it from here to generate the exports without the problematic rows. [07:08:23] 10DBA, 10Operations, 10Patch-For-Review: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (10Marostegui) We had an issue with tendril today where tendril was very slow and almost unresponsive, at first I thought it was another case of {T231769}, but it wasn't. First of a... [07:15:08] marostegui: following the discussion from yesterday, what hosts do we want to reimage to buster+10.4 for testing and load purposes? [07:16:06] kormat: we need to upgrade at least another host from es4 or es5 but in eqiad this time [07:17:48] heh. https://tendril.wikimedia.org/tree isn't loading for me [07:18:00] is it down again? 
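The cleanup agreed on above (kill the stuck insert, truncate the query log, restart the scheduler) amounts to something like the following sketch; the thread id is the one shown in the processlist earlier and the schema name `tendril` is the one seen on db1115:

```sql
-- How big has the query log grown, and how are its partitions laid out?
SELECT partition_name, table_rows,
       ROUND(data_length / 1024 / 1024 / 1024, 1) AS data_gb
FROM information_schema.partitions
WHERE table_schema = 'tendril'
  AND table_name = 'processlist_query_log';

-- Kill the stuck insert first (id 240574 from the processlist), then reset the
-- table and let the event scheduler resume:
KILL 240574;
TRUNCATE TABLE tendril.processlist_query_log;
SET GLOBAL event_scheduler = ON;
```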
[07:18:10] hosts and activity look ok [07:18:53] so that's a different issue then, as the DB host this time looks fine [07:18:55] let's investigate [07:20:52] `PHP Notice: Undefined offset: 1661 in /srv/dbtree/inc/tree.php on line 14` [07:21:30] $host = $this->hosts[$host_id]; [07:21:31] XD [07:23:22] I guess 1661 was a host ID [07:23:31] root@db1115.eqiad.wmnet[tendril]> select * from servers where id=1661; [07:23:31] Empty set (0.00 sec) [07:24:58] so everything but tree seems to be working [07:31:11] there is something weird with the data [07:31:23] 1661 is in global_Status_log, and slave_status [07:31:30] (i don't know if that's expected or not) [07:31:48] There is a duplicate host with the same id [07:31:51] let me fix that [07:31:56] and they both have 1660 as id [07:32:04] before you do - how can i see this? [07:32:22] ah, I was examining db1115 tendril database, on the servers table [07:32:34] so you've already hid the evidence? :) [07:32:38] and tendril's code runs on dbmonitor1001 [07:32:44] No, I haven't touched it yet [07:33:02] `select COUNT(*) from servers where id=1660;` gives me `1` [07:33:21] yes, same, but look at this: [07:33:33] | 1660 | es2020.codfw.wmnet | 3306 | 10.192.0.157 | 2020-05-05 07:22:53 | NULL | 2020-05-04 14:00:00 | 2020-05-05 06:30:00 | 2020-05-05 07:23:18 | 2020-05-05 07:22:39 | 2020-05-05 07:23:33 | 2020-05-05 07:23:25 | NULL | 180355229 | 171966665 | [07:33:33] | 1657 | db1107.eqiad.wmnet | 3306 | 10.64.0.214 | 2020-05-05 07:23:24 | NULL | 2020-05-05 07:00:00 | 2020-05-05 06:27:00 | 2020-05-05 07:23:09 | 2020-05-05 07:23:15 | 2020-05-05 07:23:32 | 2020-05-05 07:23:30 | NULL | NULL | NULL | [07:33:33] | 1660 | es2020.codfw.wmnet [07:34:01] wtf, it is gone now? [07:37:24] we can try to insert a dummy row with id=1661 [07:38:01] are there delete triggers (is that the right word?) to remove entries from other tables if a server row is deleted? [07:38:44] I guess so, but it is all a big of black magic yeah. But if a host is removed, the events are (or should be) removed as part of the tendril-drop script [07:39:13] you mean the glorious 1k-line bash script? :) [07:39:18] yes! [07:39:19] haha [07:39:32] oh, sorry, that's the add script. the drop script is quite simple [07:39:36] so that removes all the events from the host that is being disabled and dropped [07:41:02] something i've been wondering: should these glorious scripts be using a transaction? [07:43:38] so from what I can see it was working today around 6:30AM CEST [07:49:42] marostegui: this error might be a red-herring. it's present in even logs from 2020-04-16 [07:49:54] the offset one? [07:49:57] yep [07:50:27] Yeah, I was checking enabling php errors on tree.php and it wasn't very successful there [07:52:22] I am sure this is too much of a coincidence and it is most likely related to today's issues [07:52:45] `PHP Warning: gethostbyaddr(): Address is not a valid IPv4 or IPv6 address in /srv/tendril/lib/utility.php on line 507, referer: https://tendril.wikimedia.org/tree` [07:53:47] that could be because of my tests with a dummy hosts (if it is from a few minutes ago) [07:53:53] it is, yeah [07:54:14] so... so far we have zero useful debug info from tendril. "neat" [07:54:20] see what I say about killing? if we had invested all time time in maintaining in creating a new, simpler one, we would have 2 developments [07:55:01] hmm. 
there's a js error [07:55:09] `Error: Must call google.charts.load before google.charts.setOnLoadCallback` [07:55:48] uf, if api was updated, we are for a fun ride [07:56:50] "SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data Resource URL: https://tendril.wikimedia.org/jquery-1.9.1.min.js" [07:57:46] also "Using //@ to indicate sourceMappingURL pragmas is deprecated. Use //# instead" [07:58:38] https://developers.google.com/chart/interactive/docs/basic_load_libs#update-library-loader-code [07:58:50] so the api has changed, but i don't know when [07:59:29] buf :( [07:59:39] "To update your existing charts, you can replace the old library loader code with the new code" [07:59:42] i wonder if we can pin to the older version. i'll poke around. [07:59:44] cool [07:59:54] very useful documentation :-D [08:00:32] maybe it is a 4 code change? [08:01:26] v47 was released on 2020-01-06, previous version was 2018-10-01 [08:01:47] it should be a 1 code change, I am going to try it on server directly, ok? [08:02:08] *1 line of code [08:02:22] it's not :) [08:02:38] or at least not from what i can see [08:02:40] but sure, go for it [08:02:46] you can't break it any more than it's already broken :) [08:03:21] kormat: btw: https://phabricator.wikimedia.org/T96499 [08:03:30] yep :) [08:04:03] 'Module "current" is not supported.' [08:04:43] will just mimic the code snippet [08:04:54] maybe the api of the api changed too :-D [08:10:04] Am I doing something wrong? [08:10:15] ReferenceError: drawChart is not defined [08:12:43] sorry I had to restart my laptop [08:12:46] It got totally frozen [08:13:37] what are you currently doing jynus ? [08:13:38] I seem, I have to update google.setOnLoadCallback(drawChart); [08:13:41] *see [08:13:52] as soon as I fine where is that defined :-D [08:13:54] *find [08:14:31] only in 6 places :-D [08:15:02] haha [08:15:42] yep, tree is back [08:15:53] namespace changed [08:15:56] can you do a recap of what was needed? [08:16:02] yeah, yeah [08:16:10] I was waiting for the patch to explain [08:16:17] not that I wasn't [08:16:28] just it would be easier with a patch if you let me [08:16:32] this is not a fixc [08:16:39] this is just "a test for a proper fix" [08:17:19] summary: just google api changed not our fault this time [08:17:41] I am wondering if this could have had something to do with the early overloads or it is just pure coincidence? [08:17:49] coincidence [08:17:51] Even though we fixed the earlier issues by truncating the huge table [08:17:56] I belive people were having tree issues [08:18:13] maybe deprecation happened at different times due to google's cdn/dcs [08:18:19] for contex of the early issues kormat : https://phabricator.wikimedia.org/T231185#6107666 [08:18:20] it finally hit europe [08:18:37] so this is unrelated, this was just javascript [08:21:00] downloading patch and reviewing in a second [08:24:32] marostegui: https://gerrit.wikimedia.org/r/c/operations/software/tendril/+/594412 [08:24:56] it is not a complete patch, it needs to change all instances of google.setOnLoadCallback(drawChart); [08:25:38] but that is what it is on dbmonitor1001 right now, just for explanation [08:25:40] I see so essentially: https://developers.google.com/chart/interactive/docs/basic_load_libs#update-library-loader-code [08:25:46] Let's include that on the commit message? 
[08:25:48] the url I mean [08:25:53] yeah, but our usage is not that trivial [08:25:58] yeah [08:26:16] because the function is available on some pages [08:26:23] so cannot be on the header [08:26:31] I don't even know how that worked before [08:26:41] I guess it never did- it just errored all the time [08:27:00] yes [08:27:03] will add that [08:27:07] just wanted to share quickly [08:27:12] yep! [08:27:19] I have to finish the patch [08:27:26] maybe kormat can help me review it? [08:27:29] I hope they don't start changing more things or this will be fun [08:27:36] and also make another for dbtree [08:27:41] which it is also broken [08:27:57] yeah, as they are separate files :-/ [08:28:12] more than files, repos :-D [08:29:38] let me take a quick break [08:30:46] i think we should pin to a specific version [08:31:14] we can replace `'current'` with `'47'`, for example [08:32:59] can that be done actually? [08:33:18] yep, documented here: https://developers.google.com/chart/interactive/docs/basic_load_libs#load-version-name-or-number [08:33:44] marostegui: so, going back to the start of all this, how about i reimage es1024 (from es5)? [08:33:53] oh, that maybe a better approach - I am scared if they started changing stuff more often, before we've gotten rid of tendril [08:34:01] yep, exactly [08:34:44] kormat: es1024 sounds good [08:35:12] alrighty. i'll do that now. then i'll start looking at partman vomit. [08:35:33] XD [08:36:05] kormat: regarding the suggestion about the version, I would suggest we discuss on jaime's patch [08:36:09] so it doesn't get lost on irc [08:36:34] +1 [08:37:19] done [08:37:56] thank you! [08:58:57] you can check updated patch, that should be more "presentable" [08:59:04] I will check dbtree now [09:00:29] I wonder if https://www.google.com/jsapi is still needed [09:01:04] I can try dropping it on prod, see if something else breaks [09:02:29] yeah, I think it is not needed [09:08:15] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) >>! In T251188#6105539, @Tchanders wrote: > @jcrespo @Marostegui - thanks for pinging AHT. This would... [09:10:57] jynus: another pass done [09:11:31] my bad [09:12:18] but please check patch 3 (soon 4) [09:13:07] i thought i did, but :gerrit: [09:16:13] he [09:16:34] has anyone noticed that gerrit is pretty terrible? [09:16:36] it changes color when on an outdated version [09:16:41] it is super clear! 
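The fix being discussed is essentially Google's documented loader migration; below is a rough before/after sketch (not the actual tendril/dbtree patches, which touch several pages), with the version pinned to '47' instead of 'current' as suggested above:

```html
<!-- old: loader from www.google.com/jsapi
     google.load('visualization', '1', {packages: ['corechart']});
     google.setOnLoadCallback(drawChart);                            -->

<!-- new: gstatic loader, pinned to a frozen release instead of 'current' -->
<script src="https://www.gstatic.com/charts/loader.js"></script>
<script>
  google.charts.load('47', {packages: ['corechart']});
  google.charts.setOnLoadCallback(drawChart);

  function drawChart() {
    // the existing chart-drawing code stays as it was
  }
</script>
```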
[09:16:55] I wonder if you are using polygerrit or old gerrit [09:17:13] polygerrit [09:18:27] let me know if ok to deploy [09:18:38] and yes, that last change is a bit risky [09:18:48] but this is tendril we are talking about [09:19:45] LGTM'd [09:20:01] different SLA require different approaches :-D [09:20:21] SAL of tendril right now is "50% of the times works" [09:20:24] *SLA [09:20:42] I've also uploaded https://gerrit.wikimedia.org/r/c/operations/software/dbtree/+/594422 [09:21:09] which explains the "dbtree doesn't work" reports we got laterly from 1 person [09:21:22] I will update the jquery version there, too [09:21:26] LGTM'd too [09:21:55] one strange thing of this repo [09:22:00] is the deploy is manual [09:22:24] one has to go to dbmonitor host and rebase [09:22:52] we could do it automatically on merge, but I decided not to, but I am no longer in charge :-D [09:24:53] i'm currently looking at the db case in netboot.cfg, and i'm wondering how exact we want the patterns to be [09:25:14] because some of them cover a lot more hosts than we have, and some are very very exact, and i'm not sure what the rationale is [09:25:31] kormat: basically we cover more db* hosts [09:25:35] so we don't have to worry about new ones [09:25:44] the exact ones is because we don't have many of them anyways [09:25:49] and we don't usually buy those [09:25:54] but db* hosts, we usually buy [09:26:01] ie: es hosts, we rarely do, or pc [09:26:53] is it me or has the graphical representation changed, too: https://tendril.wikimedia.org/report/slow_queries?host=^db&user=wikiuser&schema=wik&hours=1 [09:27:04] seems... different [09:27:13] yeah [09:27:16] it is different [09:28:24] poke around and report if you see something broken... that wasn't broken befor ofc [09:28:39] marostegui: hmm. the other question is - what happens if you pxe boot with partman/custom/no-srv-format.cfg? i know it causes partitioning to fail, but is that meaningfully different from the partitioner waiting for human input? [09:29:03] yeah, it errors out and the install cannot complete at all [09:29:06] yes, it won't find the root system [09:29:22] this was mostly an accidental finding, but we ended up thinking of it as a feature [09:29:31] i might not be asking this question correctly [09:29:36] so even an op couldn't manuualy operate it [09:29:37] :-D [09:29:50] ah. and that's a desired feature? [09:30:00] well, depends on the alternative [09:30:01] because if it isn't, we could drop the entire case [09:30:11] in which case all db hosts will pause looking for manual input [09:30:15] ideally, the install would not even start [09:30:17] no automatic data loss [09:30:20] 10DBA, 10Operations, 10Privacy Engineering, 10Traffic, and 4 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (10Marostegui) For the record: https://gerrit.wikimedia.org/r/#/c/operations/software/tendril/+/594412/ https://gerrit.wikimedia.org/r/#... [09:30:31] sure. but i think that's a bigger thing to tackle, and is tracked by the task you filed [09:30:31] or the host wouln't even reboot [09:30:36] ah, ok [09:30:50] so for the /srv thing, the ideal is to just do an install [09:31:01] coplete wipe of / but keeping /src [09:31:16] /srv, is that the scope you are working with, or something else? 
[09:31:36] sorry, I may be missunderstanding the contex, ignore me if that's the case [09:31:37] let me back up slightly: at some point in the (hopefully near) future we'll have some flag somewhere we can toggle to say this host should reimage from pxe [09:31:45] yep [09:31:58] in the meantime, we want to manually use netboot.cfg to say if a host should reimage or not [09:32:12] but if enabled, it shoud keep /srv at all times... except if it is a new host [09:32:18] so we should have 2 recipes [09:32:24] if we remove the no-srv-format case, all db hosts will default to waiting for human input before destructive actions [09:32:25] "reimage keeping /src" [09:32:33] and "reimage fully (new hosts only)" [09:32:36] and we don't need to maintain this pattern of hosts [09:33:09] I personally do not like that [09:33:27] I think it should even fail complately or go though completely [09:33:44] what does it do if it fails completely? [09:33:50] does it reboot? [09:33:54] not touch disk at all [09:34:07] kormat: right now, if it fails, it keeps waiting for an human to reboot it [09:34:08] prevent reboot if possible, but not sure that is possible [09:34:09] or does it also wait for human input? [09:34:15] marostegui: right [09:34:25] we have to count that reboot will always be possible [09:34:34] but we should eliminate the human factor too [09:34:35] so from a safety point of view it's equivalent [09:35:03] well, the current system requires 2 people to ok a reimage, a deployer and a reviewer [09:35:19] I think that is a feature, let me give you an example [09:35:40] jynus: whereas with what i'm talking about someone could reboot a machine off pxe, and do an install manually? is that the case you're concerned about? [09:35:51] a person wants to reimage db1101, types db1001 which has super-important data, and manually reimages db1001 [09:36:53] there should be some kind of conscient decision [09:37:01] I am getting a bit lost, I think we are having now two very similar threads, but I am not following any of them [09:37:04] that can be reviewed [09:37:07] he he [09:37:13] sorry, moving away [09:37:20] jynus: when i fix the partman recipe, there shouldn't be any more manual interaction with reimaging [09:37:38] ok [09:37:48] (fresh installs are maybe an exception? don't know there) [09:37:51] kormat: can you expose what you have in mind (or questions) and once done, we can ask you questions or discuss ideas? [09:37:59] yes [09:38:27] sometimes mock patches work too, to start a discussion [09:39:09] marostegui: sure. let me quickly put a description together (probably in a paste) [09:39:20] sounds good [09:39:24] thank you [09:39:44] I think that'll help to have a clear view of what you have in mind (at least to me) [09:39:52] and then we can discuss ideas and ask questions :) [09:43:09] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Tchanders) @Marostegui Thanks for looking into those wikis - even though we know it only affects a few, it's help... [09:46:42] marostegui, jynus : https://phabricator.wikimedia.org/P11138 [09:47:25] kormat: thanks - reading [09:47:27] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) >>! 
In T251188#6107915, @Tchanders wrote: > @Marostegui Thanks for looking into those wikis - even th... [09:49:17] kormat: so by default any host will fail to install unless told otherwise? [09:49:30] yes, same as currently [09:49:49] and if specified, it will do a full reimage including /srv? [09:50:22] marostegui: we will presumably have srv-format.cfg and no-srv-format.cfg (or similar). we'd use no-srv-format for reimages, [09:50:33] and srv-format for fresh installs, or where we want to wipe /srv for whatever reason [09:51:22] Right, so by cases: 1) by default fail on the install 2) have an specific partman for full reimage including /srv 3) have an specific partman to reimage without including /srv? [09:51:23] is it me or https://dbtree.wikimedia.org/ is now blue? [09:51:36] marostegui: exactly [09:51:38] :-/ [09:51:40] jynus: yes [09:51:43] he he [09:51:48] why on on tendril? [09:52:00] is it on the app, or is it a default? [09:52:20] marostegui: though technically 1) would be "block in the partitioner for human input" rather than an explicit fail [09:52:23] maybe it was always blue and we just "fixed it" [09:52:34] kormat: right [09:53:15] kormat: so if you want to reimage a host, you basically add a line with the partman recipe you want (either full or avoid /srv) one, right? [09:53:17] ideally T251416 would get fixed, and it would never even make it into the installer [09:53:19] T251416: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 [09:53:22] I still don't understand fully [09:53:28] marostegui: yes [09:53:30] "Drop the case entry from netboot.cfg. This will cause all db hosts to block on human input by default." [09:53:58] does this mean have no recipe by default? [09:54:11] yes [09:54:53] problem is foundations will not like that [09:55:10] "all hosts should have the recipe to reimage them" [09:55:32] that sounds like a foundations problem :P [09:55:41] no, it is our problem :-D [09:55:45] anyway, ignore that [09:55:47] for now [09:55:54] I think it is ok-ish [09:56:04] "ok-ish" \o/ [09:56:05] I am not convinced it is better than a hard failure [09:56:14] I think it is sane [09:56:28] I just have to be convinced that it is better than other alternatives [09:56:43] specially on edge cases [09:56:58] "I forgot to remove the change after reimage" etc. [09:57:17] jynus: that stuff should be handled by netbox or whatever [09:57:24] oh, I agree [09:57:29] so this is short-term? [09:57:38] you should have a dropdown to select the next single-boot target [09:57:39] you should have started there [09:57:54] jynus: well, i hope so. but i've no idea how long it will take for T251416 to be fixed (if ever) [09:58:00] yeah, ignore that [09:58:12] because without that, we're left with this [09:58:13] I mean this is "as long as ^is not fixed", right? [09:58:18] yep [09:58:26] ok, that makes me more open to change [09:58:43] specially if it is safe and make people more productive [09:59:15] my hope is to have a cookbook for this, which will require you to specify fresh install or reimage (and ask if you're really really sure), [09:59:28] there is some things to discuss [09:59:35] it would then poke the flag in netbox, drain the machine, do the reimage, etc etc [09:59:43] (that's a far-away target) [09:59:44] which is coordination [09:59:57] now a days a change helps communicate "I am reimaging X" [10:00:04] would that work now too? 
[10:00:16] I guess so because it would be added [10:00:26] jynus: with my proposal? yes. [10:00:26] it won't change I think [10:00:30] however [10:00:49] it depends on having a working (real) /srv-no-format [10:01:05] right? [10:01:19] or do you intend to do that change first? [10:01:19] jynus: it's not a hard requirement, but ideally yes [10:01:40] so the downside [10:01:47] I am just discussin, eh? [10:01:47] the CR that fixes srv-no-format needs to change netboot.cfg to not use it [10:02:02] there is still room for manual error and no 2 person check [10:02:33] jynus: someone has to make the CR, someone else has to review it; is that insufficient? [10:02:35] until a long term fix is done [10:03:02] ok, so you assume that when you want to reimage, nobody will try to do it manually from now on [10:03:08] which is a fair assumption [10:03:25] (another reason i'd love to have this automated - i forgot to revert the change that made es2025 reimagable yesterday :/) [10:03:49] sure, but the same thing would happen here, only worse, right? [10:03:55] you forgot to revert that change [10:04:05] but it would still need manual imput [10:04:17] which I know it is not ideal [10:04:18] jynus: sure. that issue is only solvable with a fully-automated pipeline [10:04:32] or, no. it's solvable if you can set a one-time boot target [10:04:42] well, we already have that [10:04:50] the problem is, it sometimes fails [10:05:01] (i guess a boot target that disables pxe boot in the bios would technically be one-time ;) [10:05:17] no, I mean that wmf-auto-reimage does a one time net boot [10:05:26] it is an ipmi command [10:05:33] jynus: from the machine-side, right? gotcha [10:05:39] i'm talking about from the pxe server side [10:05:40] but that is not the problem to solve [10:05:42] yeah [10:05:59] so in general, and this is my opinion, manuel is the one you have to convince [10:06:21] I am not amused, but I am not against it, as long as it improve someone's workflow [10:06:41] and all people agree on not doing manual reimages [10:07:28] I think we need to try to narrow scopes a bit. This solution would be already better than what we have and will reduce errors for sure, specially with all the regex and all that. 
[10:07:49] It gives us the same safety margin we have at the moment (no less, which is what could be worrying) [10:08:07] I would love T251416 to be solved, but that won't happen in 6 months [10:08:08] T251416: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 [10:08:13] And I think this is a step towards it [10:08:48] we'll still have the same issues as we have now (if we forget to remove the host from the file and all that) [10:08:53] actually, as long as we don't have a /no-srv-reimage, I think this is worse [10:09:11] I think for me that is a blocker [10:09:14] But if I understood correctly, by default the host will mimic that same behaviour [10:09:21] jynus: that's fine from my pov [10:09:30] jynus: i'm happy to block this change on fixing the partman recipe [10:09:31] because you will be using the same recipe [10:09:41] for reimages and for "blocking reimages" [10:09:45] and tha leads to human error [10:09:53] but the important thing is that if i fix the partman recipe (and don't change the name), then everything _will_ reimage by default :) [10:09:55] oh, I connected to db1001 [10:10:01] that is ok [10:10:10] I think the imporatnt part is the separation [10:10:15] that is very much not ok :) [10:10:34] anyway, i think we have enough of a consensus to move forward [10:10:46] jynus, marostegui : thanks for your input, it's very appreciated :) [10:10:59] kormat: but that is T251416 not on our scope [10:11:46] 10DBA: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 (10Kormat) N.B.: the case in netboot.cfg **must** be changed in the same CR that fixes no-srv-format.cfg, as otherwise all db hosts will reimage by default. [10:12:00] "**must** be changed in the same CR" [10:12:01] ^ added a comment to the partman fixing task [10:12:05] no [10:12:13] I only say to be a blocker [10:12:22] jynus: did you read the justification? [10:12:26] so there is a different workflow [10:12:47] then you didn't understood what I meant [10:13:21] What are we discussing now? [10:13:27] marostegui: unclear [10:13:36] all db hosts will reimage by default is ok [10:13:49] jynus: why is that ok? [10:13:54] jynus: that's not ok to me [10:14:34] what I ask is different workflow between "manually handle partitioning (error)" and "automated /srv keeping" [10:15:15] If I read again: https://phabricator.wikimedia.org/P11138 it says that by default hosts will wait (like they do now) by default, which is OK to me [10:15:30] that is what I call "reimage by default" [10:15:40] start the reimage, but not complete it :-D [10:15:42] I am not following I think [10:16:20] I am ok with https://phabricator.wikimedia.org/P11138 [10:16:22] "reimage by default" to me implies that reimaging completes [10:16:23] as written [10:16:24] so you want the host not to go into the installer by default if started with PXE? [10:16:30] kormat: yeah, same [10:16:40] but that includes "To reimage a host you create a specific case with the (fixed) no-srv-format partman recipe, which will make the installer run to completion." 
[10:17:06] aka "fixing no-srv-format", ofc not by default [10:17:22] jynus: I think that means if you want a host to reimage without /srv, you need to set an specific partman recipe to it [10:17:23] so all or nothing, I was just saying doesn't have to be the same CR [10:17:29] yes [10:17:33] and I am ok with taht [10:17:42] Ok, so that's clear too [10:17:47] but for me that is a dependency [10:17:55] Dependency on what? [10:18:11] on taking out hosts from netboot.cnf [10:18:34] the 2 checkboxoes there :-D [10:18:35] as in: removing the lines? [10:18:43] yes, that is what he is proposing [10:18:55] yeah, but what do you mean with a dependency? [10:19:06] do both checkboxes at the same time [10:19:09] aka [10:19:17] fix partman AND remove the hosts [10:19:21] not just remove the hosts [10:19:33] to prevent mistakes [10:19:34] you mean: if we remove that line the host should not complete the reimage in case it is done by mistake? [10:20:12] what he is proposing is: [10:20:25] remove hosts from netboot- that will make reimage stop at partitioning [10:20:41] fix no-srv-reimage so it reimages automatically [10:21:15] for reimages, no-srv-reimage will have to be added to netboot.cfg [10:21:15] yeah, I understand the proposal, what I want to understand is what is the dependency or blocker you are mentioning :) [10:21:48] not removing the hosts from netboot until T251768 is resolved [10:21:49] T251768: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 [10:22:02] because if it is done, we will get acostumed to reimage manually [10:22:31] we should do things full manually or fully automatic, so we change the workflow at the same time [10:22:47] not remove and "manually partition" [10:22:55] jynus: that's what my update to T251768 means [10:23:02] kormat: I got you [10:23:09] i think we're violently agreeing :) [10:23:12] I only clarified the part of the CR [10:23:20] not necesarilly the same :-D [10:23:25] but you got me I think [10:23:40] now we need manuel, which is the one you must convince :-D [10:23:54] but I am the one that I am not explaining myself [10:23:55] I am convinced of that too, so that's why I asked what were we discussing [10:23:57] ok [10:24:02] nothing then :-D [10:24:04] :-P [10:24:07] Because I thought we all were on the same page [10:24:08] going in circles [10:24:10] :-DDDD [10:24:11] i think the difference is: [10:24:18] I was not convinced at first [10:24:20] to be fair [10:24:29] jynus was saying we shouldn't remove the entry from netboot.cfg until >= partman fix [10:24:34] exactly [10:24:37] i'm saying we should do it at exactly the same time [10:24:50] but that is essentially the same [10:24:53] ha ha [10:24:56] yes [10:25:04] the only difference is if we do it in 2 steps, in between db hosts can auto-reimage [10:25:13] but the before and after states are identical [10:25:16] I don't like 2 steps, error prone [10:25:25] "ups I reimaged the wrong host" [10:25:31] but that is my opinion [10:25:41] kormat: but https://phabricator.wikimedia.org/T251768#6107956 states it won't be 2 steps [10:25:41] i was talking about the other order, [10:25:48] partman fix first, then netboot change after [10:26:09] marostegui: yes :) i'm just trying to explain why we had the above discussion [10:26:12] marostegui: which he added after my suggestion, so got me on board :-D [10:26:13] right [10:26:17] So many meta things already [10:26:20] anyway, we're all good :) [10:26:30] there is one last thing, which is external agreement, [10:26:44] so expect 
resistence [10:26:50] will only be able to help as much [10:27:01] that's fine, i can live with that [10:27:14] lets stop discussin [10:27:16] anyone who objects can take it up with T251416 :) [10:27:16] T251416: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 [10:27:22] kormat: +1 [10:27:48] so I am guilty of always thinking about the big picture [10:28:06] jynus: we're a lot alike on that :) [10:28:06] that is sometimes good, sometimes creates bikeshedding [10:28:26] so feel free to slap me and say "narrow scope" [10:28:44] note also the typical phabricator reader has very low attention span [10:29:01] and we sometimes are very casual phabricator readers [10:29:23] marostegui: https://gerrit.wikimedia.org/r/c/operations/puppet/+/594446 for you [10:33:09] marostegui: ty. also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594449 [10:35:31] uh, something weird happened [10:35:39] i don't think i merged that [10:36:06] you normally need to wait for CI to verify the change or force it (not nice) [10:36:41] right - but i don't think i did anything, [10:36:44] from gerrit's output you force a rebase? [10:36:45] and gerrit says it's changed [10:37:05] Kormat [10:37:06] 12:34 PM [10:37:06] ↩ [10:37:06] Uploaded patch set 2: Patch Set 1 was rebased. [10:37:09] i am confuse [10:37:50] CommitDate: 2020-05-05 12:32:24 +0200 [10:37:52] vs [10:37:55] oh crud [10:37:57] CommitDate: 2020-05-05 12:34:18 +020 [10:38:20] ok. i think i simply got confused between my 2 gerrit CRs. like a _pro_. [10:39:18] so if everyone can simply forget the last 4 minutes happened, that'd be great! :) [10:39:32] (no? damn) [10:39:34] XD [10:40:34] Publicly logged: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-databases/ [10:40:54] 404 \o/ [10:41:33] it has changed: http://bots.wmflabs.org/logs/%23wikimedia-databases/ [10:41:51] I was about to do the same [10:41:52] damn you [10:42:09] kormat: you are welcome [10:42:19] * kormat plots revenge [10:42:23] there is also http://bots.wmflabs.org/browser/index.php?display=%23wikimedia-databases [10:42:36] more friendly for old dates [10:44:02] also, BTW now we are reimaging individual servers [10:44:37] when a mass reimage is needed, I used to reimage many at the same time (or at least enable them, never all technically at the same time) [10:44:43] which makes the process less tedious [10:45:11] e.g. "all db108X servers" [10:45:47] ack. i might do that at some point, but so far i'm having a little bit of difficulty keeping track when i'm doing more than one at a time. there's maybe 40 steps in the process at the moment [10:45:52] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui) [10:53:28] marostegui: is it ok to directly depool es1024? [10:54:48] for es hosts it is important to have at least 2 replicas all the time [10:55:11] I think that would mean giving weight to the master other than 0 [10:55:11] kormat: you can depool es1024 but give weight on the master, maybe just 50 [10:55:34] so that it also serves some of the read load? [10:55:35] for context, nothing would break immediately [10:55:42] kormat: yep [10:55:57] but if there is connection errors to the only replica, we logged some errors in the past [10:56:00] kormat: in case the other slave gets some overload, things will also be able to reach the master [10:56:02] ok. 
and there's only 3 hosts in equad. gotcha. [10:56:15] yeah, only applies to those [10:56:37] also in general, be gentler, those are big servers with hds [10:56:43] so need more love :-D [10:56:57] almost all others are with ssds [10:57:21] alright :) [10:57:25] marostegui: please check the diff on cumin1001 [10:57:58] look ma, I am a frontend develper now: https://gerrit.wikimedia.org/r/c/operations/software/dbtree/+/594457 [10:58:02] kormat: checking [10:58:26] kormat: +1 [10:59:45] you get a +1, you get a +1 you get a +1 https://media1.giphy.com/media/xT0BKqB8KIOuqJemVW/200w.webp [11:11:40] there's a lot of wikiadmin entries in processlist: https://phabricator.wikimedia.org/P11142 [11:11:42] is that expected? [11:13:01] (that's es1024) [11:15:23] checking [11:15:35] right, the dumps [11:15:47] let's check with apergos if it is ok to reboot it [11:15:52] normally it is ok [11:15:56] but check with them [11:33:10] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui) [11:34:38] 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui) [11:50:02] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['es1024.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202005051149... [11:51:54] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10Schema-change: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103 (10Marostegui) @Ladsgroup to make this even more interesting, I just realised that db1105:3311 doesn't have any of the `tmp_` indexes. I am going... [11:53:05] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10Schema-change: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103 (10Ladsgroup) >>! In T206103#6108179, @Marostegui wrote: > @Ladsgroup to make this even more interesting, I just realised that db1105:3311 doesn't... [12:13:22] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1024.eqiad.wmnet'] ` and were **ALL** successful. [12:46:58] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10Schema-change: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103 (10Marostegui) I found a few queries only, and the query plans are very similar. The query time with and without the index almost doesn't change, o... [12:49:42] marostegui: es1024 is ready to resume its duty. is it ok to undo the weight/pooling changes from earlier in a single step? [12:50:06] kormat: no, take sometime I would suggest -p25, -p50, -p75 -p100 [12:50:11] and the master to be removed after that [12:50:18] the master's weight I mean [12:51:48] grand, on it. [12:51:53] thank you [13:04:14] nuke the master! [13:04:50] on it! [13:10:34] yo! [13:10:47] it's busy in here! 
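As an aside, confirming that the wikiadmin entries from P11142 really are the dump run (and how long they have been there) only takes a processlist query; a sketch along these lines:

```sql
-- Who is connected to es1024, how many connections per user, and how long has
-- the longest one been running? The dump run shows up as the wikiadmin user.
SELECT user, COUNT(*) AS connections, MAX(time) AS longest_seconds
FROM information_schema.processlist
GROUP BY user
ORDER BY connections DESC;
```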
[13:11:09] so let me show you the diagram, which will simplify much of the later discussion [13:11:41] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10MediaWiki-Special-pages: Optimize recentchanges queries - https://phabricator.wikimedia.org/T251885 (10Marostegui) [13:12:00] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10MediaWiki-Special-pages: Optimize recentchanges queries - https://phabricator.wikimedia.org/T251885 (10Marostegui) 05Open→03Stalled Stalling as we might get help with this in a few weeks. [13:12:02] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10Schema-change: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103 (10Marostegui) [13:12:12] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10MediaWiki-Special-pages: Optimize recentchanges queries - https://phabricator.wikimedia.org/T251885 (10Marostegui) p:05Triage→03Medium [13:12:21] XioNoX: https://phab.wmfusercontent.org/file/data/6spjtaxe3sfs7nzkgjuf/PHID-FILE-3euio526kenfyricxcw3/backup_workflow.png [13:13:08] ok! [13:13:16] all of those are 10G hosts [13:13:37] and the idea is to cross-replicate all backups to the other site [13:13:41] in numbers [13:14:20] that's a bit scary :) [13:14:34] that is around 16TB each week [13:15:04] well, that is the part why we tell you [13:15:21] if it wasn't scary and it was 1 KB we wouldn't be having this conversation :-D [13:15:34] speak your mind or ask questions :-D [13:15:51] yeah, doing some maths [13:16:18] it is not homogeneus during the week [13:16:30] first, is it possible to rate limit the transfers? [13:16:32] but there is some control we can do about frequency, caps, etc. [13:17:18] yes, there it is "Maximum Bandwidth Per Job" option [13:17:30] now the question is how much that is needed [13:17:37] 1 job is 1 transfer from X to Y ? [13:17:39] as we have 1Gbit clients for the most part [13:17:45] and we have limited concurrency [13:18:03] I think current concurrency is 2 jobs per pooll [13:18:04] 1G or 10G? [13:18:08] jynus: have you considered bittorrent? /s [13:18:18] 10DBA, 10Growth-Team, 10MediaWiki-Recent-changes, 10Schema-change: recentchanges table indexes: tmp1, tmp2 and tmp3 - https://phabricator.wikimedia.org/T206103 (10Marostegui) @Ladsgroup right now this is the situation with both recentchanges hosts on `enwiki`. db1099 only has `tmp_3` index db1105 has neith... [13:18:20] most origin clients are 1G [13:18:30] because well, backups [13:18:36] so origin are dbprovxxxx ? [13:18:43] no, all backups [13:18:50] dbXXXX [13:18:56] gerrit, netbox [13:19:13] those write directly to the sd daemon "file storage" [13:19:15] so those are not on the diagrams [13:19:46] they write to the local DC backupxxx ? [13:20:01] so 2 separate things [13:20:13] database backups vs other backups [13:20:24] the digram is database backups only [13:20:46] ok, the scope here is intra DC or inter DC transfers? [13:20:50] regular backups would be the same, except instead of dbprov, it would be each configured service [13:21:09] I am guessing you are interested in inter [13:21:26] but I wonder if you worry about intra too because of that question? [13:22:15] Interested in both, but the codfw-eqiad link is the main bottleneck [13:22:19] ok [13:22:33] so let me summarize current state of non-db backups [13:22:45] network part mostly [13:23:08] backup daemons read data and send it encrypted to bacula storage [13:23:17] as the active one is backup1001 [13:23:28] so their local DC one, right? 
[13:23:33] that means that everybody, including codfw hosts, send it to backup1001 [13:23:36] ah [13:23:38] ok [13:23:40] (note current setup) [13:23:46] because you hit the point :-D [13:23:57] now, for redundancy [13:24:17] backup1001 send a copy of their data to backup2001 [13:24:27] like every week or so [13:24:41] you can check backup1001 for the input and output usage [13:25:05] most intensive this week as it is when full backups run [13:26:04] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=8&fullscreen&orgId=1&refresh=5m&var-server=backup1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&from=1586093159044&to=1588685159045 [13:26:27] backups are programmed to run at non-peak hours, so 4 am or so [13:26:59] so things that change now and in the future [13:27:16] interesting, the RX spikes are much higher than TX? [13:27:29] yeah [13:27:29] and more frequent [13:27:43] not everything is replicated [13:27:53] also different policies [13:27:57] etc [13:28:15] I am going to guess that the largest spikes are dbs [13:28:22] that are not replicated with bacula's method [13:28:34] but are more than 50% of data [13:28:40] ok [13:28:56] so dbs followed the same pattern so far [13:29:06] but the graph I showed you earlier [13:29:27] changes a bit the setup for I think better redundancy [13:29:37] backups are generated locally, in both dcs [13:29:47] but the plan is to send it to the other dc [13:29:55] makes sens [13:29:57] less efficient, but more resilient [13:30:23] maybe we will end up with bacula doing that too, not sure yet [13:30:31] rather than bacula's own replication [13:30:31] so host -> localbackup -> remotebackup [13:30:37] yep [13:30:43] that is the diagram above [13:30:54] also they need to be compreseed/postprocessed [13:31:02] so those are the intermediate dbprov [13:31:02] that sounds better to me than sending codfw to eqiad and then back to codfw :) [13:31:06] cool [13:31:08] yeah [13:31:22] to be fair, database backups is new [13:31:38] regular backups I tried to keep the same setup "if it is not broken, don't fix it" [13:31:51] so we will see with databases, and maybe later migrate backups to a similar model [13:32:06] backups here == non-db backups [13:32:07] so because we're adding DB backups we have to backup more things [13:32:09] in any caase [13:32:13] yes [13:32:16] cool [13:32:22] we just enable external store backups [13:32:33] those are much larger than old database backups [13:32:37] these are content backups [13:33:14] think old backups were only around 3TB logic + 6 snapshots [13:33:16] how fast do those cross DC transfer need to happen? Once a week? and if it takes many hours or more I guess it's fine? [13:33:24] new ones are 24 TB or so [13:34:13] so speed on backup is not a high importance [13:34:17] speed on recovery is [13:34:25] but of course [13:34:31] we are limited to the frequency [13:34:51] backups must happen before the new ones start :-D [13:35:01] haha of course [13:35:15] at the moment, the bacula part takes around 3 hours [13:35:25] not counting the external store [13:35:36] are they rate limited so far? [13:35:46] technically no [13:36:05] but jobs are not necesarilly fast and are limited in number of jobs at the same time [13:36:22] compression and encryption has to happen, etc. 
[13:36:49] I would like for you to look at past backup periods and tell me "this is worring" [13:36:57] or "put a cap of X MB/S" [13:37:01] or any guidance [13:37:13] or just keep an eye and tell me not to do stuff [13:37:27] specially now that we are going to go ino that 16TB/week of transfer [13:37:48] of course, escalonated (jobs are scheduled on different days, etc.) [13:38:22] for example, we could run one job of the ES hosts and I can abort or show you how to abort if it creates issues [13:38:29] and we can tune [13:38:38] not sure, tell me how you would go about this [13:39:00] so far there is nothing worrying [13:39:07] he [13:39:26] what is your theoretical worries? [13:39:36] *are [13:39:47] maximizing link between dcs? [13:39:52] do you have number to share? [13:40:02] if there is a possibility to not run all backups at the same time, spread it one after the other, and rate limit them to something like 3Gbps, it would be a good start [13:40:04] any tip? [13:40:17] " if there is a possibility to not run all backups at the same time" that already happens [13:40:33] that is something we want anyway for hd performance [13:40:44] worries would be any core link saturation (between routers or between switches [13:40:45] ) [13:40:52] uf 3Gbits? [13:40:57] we would never reach that [13:41:01] but we have alerts for that, and looking at graphs we're so far good [13:41:15] ok [13:41:29] so sata is 6Gbits [13:41:44] and these are HDs, we don't even reach that even theoretically [13:42:20] good :) [13:42:36] those hosts have lot of disk, but they are not DB-like hosts [13:42:49] ok, I will be pinging you on big milestones [13:43:08] but I think we should be very far from having network issues [13:43:42] feel free to also ping me if you see bad perf due to some of my hosts [13:43:59] for sure yeah [13:44:44] from regular hosts to backup hosts within a DC are they also run one after the other? [13:44:50] yes [13:44:54] I can show you graphs of that [13:45:07] but in those cases, backups is slow [13:45:09] I'm guessing at some point we will have to run some of them in parallel if there are too many hosts to backup, no? [13:45:13] and from 1G hosts [13:45:21] we have a paralelism of 2 [13:45:22] good point [13:45:34] again, disk (iops) is the issue [13:45:42] that works for me :) [13:45:56] lots of paralelism hurts disk performance, which is our biggest bottleneck [13:46:04] or in the case of logical backups, cpu [13:46:14] let me show you graphs anyway [13:46:26] sure [13:47:00] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=8&fullscreen&orgId=1&from=1588498469268&to=1588686392529&var-server=dbprov1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql [13:47:11] first spikes are a normal backup [13:47:29] the spaces in between are backup processing (cpu, not trasmission) [13:47:47] so paralelism of 2, but to different host even [13:48:11] ok! 
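Some back-of-the-envelope numbers behind "we would never reach that": 16 TB of cross-DC copies per week (decimal TB) averages out to roughly

```latex
\frac{16\,\mathrm{TB} \times 8\,\mathrm{bit/byte}}{7 \times 86400\,\mathrm{s}}
  \approx \frac{1.28 \times 10^{14}\,\mathrm{bit}}{6.05 \times 10^{5}\,\mathrm{s}}
  \approx 2.1 \times 10^{8}\,\mathrm{bit/s} \approx 210\,\mathrm{Mbit/s}
```

Even squeezed into a single 24-hour window that is only about 1.5 Gbit/s, and, as noted above, the sending hosts are bounded by spinning-disk throughput well before they saturate their links, so the 3 Gbit/s figure is comfortably out of reach.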
[13:48:19] yeah I don't see any network risk so far [13:48:30] I am not sure why the second is higher [13:48:30] within or between DCs [13:48:34] I was going to say bacula [13:48:41] but that shoudl be tx [13:48:57] note also the times [13:49:06] they are scheduled on purpose at night [13:49:12] this also for db load [13:49:21] that should help something too :-D [13:49:45] ok, as I said, when all backups are setup, will ping you again [13:50:11] as the content ones will be as large as all backups before, but we'll see [13:50:40] but I think we have different definition of "lot of network usage" 0:-D [13:51:04] there is 1 last question, a bit unrelated [13:51:54] I am sure there is some QoS on network, hard limits, etc., understandibly [13:52:27] I wonder how easy/useful/needed would be to disable some of these in case of a total disaster scenario [13:52:44] "all wikis are down and have to recover from backups" [13:53:23] feel free to tell me if my question is stupid [13:54:43] there is no QoS or limitations other than actual interface speeds [13:54:51] oh [13:55:12] ok, then I can see your worries, at least theoretically [13:55:14] :-D [13:55:36] but also answer my question :-D [13:55:54] if everything was down, we would have free highway to recover :-D [13:56:07] thanks, XioNoX that was very helpful [13:56:19] I hope it was somewhat interesting for you [13:56:35] to expose you to the backups world [13:56:40] at our scale QoS is a pain and the answer is usually to get bigger pipes [13:56:53] not complaining at all [13:56:56] :-D [13:57:00] yeah for sure, I understand better what's up with backups [13:57:13] thanks! [14:23:20] sadly, our little tendril troubles meant than some backups finished but couldn't report to the database as such [14:35:27] es2025.codfw.wmnet is giving me the same errors as labsdb1009 "broken table detected" [14:35:37] so probably some views having issues [14:36:30] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Jclark-ctr) a:03Jclark-ctr [14:40:20] the reason why it is having issues is that es2025 was upgraded to 10.4 [14:40:36] there is an extra grant, "DELETE HISTORY" [14:42:50] wow, that worked? [14:51:41] FYI, very interesting for those handling reimages: https://wikitech.wikimedia.org/w/index.php?title=MariaDB&type=revision&diff=1865129&oldid=1864948 [15:06:27] how does an extra grant breaks mydumper? [15:06:52] Ah I see the diff [15:06:54] Interesting [15:07:07] I will loop all our 10.4 and delete that grant tomorrow [15:25:09] 10DBA, 10Dumps-Generation, 10MediaWiki-extensions-CodeReview, 10Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (10Jdforrester-WMF) 05Open→03Resolved https://dumps.wikimedia.org/other/ -> https://dumps.wikimedia.org/other/codereview/20200428/ Thanks,... [15:29:53] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Niharika) Err, @Marostegui Are you sure P11137 is the right link to the paste? I am in WMF-NDA group but I don't... [15:32:56] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10jcrespo) @Niharika try now, access was very restricted before. I have made it a bit less (but still private). 
[15:34:12] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Niharika) >>! In T251188#6109123, @jcrespo wrote: > @Niharika try now, access was very restricted before. I have... [15:37:02] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10jcrespo) I've added you to the paste manually, if that doesn't work either, I am as confused as marostegui! [15:40:08] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Niharika) @jcrespo Hmm, doesn't work. {F31803604} [15:41:23] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10jcrespo) last try! Otherwise I will just leave it on production. [15:44:13] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Niharika) >>! In T251188#6109186, @jcrespo wrote: > last try! Otherwise I will just leave it on production. Got... [15:45:20] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) That's weird....I created the past with WMF-NDA as usual :-/ Did something change on the paste creati... [15:46:44] I don't get the paste issue, the WMF-NDA description says: Create a paste that is visible only to members of WMF-NDA, plus any subscribers you CC. [15:47:08] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10jcrespo) Just in case: ` root@mwmaint1002:/home/niharika29$ ls -lhsa P11137.txt 4.0K -rw-r--r-- 1 niharika29 wik... [15:48:14] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment, 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10jcrespo) [15:49:19] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Niharika) [15:49:26] jynus: can you see this? https://phabricator.wikimedia.org/P11154 [15:53:00] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) >>! In T251188#6109111, @Niharika wrote: > Err, @Marostegui Are you sure P11137 is... [15:55:13] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Niharika) >>! In T251188#6109302, @Marostegui wrote: >>>! In T251188#6109111, @Niharika wrote:... 
[15:57:06] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Marostegui) That would explain why you were not able to see it until you were CC'ed. Only WMF-N... [16:20:46] 10DBA, 10DC-Ops, 10Operations, 10Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10colewhite) p:05Triage→03Medium [20:23:15] 10Blocked-on-schema-change, 10DBA, 10Anti-Harassment (The Letter Song), 10Patch-For-Review: ipb_address_unique has an extra column in the code but not in production - https://phabricator.wikimedia.org/T251188 (10Niharika) With @SPoore's help we managed to clear all the duplicate blocks. CC @Tchanders @dma...
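One loose end from the afternoon: removing the extra DELETE HISTORY privilege that the 10.4 hosts picked up, as planned above, is a one-liner per account. A sketch only — the account name and host pattern below are placeholders, not the real dump user:

```sql
-- See whether the account picked up the new MariaDB 10.3+/10.4 privilege on upgrade:
SHOW GRANTS FOR 'dump'@'10.%';          -- placeholder account

-- Drop just that privilege so the grants match the 10.1 hosts again:
REVOKE DELETE HISTORY ON *.* FROM 'dump'@'10.%';
```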