[07:13:40] DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3109079 (Marostegui) Open>Resolved It is all good now, thank you Chris! ``` root@db1070:~# megacli -PDRbld -ShowProg -PhysDrv [32:10] -aALL Device(Encl-32 Slot-10) is not in rebuild process Exi...
[07:29:22] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU broken - slave lagging - https://phabricator.wikimedia.org/T160731#3109087 (Marostegui)
[07:36:37] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109103 (Marostegui)
[07:42:25] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109104 (Marostegui) I have manually forced a BBU learn cycle and it is now looking fine: ``` root@db1048:~# megacli -AdpBbuCmd -BbuLearn -aALL -NoLog Adapter 0: BBU Learn Succ...
[07:42:31] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109105 (Marostegui) Open>Resolved a:Marostegui
[07:53:52] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3109112 (Marostegui) db2065 and dbstore2001 are done: ``` root@neodymium:~# mysql --skip-ssl -hdb2065.codfw.wmnet commons...
[07:54:48] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109087 (jcrespo) Do you think we should force a learning cycle to db1047 T159266 ?
[07:54:49] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3109117 (Marostegui) dbstore2001 and db2065 are done: ``` root@ne...
[07:55:40] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109118 (Marostegui) I just tried - we will see!
[07:56:38] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109119 (Marostegui) But db1047 one has a different (and more worrying error) for BBU a1: ``` Battery State: Failed ```
[08:19:15] DBA, Labs, MediaWiki-extensions-Babel: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3109132 (jcrespo) I've checked the babel table and it is being replicated to labs, just not exposed (needs view changes). I would suggest to labs team to ask for the ok from legal and/o...
[08:31:05] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109156 (Marostegui) db1047's BBU is acting weirdly It goes from Failed -> Charging -> Failed It is acting very weirdly, it has gone from ``` Relative State of Charge: 4 % Charge...
[08:49:11] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3109174 (Marostegui)
[08:49:13] DBA, Patch-For-Review: s5: db1070 not using file per table - https://phabricator.wikimedia.org/T157931#3109171 (Marostegui) Open>Resolved db1070 has been up for 24h now without any issues and receiving production traffic, so considering this resolved.
[08:50:14] DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2360296 (Marostegui) Next week I will backup db1082, db1087 and db1092, reimage and reclone them from db1070 as it is now file per table (T157931)
[09:05:46] DBA, Analytics, Analytics-EventLogging, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3109196 (Marostegui) And it is not only being recreated, but used as of today: ``` root@EVENTLOGGING m4[log]> select t...
[09:22:44] DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3109219 (Marostegui) Thanks for the list of tables. From the DBA side, this would be the only thing to execute really (assuming you just want to add the "_ 15423246" to...
[09:25:59] marostegui, jynus when you're around I'd like to review one command from the DC switchover wiki of last year
[09:26:26] ok
[09:26:49] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_3_-_lock_down_database_masters.2C_cache_wipes
[09:27:28] yes?
[09:27:42] I don't remember why the 2 and not, instead of a mysql_group = core
[09:27:50] given that the one in the wiki includes labsdb1005.eqiad.wmnet
[09:28:03] (maybe it didn't at the time, I cannot remember)
[09:28:46] that may work now, probably
[09:28:54] labsdb1005 wasn't ok at the time
[09:29:40] in shards that must be s1-s7, x1, es2 and es3
[09:29:41] * marostegui reading
[09:30:06] maybe that is easier (although that could change in the future)
[09:30:49] so can you confirm that labsdb1005 should NOT be included in the selection, right?
[09:30:51] also "--defaults-file=.my.cnf" has to go away and --skip-ssl has to be introduced
[09:30:58] it shouldn't
[09:31:00] ok
[09:31:03] there is no labs master
[09:31:13] and no labs in codfw
[09:31:39] yes, no labs master on codfw
[09:31:56] ok, thanks for the info!
[09:33:12] I think G@mysql_group:core doesn't include x1
[09:33:26] oh, it does
[09:33:53] yep
[09:33:59] just checked
[09:34:02] maybe it didn't at the time
[09:34:41] I have updated https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_3_-_lock_down_database_masters.2C_cache_wipes
[09:34:46] I think I added them for s1-s7 for the switchover, probably didn't for x1, really cannot remember
[09:34:50] ok, thanks
[09:36:05] I have to delete the puppet stuff on dbs, too
[09:37:17] yes, that I know
[09:42:36] I have edited to the latest version
[09:43:42] there is a worse problem (more theoretical, but I think important for automatization)- the scripts use puppet as the source of truth
[09:44:40] well, puppetization of salt grains- which could be even more different than mediawiki's actual state
[09:45:26] I am not saying it is a problem for this switchover-but it definitely shouldn't stay like that in the future
[09:46:38] of course! also because if we end up deprecating salt we will remove the grains ;)
[09:46:59] are you going to put that using cumin?
[09:47:14] "put that"?
[09:47:29] aren't you doing a switch dc script?
[09:47:46] yes, it will be one of the tasks
[09:47:54] and yes, using cumin
[09:47:59] as a library
[09:48:01] ok, maybe I am confused
[09:48:18] why did you ask that- just as a check or to write that script?
[09:48:50] because I'm writing the task and wanted to be sure the selection was correct
[09:49:31] so a couple of extra comments, if that is useful
[09:50:22] we could get rid of "Deploy mediawiki-config with all shards set to read-only"
[09:50:22] sure
[09:50:37] how?
[09:50:38] in the latest deployed mediawiki
[09:50:50] <_joe_> that would be great
[09:50:51] because it does that if it detects masters as read only
[09:51:04] without throwing tons of errors?
[09:51:05] _joe_, I told you about that numerous times
[09:51:14] :-P
[09:51:17] <_joe_> jynus: but ^^ what volans said
[09:51:25] that is a great question
[09:51:34] <_joe_> won't that cause lag/errors?
[09:51:41] <_joe_> uhm well we could see in codfw
[09:51:44] lag no, because precisely
[09:51:54] no writes, no lag
[09:51:57] <_joe_> they store that data in apc, do they?
[09:52:09] <_joe_> no i mean user lag
[09:52:14] (pt-heartbeat always writes no matter read only)
[09:52:16] <_joe_> before mw decides it's readonly?
[09:52:20] as a failsafe
[09:52:39] <_joe_> or does mw test if the master is readonly on *every* request?
[09:52:39] now, if it is done "well" and it caches the errors
[09:52:47] I would ask the implementer (aaron)
[09:52:51] <_joe_> exactly :)
[09:52:53] and to be fair
[09:52:56] do a proper test
[09:53:05] <_joe_> oh, if you can understand his answer, that's great
[09:53:07] not because I do not trust him
[09:53:22] but because I think it has never been tested large-scale
[09:53:30] a full, large shard
[09:53:39] _joe_, :-)
[09:54:32] lol
[09:54:53] we could test on codfw
[09:55:06] deploy one shard on read-write
[09:55:16] try to do edits
[09:55:26] <_joe_> yes
[09:55:36] <_joe_> can you guys work on that?
[09:55:44] on one side- edits are not that frequent
[09:55:47] <_joe_> me and volans are quite overwhelmed with stuff to do already
[09:55:53] sure
[09:56:13] <_joe_> well you can make a curl call to codfw's appservers to test it
[09:56:18] don't count on it working
[09:56:34] as in, 100% sure we will not have to deploy the patch
[10:00:40] DBA, Labs, Labs-Infrastructure: ug_expiry column of the user_groups table is not present on Labs - https://phabricator.wikimedia.org/T160686#3109297 (Marostegui) Just to clarify: Moved it in our internal DBA dashboard to the "not db team" as this is normally handled by Labs.
[10:36:15] on neodymium and sarin we have both wmf-mariadb101-client and the mariadb client packages from jessie installed (mariadb-client, mariadb-client-10.0, mariadb-client-core-10.1), shall we remove the jessie ones? I have no idea why they're installed, can't find something in puppet and also no package reverse dependencies on these hosts
[10:36:50] yes, I said I was going to delete those
[10:36:57] great, thanks
[10:37:23] then I'll skip the upgrades for those (since they were updated in jessie to .30)
[10:43:43] done
[10:46:16] thx
[10:51:56] DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109357 (Marostegui) Unfortunately, db1047's BBU looks totally broken, it is not making any sense in what it reports. Some places it says it is fully charged, some others don't,...
[10:57:31] DBA, Gerrit, Operations, Release-Engineering-Team, Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3109383 (Paladox) Upstream gerrit has added in built support for the MariaDB connector. So we won't need it pac...
[10:58:56] jynus: why did you remove a phase in the DC switchover wiki? that part will go away eventually but for now if it's staying has to be in a different phase than the RO/RW on the DBs
[10:59:09] all the steps in a phase can (and probably will) be run in parallel
[10:59:58] <_joe_> volans: we might need to reassess that
[11:00:50] yes but if we need the commit in mediawiki-config has to be done after the DBs are in RW, not in parallel
[11:09:22] why?
[11:10:42] because if we deploy first mediawiki-config as RW (assuming the fix we talked above is still not fully working) we will start getting all kind of errors in MW
[11:10:59] the 2 changes have to be done in sequence
[11:11:14] I disagree, but feel free to change it
[11:11:24] otherwise why phase 2 and 3 are separated?
[11:11:28] it's the same thing
[11:11:39] yes
[11:11:44] they should be merged
[11:11:52] the difference
[11:12:03] is that deploying a change takes 2 minutes
[11:12:10] to mediawiki
[11:12:20] while setting read-write/read only is instant
[11:12:39] I don't care if one takes 1s and the other 1m, if they need to be logically in sequence why putting them in parallel
[11:12:50] they do not need to be logically in sequence
[11:13:00] if they take more time, errors are a natural thing to happen
[11:13:25] it means db change took too much time, and errors should happen
[11:13:48] and again, that is assuming the automatic read-only mode doesn't work
[11:14:01] yes, I'm assuming that of course
[11:14:08] otherwise one step is just removed
[11:14:31] also, the step was badly named
[11:14:40] it said "failover masters"
[11:14:51] so I deleted such a step, there is no master failover
[11:15:13] feel free to add a new step, but it should have a proper name
[11:16:53] Blocked-on-schema-change, DBA, Wikidata, Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010#3109422 (Marostegui) a:Marostegui>None
[11:17:02] that's the wiki of last year, the name is the last thing. Why do you think that they don't need to be logically in sequence?
[11:17:39] you first tell MW to put a nice message that the site is RO and then put RO the DB, otherwise users will get errors/retries and log will get errors
[11:17:54] no logical problems happen when that is done out of order
[11:18:15] order = sequence
[11:18:43] well, then I would break down the other phases
[11:18:54] I wouldn't wipe caches on phase 3
[11:19:01] until masters are read-only
[11:19:54] sure, agree
[11:20:10] again, it is a wiki
[11:20:12] just edit
[11:20:19] I saw a wrong phase
[11:20:28] I corrected it by merging it
[11:20:39] I will not revert if you split it further
[11:20:50] ok, don't worry
[11:21:25] it said phase 6 - database master swap
[11:21:36] that is not a phase
[11:23:27] regarding " new site's read-only master/slaves are caught up", any suggestions on the best way to ensure that?
[11:23:36] querying the heartbeat table?
[11:23:40] no
[11:24:10] checking the binary position of the old master
[11:24:25] then doing gtid_pos_wait on the other master
[11:24:29] or similar
[11:25:34] ok
[11:30:31] jynus: also, nothing has to be done for pt-heartbeat right? It's already active/active from both masters
[11:31:05] yes
[11:31:11] great
[11:31:20] the variable has to change on mediawiki
[11:31:44] in fact, nothing technically should be needed for databases
[11:31:57] but we do not trust mediawiki not to write in read-only mode
[11:32:02] yeah
[11:38:03] even if it looks like I am doing nothing- I am exporting and doing row checks like crazy right now for T154485
[11:38:04] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[11:40:21] that may show some crazy spikes on: https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s2&var-role=master
[11:49:54] DBA: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3107862 (Marostegui) You currently don't have DROP privilege, which would give you TRUNCATE grants too. This is what you have for testreduce_vd ``` GRANT SELECT, INSERT, UPDATE, DELETE, ALTER ON `testreduce_vd`.*...
[12:31:13] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3109518 (jcrespo) Unless anyone else says so, I will reimage the old server on Monday. Last chance to check data and functionality works on the ne...
[12:47:10] DBA, Labs, Labs-Infrastructure: ug_expiry column of the user_groups table is not present on Labs - https://phabricator.wikimedia.org/T160686#3109542 (chasemp) Open>Resolved should be good to go, let me know if not
[13:25:26] Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3109616 (Marostegui) dbstore2002 and db2058 are done: ``` root@ne...
[13:26:11] Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3109619 (Marostegui) dbstore2002 and db2058 are done: ``` root@neodymium:~# for i in db2058 dbstore2002; do echo $i; mysq...
[14:20:15] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3109759 (Halfak) OK. Time to try to ping the larger set of people who have databases here. Here's the databases that match Phab users: * @dartar (dartar) * @drdee (diederi...
[14:21:47] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109766 (Marostegui) Open>stalled Let's block this as db1047 might be decommissioned soon as per: T156844
[14:32:39] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3109809 (Ottomata) Wow, old DB! decleramaul, nimish and rfaulk we can certainly get rid of.
[14:55:16] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109863 (Cmjohnson) @marostegui There are a few decom db's now I could swap out the bbu if you like or just proceed with the decom process. Let me know your prefe...
[14:58:01] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109864 (Marostegui) Hey @Cmjohnson! let's wait to see if that ticket keeps progressing for now, if the server is going to get decommissioned it would be just a was...
[14:59:22] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109866 (jcrespo) @Marostegui precisely on that ticket they are discussing when they will be able to decom it, and it is not going to happen per months as it looks....
[14:59:28] DBA: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109867 (ssastry) Open>declined I can work with that. I was mostly checking in case you guys had thoughts about this.
[15:02:19] DBA: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109872 (jcrespo) declined>Open a:jcrespo I think we could give truncate grants on that database to ssastry. He demonstrated a responsible usage of the database- we shouldn't put burdens to it if it ma...
[15:09:35] jynus: what did you mean before with doing gtid_pos_wait on the other master?
[15:09:50] volans, let me show you code
[15:09:56] it is difficult to explain in prose
[15:09:59] sure
[15:10:12] let me show you how mediawiki does it and you can be inspired or copy it, etc.
[15:10:35] volans, one sec that I finish something
[15:10:38] <_joe_> look, we're a couple of dumb developers
[15:10:45] <_joe_> just paste us the sql query
[15:10:47] ?
[15:10:54] <_joe_> and we will blindly use it in our code :P
[15:10:56] _joe_, I do not know by heart
[15:11:05] that is why I have to search it
[15:11:15] <_joe_> jynus: j/k ofc
[15:11:22] and I trust mediawiki code than my hacks
[15:11:25] *more
[15:11:33] ok, otherwise I can get the pos on the old and check that exec on the new is >= of that
[15:12:32] includes/libs/rdbms/database/DatabaseMysqlBase.php
[15:12:40] SELECT MASTER_GTID_WAIT($gtidArg, $timeout)
[15:12:58] so that waits up to $timeout seconds
[15:13:23] and returns a value depending on whether it waits successfully or it times out
[15:13:27] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109921 (Marostegui) >>! In T159266#3109866, @jcrespo wrote: > @Marostegui precisely on that ticket they are discussing when they will be able to decom it, and it i...
[15:13:44] the value to wait for is "SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos'"
[15:13:55] I can give you the whole line
[15:14:01] if you give me 5 minutes
[15:14:10] to finish 1 thing I am about to finish
[15:14:20] probably that's enough, thanks, finish your thing, no hurry
[15:14:27] I want to help
[15:14:32] I just want to finish this first
[15:14:50] I may even be able to send you a CR
[15:14:59] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109930 (Marostegui) >>! In T160691#3109872, @jcrespo wrote: > I think we could give truncate grants on that database to ssastry. He demonstrated a responsible usage of the database- we shoul...
[15:15:17] marostegui, "I had the impression that for db1047 it was going to be pretty fast"
[15:15:23] :-D
[15:15:47] I will believe it when I see it
[15:16:13] XDDD
[15:16:18] and I am not criticizing anyone- we wanted labs gone a year ago- and that is with us as blockers
[15:16:19] I said from the comments! :)
[15:16:32] imagine with us, analytics and research at the same time...
[15:20:23] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109945 (jcrespo) @ssastry check if the above patch is what you want^. Of course, with great power comes great responsibility :-) - which means if you drop a table (or the script does it for...
[15:20:31] ok
[15:20:35] I am with you now, volans
[15:21:49] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109947 (Ottomata) Given the responses so far, I think we will be able to decom it soon. But, we should wait a while (maybe a week) to collect more feedback to be...
[15:21:50] thanks, but I get the point
[15:22:03] have you considered integrating WMFMAriaDB.py on your scripts?
[15:22:06] I get the gtid pos on the eqiad masters
[15:22:20] it will be easier if python talks to python
[15:22:35] rather than going through a shell output
[15:22:47] especially to check 10 masters
[15:23:34] https://mariadb.com/kb/en/mariadb/master_gtid_wait/
[15:23:53] If the wait completes without a timeout, 0 is returned.
[15:24:03] If the timeout expires before the specified GTID position is reached, then the function returns -1
[15:24:12] yep, already tested
[15:27:00] DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109954 (Marostegui) Let's wait then to see how that ticket progress next week or so in order not to make Chris to replace it and then a few days later decom the se...
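Note: the catch-up check discussed between 15:12 and 15:24 (read gtid_binlog_pos on the old master, then run MASTER_GTID_WAIT on the new master, where 0 means caught up and -1 means the timeout expired) could be sketched in Python roughly as below. This is only an illustration under stated assumptions: the function names are hypothetical, the cursors are assumed to be DB-API cursors that return rows as tuples and accept `%s` placeholders (as pymysql's default cursor does), and it is neither the MediaWiki code nor the switchover script.

```python
def get_gtid_binlog_pos(cursor):
    """Return the GTID binlog position of the server behind `cursor`."""
    cursor.execute("SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos'")
    row = cursor.fetchone()
    if row is None:
        raise RuntimeError("gtid_binlog_pos is not available on this server")
    # SHOW VARIABLES returns (Variable_name, Value) rows.
    return row[1]


def wait_for_gtid(cursor, gtid_pos, timeout=10):
    """Wait up to `timeout` seconds for `gtid_pos` to be applied.

    MASTER_GTID_WAIT returns 0 when the position is reached and -1 when
    the timeout expires, so this returns True only in the first case.
    """
    cursor.execute("SELECT MASTER_GTID_WAIT(%s, %s)", (gtid_pos, timeout))
    (status,) = cursor.fetchone()
    return status == 0


def new_master_caught_up(old_master_cursor, new_master_cursor, timeout=10):
    """Hypothetical helper: has the new master applied the old master's writes?"""
    pos = get_gtid_binlog_pos(old_master_cursor)
    return wait_for_gtid(new_master_cursor, pos, timeout)
```

The position read from the old master is presumably only meaningful once that master is already read-only, since otherwise it keeps moving; the helper would then be called once per old/new master pair.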
[15:27:48] <_joe_> jynus: so actually while I think it is useful in general, in this specific case we want to parallelize connections to all masters
[15:28:05] <_joe_> doing so in python requires more time than parsing a bash output in python
[15:28:15] <_joe_> as cumin can already execute everything in parallel
[15:28:15] what?
[15:28:26] yes, I am not saying to use cumin
[15:28:31] *not use
[15:28:43] ah, I get you
[15:28:52] cumin does not have "native mysql execution"
[15:28:58] <_joe_> yes
[15:29:09] <_joe_> because parallelization is done by the transport
[15:29:11] buah ha ha, I am thinking badly
[15:29:20] something that volans will not like
[15:29:27] which is integrating it :-)
[15:29:33] <_joe_> cumin can just execute remote commands
[15:29:38] but not now
[15:29:49] <_joe_> actually, you can just create your own library and integrate it in single cumin tasks
[15:29:50] we can add a mysql transport if we see fit
[15:29:58] at some point ;)
[15:30:00] it is just it sounds in some cases a waste
[15:30:08] <_joe_> we'll show you when we have something more serious
[15:30:08] to connect locally to run mysql
[15:30:26] when we could just run mysql- not in this case
[15:30:34] <_joe_> yup
[15:30:35] for things like imports
[15:31:05] or to pass complex data messages
[15:31:19] anyway, you know I have more ideas than time to commit to them
[15:32:00] :)
[15:47:11] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110035 (ssastry) Thanks. That works! Agreed about power and responsibility. :-)
[15:50:37] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110043 (jcrespo) @ssastry sorry- I may not have been 100% clear: Question- does your data need backup, and which of your databases need it? We provide that service, but we need to know which...
[15:55:27] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110054 (ssastry) Ah, okay! We don't need backups of these databases. These are primarily used for testing and losing the contents is not catastrophic. But, if we decide otherwise in our team...
[15:56:54] DBA, Analytics, Analytics-EventLogging, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3110057 (Nuria) ping @Jdforrester-WMF looks like you need to remove the instrumentation that is sending events.
[16:02:37] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110074 (jcrespo) Thanks, I will deploy the change, please test that it works for you when done.
[16:53:43] DBA, Labs, Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3110246 (jcrespo)
[16:53:47] DBA, Labs, Tool-Labs, Tool-Labs-tools-Xtools: `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003 - https://phabricator.wikimedia.org/T133321#3110244 (jcrespo) Resolved>Open
[17:00:22] DBA, Labs, Tool-Labs, Tool-Labs-tools-Xtools: `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003 - https://phabricator.wikimedia.org/T133321#3110269 (Matthewrbowker) The cleanup job has been running successfully. I ran it manually, here is the output. ``` 16:54 [xtoo...
[17:01:14] DBA, Labs, Tool-Labs: u3532__ (=marcmiquel) table using 64G on labsdb1001 and 108 GB on labsdb1003 - https://phabricator.wikimedia.org/T133322#3110275 (jcrespo)
[17:02:39] DBA, Labs, Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3110283 (jcrespo)
[17:02:42] DBA, Labs, Tool-Labs: u3532__ (=marcmiquel) table using 64G on labsdb1001 and 108 GB on labsdb1003 - https://phabricator.wikimedia.org/T133322#2228158 (jcrespo) Resolved>Open labsdb1003 is now constrained, and one of your databases have >100GB in space. They look like simple copies of produc...
[17:11:13] DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3110330 (MaxSem) The new server works for me. The upgrade also resolved T145599. Thank you!
[17:18:55] DBA, Labs, Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3110371 (jcrespo)
[17:19:00] DBA, Labs, Tool-Labs, Tool-Labs-tools-Xtools: `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003 - https://phabricator.wikimedia.org/T133321#3110369 (jcrespo) Open>Resolved > What is the difference between labsdb1001 and labsdb1003? Does labsdb1001 correlate to...
[17:29:37] DBA: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3110389 (Marostegui) Finished running pt-table-checksum on frwiki. Differences found on: ``` Differences on db1030 frwiki.archive ``` ``` Differences on dbstore1002 frwiki.archive frwiki.page_props frwiki.wbc_entity_usage ```
[18:49:30] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110583 (jcrespo) Please test the changes and see if it works.
[18:59:20] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3110616 (GLCiampaglia) I am pretty sure my tables can be safely deleted. Thanks for the heads up! Giovanni
[19:33:57] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110658 (ssastry) Thanks! A test is running right now (and will probably take another day to finish), so I won't be able to test this right now. But, will try it out after the test run finishes.
[21:11:47] DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3111007 (ssastry) Open>Resolved I had to terminate the test for disk space reasons. I reduced the test corpus size .. and long story short .. i was able to run truncate successfully.
[22:58:25] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3111341 (Tbayer) I have been using `db1047` quite frequently for EventLogging queries as an alternative to `dbstore1002`, either because it was (at times) much faster, or in...