[07:13:40] <wikibugs_>	 DBA, Operations, ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3109079 (Marostegui) Open>Resolved It is all good now, thank you Chris!  ``` root@db1070:~#  megacli -PDRbld -ShowProg -PhysDrv [32:10] -aALL  Device(Encl-32 Slot-10) is not in rebuild process  Exi...
[07:29:22] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU broken - slave lagging - https://phabricator.wikimedia.org/T160731#3109087 (Marostegui)
[07:36:37] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109103 (Marostegui)
[07:42:25] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109104 (Marostegui) I have manually forced a BBU learn cycle and it is now looking fine: ``` root@db1048:~#  megacli -AdpBbuCmd -BbuLearn -aALL -NoLog  Adapter 0: BBU Learn Succ...
[07:42:31] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109105 (Marostegui) Open>Resolved a:Marostegui
[07:53:52] <wikibugs_>	 Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3109112 (Marostegui) db2065 and dbstore2001 are done: ``` root@neodymium:~# mysql --skip-ssl -hdb2065.codfw.wmnet commons...
[07:54:48] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109087 (jcrespo) Do you think we should force a learning cycle to db1047 T159266 ?
[07:54:49] <wikibugs_>	 Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3109117 (Marostegui) dbstore2001 and db2065 are done: ``` root@ne...
[07:55:40] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109118 (Marostegui) I just tried - we will see!
[07:56:38] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109119 (Marostegui) But db1047 one has a different (and more worrying error) for BBU a1: ``` Battery State: Failed ```
[08:19:15] <wikibugs_>	 DBA, Labs, MediaWiki-extensions-Babel: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3109132 (jcrespo) I've checked and babel table and it is being replicated to labs, just not exposed (needs view changes). I would suggest to labs team to ask for the ok from legal and/o...
[08:31:05] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109156 (Marostegui) db1047's BBU is acting weirdly It goes from Failed -> Charging -> Failed It is acting very weirdly, it has gone from ``` Relative State of Charge: 4 % Charge...
[08:49:11] <wikibugs_>	 DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#3109174 (Marostegui)
[08:49:13] <wikibugs_>	 DBA, Patch-For-Review: s5: db1070 not using file per table - https://phabricator.wikimedia.org/T157931#3109171 (Marostegui) Open>Resolved db1070 has been up for 24h now without any issues and receiving production traffic, so considering this resolved.
[08:50:14] <wikibugs_>	 DBA: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191#2360296 (Marostegui) Next week I will backup db1082, db1087 and db1092, reimage and reclone them from db1070 as it is now file per table (T157931)
[09:05:46] <wikibugs_>	 DBA, Analytics, Analytics-EventLogging, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3109196 (Marostegui) And it is not only being recreated, but used as of today: ``` root@EVENTLOGGING m4[log]> select t...
[09:22:44] <wikibugs_>	 DBA, Analytics-Kanban, Patch-For-Review: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454#3109219 (Marostegui) Thanks for the list of tables. From the DBA side, this would be the only thing to execute really (assuming you just want to add the "_ 15423246" to...
[09:25:59] <volans>	 marostegui, jynus when you're around I'd like to review one command from the DC switchover wiki of last year
[09:26:26] <jynus>	 ok
[09:26:49] <volans>	 https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_3_-_lock_down_database_masters.2C_cache_wipes
[09:27:28] <jynus>	 yes?
[09:27:42] <volans>	 I don't remember why the 2 and not, instead of a mysql_group = core
[09:27:50] <volans>	 given that the one in the wiki includes labsdb1005.eqiad.wmnet
[09:28:03] <volans>	 (maybe it didn't at the time, I cannot remember)
[09:28:46] <jynus>	 that may work now, probably
[09:28:54] <jynus>	 labsdb1005 wasn't ok at the time
[09:29:40] <jynus>	 in shards that must be s1-s7, x1, es2 and es3
[09:29:41] * marostegui reading
[09:30:06] <jynus>	 maybe that is easier (although that could change in the future)
[09:30:49] <volans>	 so can you confirm that labsdb1005 should NOT be included in the selection, right?
[09:30:51] <jynus>	 also "--defaults-file=.my.cnf" has to go away and --skip-ssl has to be introduced
[09:30:58] <jynus>	 it shouldn't
[09:31:00] <volans>	 ok
[09:31:03] <jynus>	 there is no labs master
[09:31:13] <volans>	 and no labs in codfw
[09:31:39] <jynus>	 yes, no labs master on codfw
[09:31:56] <volans>	 ok, thanks for the info!
[09:33:12] <jynus>	 I think G@mysql_group:core doesn't include x1
[09:33:26] <jynus>	 oh, it does
[09:33:53] <volans>	 yep
[09:33:59] <volans>	 just checked
[09:34:02] <volans>	 maybe it didn't at the time
[09:34:41] <jynus>	 I have updated https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_3_-_lock_down_database_masters.2C_cache_wipes
[09:34:46] <volans>	 I think I added them for s1-s7 for the switchover, probably didn't for x1, really cannot remember
[09:34:50] <volans>	 ok, thanks
[09:36:05] <jynus>	 I have to delete the puppet stuff on dbs, too
[09:37:17] <volans>	 yes, that I know
[09:42:36] <jynus>	 I have edited to the latest version
[09:43:42] <jynus>	 there is a worse problem (more theoretical, but I think important for automation) - the scripts use puppet as the source of truth
[09:44:40] <jynus>	 well, puppetization of salt grains - which could be even more different than mediawiki's actual state
[09:45:26] <jynus>	 I am not saying it is a problem for this switchover-but it definitely shouldn't stay like that in the future
[09:46:38] <volans>	 of course! also because if we end up deprecating salt we will remove the grains ;)
[09:46:59] <jynus>	 are you going to put that using cumin?
[09:47:14] <volans>	 "put that"?
[09:47:29] <jynus>	 aren't you doing a switch dc script?
[09:47:46] <volans>	 yes, it will be one of the tasks
[09:47:54] <volans>	 and yes, using cumin
[09:47:59] <volans>	 as a library
[09:48:01] <jynus>	 ok, maybe I am confused
[09:48:18] <jynus>	 why did you ask that- just as a check or to write that script?
[09:48:50] <volans>	 because I'm writing the task and wanted to be sure the selection was correct
[09:49:31] <jynus>	 so a couple of extra comments, if that is useful
[09:50:22] <jynus>	 we could get rid of "Deploy mediawiki-config with all shards set to read-only"
[09:50:22] <volans>	 sure
[09:50:37] <volans>	 how?
[09:50:38] <jynus>	 in the latest deployed mediawiki
[09:50:50] <_joe_>	  that would be great
[09:50:51] <jynus>	 because it does that if it detects masters as read only
[09:51:04] <volans>	 without throwing tons of errors?
[09:51:05] <jynus>	 _joe_, I told you about that numerous times
[09:51:14] <jynus>	 :-P
[09:51:17] <_joe_>	 jynus: but ^^ what volans said
[09:51:25] <jynus>	 that is a great question
[09:51:34] <_joe_>	 won't that cause lag/errors?
[09:51:41] <_joe_>	 uhm well we could see in codfw
[09:51:44] <jynus>	 lag no, because precisely
[09:51:54] <jynus>	 no writes, no lag
[09:51:57] <_joe_>	 they store that data in apc, do they?
[09:52:09] <_joe_>	 no i mean user lag
[09:52:14] <jynus>	 (pt-heartbeat always writes no matter read only)
[09:52:16] <_joe_>	 before mw decides it's readonly?
[09:52:20] <jynus>	 as a failsafe
[09:52:39] <_joe_>	 or does mw test if the master is readonly on *every* request?
[09:52:39] <jynus>	 now, if it is done "well" and it caches the errors
[09:52:47] <jynus>	 I would ask the implementer (aaron)
[09:52:51] <_joe_>	 exactly :)
[09:52:53] <jynus>	 and to be fair
[09:52:56] <jynus>	 do a proper test
[09:53:05] <_joe_>	 oh, if you can understand his answer, that's great
[09:53:07] <jynus>	 not because I do not trust him
[09:53:22] <jynus>	 but because I think it has never been tested large-scale
[09:53:30] <jynus>	 a full, large shard
[09:53:39] <jynus>	 _joe_, :-)
[09:54:32] <volans>	 lol
[09:54:53] <jynus>	 we could test on codfw
[09:55:06] <jynus>	 deploy one shard on read-write
[09:55:16] <jynus>	 try to do edits
[09:55:26] <_joe_>	 yes
[09:55:36] <_joe_>	 can you guys work on that?
[09:55:44] <jynus>	 on one side- edits are not that frequent
[09:55:47] <_joe_>	 me and volans are quite overwhelmed with stuff to do already
[09:55:53] <jynus>	 sure
[09:56:13] <_joe_>	 well you can make a curl call to codfw's appservers to test it
[09:56:18] <jynus>	 don't count on it working
[09:56:34] <jynus>	 as in, 100% sure we will not have to deploy the patch
[10:00:40] <wikibugs_>	 DBA, Labs, Labs-Infrastructure: ug_expiry column of the user_groups table is not present on Labs - https://phabricator.wikimedia.org/T160686#3109297 (Marostegui) Just to clarify: Moved it in our internal DBA dashboard to the "not db team" as this is normally handled by Labs.
[10:36:15] <moritzm>	 on neodymium and sarin we have both wmf-mariadb101-client and the mariadb client packages from jessie installed (mariadb-client, mariadb-client-10.0, mariadb-client-core-10.1), shall we remove the jessie ones? I have no idea why they're installed, can't find something in puppet and also no package reverse dependencies on these hosts
[10:36:50] <jynus>	 yes, I said I was going to delete those
[10:36:57] <moritzm>	 great, thanks
[10:37:23] <moritzm>	 then I'll skip the upgrades for those (since they were updated in jessie to .30)
[10:43:43] <jynus>	 done
[10:46:16] <moritzm>	 thx
[10:51:56] <wikibugs_>	 DBA, Operations, Phabricator, ops-eqiad: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3109357 (Marostegui) Unfortunately, db1047's BBU looks totally broken, it is not making any sense in what it reports. Some places it says it is fully charged, some others don't,...
[10:57:31] <wikibugs_>	 DBA, Gerrit, Operations, Release-Engineering-Team, Patch-For-Review: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#3109383 (Paladox) Upstream gerrit has added in built support for the MariaDB connector. So we won't need it pac...
[10:58:56] <volans>	 jynus: why did you remove a phase in the DC switchover wiki? that part will go away eventually, but for now, if it's staying, it has to be in a different phase than the RO/RW on the DBs
[10:59:09] <volans>	 all the steps in a phase can (and probably will) be run in parallel
[10:59:58] <_joe_>	 volans: we might need to reassess that
[11:00:50] <volans>	 yes, but if we need it, the commit in mediawiki-config has to be done after the DBs are in RW, not in parallel
[11:09:22] <jynus>	 why?
[11:10:42] <volans>	 because if we deploy mediawiki-config as RW first (assuming the fix we talked about above is still not fully working) we will start getting all kinds of errors in MW
[11:10:59] <volans>	 the 2 changes have to be done in sequence
[11:11:14] <jynus>	 I disagree, but feel free to change it
[11:11:24] <volans>	 otherwise why are phase 2 and 3 separated?
[11:11:28] <volans>	 it's the same thing
[11:11:39] <jynus>	 yes
[11:11:44] <jynus>	 they should be merged
[11:11:52] <jynus>	 the difference
[11:12:03] <jynus>	 is that deploying a change takes 2 minutes
[11:12:10] <jynus>	 to mediawiki
[11:12:20] <jynus>	 while setting read-write/read only is instant
[11:12:39] <volans>	 I don't care if one takes 1s and the other 1m, if they need to be logically in sequence why put them in parallel
[11:12:50] <jynus>	 they do not need to be logically in sequence
[11:13:00] <jynus>	 if they take more time, errors are a natural thing to happen
[11:13:25] <jynus>	 it means db change took too much time, and errors should happen
[11:13:48] <jynus>	 and again, that is assuming the automatic read-only mode doesn't work
[11:14:01] <volans>	 yes, I'm assuming that of course
[11:14:08] <volans>	 otherwise one step is just removed
[11:14:31] <jynus>	 also, the step was badly named
[11:14:40] <jynus>	 it said "failover masters"
[11:14:51] <jynus>	 so I deleted such a step, there is no master failover
[11:15:13] <jynus>	 feel free to add a new step, but it should have a proper name
[11:16:53] <wikibugs_>	 Blocked-on-schema-change, DBA, Wikidata, Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010#3109422 (Marostegui) a:Marostegui>None
[11:17:02] <volans>	 that's the wiki of last year, the name is the last thing. Why do you think that they don't need to be logically in sequence?
[11:17:39] <volans>	 you first tell MW to put a nice message that the site is RO and then put RO the DB, otherwise users will get errors/retries and log will get errors
[11:17:54] <jynus>	 no logical problems happen when that is done out of order
[11:18:15] <volans>	 order = sequence
[11:18:43] <jynus>	 well, then I would break down the other phases
[11:18:54] <jynus>	 I wouldn't wipe caches on phase 3
[11:19:01] <jynus>	 until masters are read-only
[11:19:54] <volans>	 sure, agree
[11:20:10] <jynus>	 again, it is a wiki
[11:20:12] <jynus>	 just edit
[11:20:19] <jynus>	 I saw a wrong phase
[11:20:28] <jynus>	 I corrected it by merging it
[11:20:39] <jynus>	 I will not revert if you split it further
[11:20:50] <volans>	 ok, don't worry
[11:21:25] <jynus>	 it said phase 6 - database master swap
[11:21:36] <jynus>	 that is not a phase
[11:23:27] <volans>	 regarding " new site's read-only master/slaves are caught up", any suggestions on the best way to ensure that?
[11:23:36] <volans>	 querying the heartbeat table?
[11:23:40] <jynus>	 no
[11:24:10] <jynus>	 checking the binary position of the old master
[11:24:25] <jynus>	 then doing gtid_pos_wait on the other master
[11:24:29] <jynus>	 or similar
[11:25:34] <volans>	 ok
[11:30:31] <volans>	 jynus: also, nothing has to be done for pt-heartbeat right? It's already active/active from both masters
[11:31:05] <jynus>	 yes
[11:31:11] <volans>	 great
[11:31:20] <jynus>	 the variable has to change on mediawiki
[11:31:44] <jynus>	 in fact, nothing technically should be needed for databases
[11:31:57] <jynus>	 but we do not trust mediawiki not to write in read-only mode
[11:32:02] <volans>	 yeah
[11:38:03] <jynus>	 even if it looks like I am doing nothing- I am exporting and doing row checks like crazy right now for T154485
[11:38:04] <stashbot>	 T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[11:40:21] <jynus>	 that may show some crazy spikes on: https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s2&var-role=master
[11:49:54] <wikibugs_>	 DBA: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3107862 (Marostegui) You currently don't have DROP privilege, which would give you TRUNCATE grants too. This is what you have for testreduce_vd ``` GRANT SELECT, INSERT, UPDATE, DELETE, ALTER ON `testreduce_vd`.*...
[12:31:13] <wikibugs_>	 DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3109518 (jcrespo) Unless anyone else says so, I will reimage the old server on Monday. Last chance to check data and functionality works on the ne...
[12:47:10] <wikibugs_>	 DBA, Labs, Labs-Infrastructure: ug_expiry column of the user_groups table is not present on Labs - https://phabricator.wikimedia.org/T160686#3109542 (chasemp) Open>Resolved should be good to go, let me know if not
[13:25:26] <wikibugs_>	 Blocked-on-schema-change, DBA, Multimedia, MW-1.29-release (WMF-deploy-2017-03-21_(1.29.0-wmf.17)), and 3 others: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415#3109616 (Marostegui) dbstore2002 and db2058 are done: ``` root@ne...
[13:26:11] <wikibugs_>	 Blocked-on-schema-change, DBA, Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3109619 (Marostegui) dbstore2002 and db2058 are done: ``` root@neodymium:~# for i in db2058 dbstore2002; do echo $i; mysq...
[14:20:15] <wikibugs_>	 DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3109759 (Halfak) OK.  Time to try to ping the larger set of people who have databases here.  Here's the databases that match Phab users: * @dartar (dartar) * @drdee (diederi...
[14:21:47] <wikibugs_>	 DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109766 (Marostegui) Open>stalled Let's block this as db1047 might be decommissioned soon as per: T156844
[14:32:39] <wikibugs_>	 DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3109809 (Ottomata) Wow, old DB!  decleramaul, nimish and rfaulk we can certainly get rid of.
[14:55:16] <wikibugs_>	 DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109863 (Cmjohnson) @marostegui There are a few decom db's now I could swap out the bbu if you like or just proceed with the decom process.   Let me know your prefe...
[14:58:01] <wikibugs_>	 DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109864 (Marostegui) Hey @Cmjohnson! let's wait to see if that ticket keeps progressing for now, if the server is going to get decommissioned it would be just a was...
[14:59:22] <wikibugs_>	 DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109866 (jcrespo) @Marostegui precisely on that ticket they are discussing when they will be able to decom it, and it is not going to happen per months as it looks....
[14:59:28] <wikibugs_>	 DBA: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109867 (ssastry) Open>declined I can work with that. I was mostly checking in case you guys had thoughts about this.
[15:02:19] <wikibugs_>	 DBA: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109872 (jcrespo) declined>Open a:jcrespo I think we could give truncate grants on that database to ssastry. He demonstrated a responsible usage of the database- we shouldn't put burdens to it if it ma...
[15:09:35] <volans>	 jynus: what did you mean before with doing gtid_pos_wait on the other master?
[15:09:50] <jynus>	 volans, let me show you code
[15:09:56] <jynus>	 it is difficult to explain in prose
[15:09:59] <volans>	 sure
[15:10:12] <jynus>	 let me show you how mediawiki does it and you can be inspired or copy it, etc.
[15:10:35] <jynus>	 volans, one sec that I finish something
[15:10:38] <_joe_>	 look, we're a couple of dumb developers
[15:10:45] <_joe_>	 just paste us the sql query
[15:10:47] <jynus>	 ?
[15:10:54] <_joe_>	 and we will blindly use it in our code :P
[15:10:56] <jynus>	 _joe_, I do not know by heart
[15:11:05] <jynus>	 that is why I have to search it
[15:11:15] <_joe_>	 jynus: j/k ofc
[15:11:22] <jynus>	 and I trust mediawiki code than my hacks
[15:11:25] <jynus>	 *more
[15:11:33] <volans>	 ok, otherwise I can get the pos on the old and check that exec on the new is >= of that
[15:12:32] <jynus>	 includes/libs/rdbms/database/DatabaseMysqlBase.php
[15:12:40] <jynus>	 SELECT MASTER_GTID_WAIT($gtidArg, $timeout)
[15:12:58] <jynus>	 so that waits up to $timeout seconds
[15:13:23] <jynus>	 and returns a value depending on whether it waits successfully or it times out
[15:13:27] <wikibugs_>	 DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109921 (Marostegui) >>! In T159266#3109866, @jcrespo wrote: > @Marostegui precisely on that ticket they are discussing when they will be able to decom it, and it i...
[15:13:44] <jynus>	 the value to wait for is "SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos'"
[15:13:55] <jynus>	 I can give you the whole line
[15:14:01] <jynus>	 if you give me 5 minutes
[15:14:10] <jynus>	 to finish 1 thing I am about to finish
[15:14:20] <volans>	 probably that's enough, thanks, finish your thing, no hurry
[15:14:27] <jynus>	 I want to help
[15:14:32] <jynus>	 I just want to finish this first
[15:14:50] <jynus>	 I may even be able to send you a CR
[15:14:59] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109930 (Marostegui) >>! In T160691#3109872, @jcrespo wrote: > I think we could give truncate grants on that database to ssastry. He demonstrated a responsible usage of the database- we shoul...
[15:15:17] <jynus>	 marostegui, "I had the impression that for db1047 it was going to be pretty fast"
[15:15:23] <jynus>	 :-D
[15:15:47] <jynus>	 I will believe it when I see it
[15:16:13] <marostegui>	 XDDD
[15:16:18] <jynus>	 and I am not criticizing anyone- we wanted labs gone a year ago- and that is with us as blockers
[15:16:19] <marostegui>	 I said from the comments! :)
[15:16:32] <jynus>	 imagine with us, analytics and research at the same time...
[15:20:23] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3109945 (jcrespo) @ssastry check if the above patch is what you want^.  Of course, with great power comes great responsibility :-) - which means if you drop a table (or the script does it for...
[15:20:31] <jynus>	 ok
[15:20:35] <jynus>	 I am with you now, volans 
[15:21:49] <wikibugs_>	 DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109947 (Ottomata) Given the responses so far, I think we will be able to decom it soon.  But, we should wait a while (maybe a week) to collect more feedback to be...
[15:21:50] <volans>	 thanks, but I get the point
[15:22:03] <jynus>	 have you considered integrating WMFMAriaDB.py on your scripts?
[15:22:06] <volans>	 I get the gtid pos on the eqiad masters
[15:22:20] <jynus>	 it will be easier if python talks to python
[15:22:35] <jynus>	 rather than going through a shell output
[15:22:47] <jynus>	 especially to check 10 masters
[15:23:34] <jynus>	 https://mariadb.com/kb/en/mariadb/master_gtid_wait/
[15:23:53] <jynus>	 If the wait completes without a timeout, 0 is returned.
[15:24:03] <jynus>	 If the timeout expires before the specified GTID position is reached, then the function returns -1
[15:24:12] <volans>	 yep, already tested
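The catch-up check discussed above can be sketched roughly as follows. This is a minimal illustration, not the switchover script itself: the helper names `build_gtid_wait_query` and `caught_up` are hypothetical, and the sketch assumes the GTID position was already read from the old master (via `SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos'`) and that the query is then run on the new master by whatever transport is in use. Only the return-value semantics (0 on success, -1 on timeout) come from the MariaDB documentation quoted in the log.

```python
import re

# A MariaDB GTID has the form "domain-server_id-sequence"; a position is a
# comma-separated list of such triples. Validating it keeps the value safe
# to interpolate into the SQL string below.
_GTID_POS_RE = re.compile(r"\d+-\d+-\d+(,\d+-\d+-\d+)*")


def build_gtid_wait_query(gtid_pos: str, timeout_s: int) -> str:
    """Build the statement to run on the new master (hypothetical helper)."""
    if not _GTID_POS_RE.fullmatch(gtid_pos):
        raise ValueError(f"unexpected GTID position: {gtid_pos!r}")
    return f"SELECT MASTER_GTID_WAIT('{gtid_pos}', {int(timeout_s)})"


def caught_up(wait_result: int) -> bool:
    """Interpret MASTER_GTID_WAIT: 0 = position reached, -1 = timed out."""
    if wait_result == 0:
        return True
    if wait_result == -1:
        return False
    raise ValueError(f"unexpected MASTER_GTID_WAIT result: {wait_result!r}")
```

Keeping the query construction and the result interpretation as pure functions means they can be reused whether the statement is executed through a remote shell command or a native MySQL connection.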
[15:27:00] <wikibugs_>	 DBA, Operations, ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3109954 (Marostegui) Let's wait then to see how that ticket progress next week or so in order not to make Chris to replace it and then a few days later decom the se...
[15:27:48] <_joe_>	 jynus: so actually while I think it is useful in general, in this specific case we want to parallelize connections to all masters
[15:28:05] <_joe_>	 doing so in python requires more time than parsing a bash output in python
[15:28:15] <_joe_>	 as cumin can already execute everything in parallel
[15:28:15] <jynus>	 what?
[15:28:26] <jynus>	 yes, I am not saying to use cumin
[15:28:31] <jynus>	 *not use
[15:28:43] <jynus>	 ah, I get you
[15:28:52] <jynus>	 cumin does not have "native mysql execution"
[15:28:58] <_joe_>	 yes
[15:29:09] <_joe_>	 because parallelization is done by the transport
[15:29:11] <jynus>	 buah ha ha, I am thinking badly
[15:29:20] <jynus>	 something that volans will not like
[15:29:27] <jynus>	 which is integrating it :-)
[15:29:33] <_joe_>	 cumin can just execute remote commands
[15:29:38] <jynus>	 but not now
[15:29:49] <_joe_>	 actually, you can just create your own library and integrate it in single cumin tasks
[15:29:50] <volans>	 we can add a mysql transport if we see fit
[15:29:58] <volans>	 at some point ;)
[15:30:00] <jynus>	 it is just that it sounds in some cases a waste 
[15:30:08] <_joe_>	 we'll show you when we have something more serious
[15:30:08] <jynus>	 to connect locally to run mysql
[15:30:26] <jynus>	 when we could just run mysql- not in this case
[15:30:34] <_joe_>	 yup
[15:30:35] <jynus>	 for things like imports
[15:31:05] <jynus>	 or to pass complex data messages
[15:31:19] <jynus>	 anyway, you know I have more ideas than time to commit to them
[15:32:00] <volans>	 :)
[15:47:11] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110035 (ssastry) Thanks. That works! Agreed about power and responsibility. :-)
[15:50:37] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110043 (jcrespo) @ssastry sorry- I may not have been 100% clear: Question- does your data need backup, and which of your databases need it? We provide that service, but we need to know which...
[15:55:27] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110054 (ssastry) Ah, okay! We don't need backups of these databases. These are primarily used for testing and losing the contents is not catastrophic. But, if we decide otherwise in our team...
[15:56:54] <wikibugs_>	 DBA, Analytics, Analytics-EventLogging, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#3110057 (Nuria) ping @Jdforrester-WMF looks like you need to remove the instrumentation that is sending events.
[16:02:37] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110074 (jcrespo) Thanks, I will deploy the change, please test that it works for you when done.
[16:53:43] <wikibugs_>	 DBA, Labs, Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3110246 (jcrespo)
[16:53:47] <wikibugs_>	 DBA, Labs, Tool-Labs, Tool-Labs-tools-Xtools: `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003 - https://phabricator.wikimedia.org/T133321#3110244 (jcrespo) Resolved>Open
[17:00:22] <wikibugs_>	 DBA, Labs, Tool-Labs, Tool-Labs-tools-Xtools: `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003 - https://phabricator.wikimedia.org/T133321#3110269 (Matthewrbowker) The cleanup job has been running successfully.   I ran it manually, here is the output.  ``` 16:54 [xtoo...
[17:01:14] <wikibugs_>	 DBA, Labs, Tool-Labs: u3532__ (=marcmiquel) table using 64G on labsdb1001 and 108 GB on labsdb1003 - https://phabricator.wikimedia.org/T133322#3110275 (jcrespo)
[17:02:39] <wikibugs_>	 DBA, Labs, Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3110283 (jcrespo)
[17:02:42] <wikibugs_>	 DBA, Labs, Tool-Labs: u3532__ (=marcmiquel) table using 64G on labsdb1001 and 108 GB on labsdb1003 - https://phabricator.wikimedia.org/T133322#2228158 (jcrespo) Resolved>Open labsdb1003 is now constrained, and one of your databases has >100GB in space. They look like simple copies of produc...
[17:11:13] <wikibugs_>	 DBA, Labs, Labs-Infrastructure, Operations, Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3110330 (MaxSem) The new server works for me. The upgrade also resolved T145599. Thank you!
[17:18:55] <wikibugs_>	 DBA, Labs, Tool-Labs: labsdb1001 and labsdb1003 short on available space - https://phabricator.wikimedia.org/T132431#3110371 (jcrespo)
[17:19:00] <wikibugs_>	 DBA, Labs, Tool-Labs, Tool-Labs-tools-Xtools: `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003 - https://phabricator.wikimedia.org/T133321#3110369 (jcrespo) Open>Resolved > What is the difference between labsdb1001 and labsdb1003? Does labsdb1001 correlate to...
[17:29:37] <wikibugs_>	 DBA: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3110389 (Marostegui) Finished running pt-table-checksum on frwiki. Differences found on: ``` Differences on db1030 frwiki.archive ```  ``` Differences on dbstore1002 frwiki.archive frwiki.page_props frwiki.wbc_entity_usage ```
[18:49:30] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110583 (jcrespo) Please test the changes and see if it works.
[18:59:20] <wikibugs_>	 DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3110616 (GLCiampaglia) I am pretty sure my tables can be safely deleted. Thanks for the heads up!  Giovanni
[19:33:57] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3110658 (ssastry) Thanks! A test is running right now (and will probably take another day to finish), so I won't be able to test this right now. But, will try it out after the test run finishes.
[21:11:47] <wikibugs_>	 DBA, Patch-For-Review: Refreshing testreduce_vd database on ruthenium - https://phabricator.wikimedia.org/T160691#3111007 (ssastry) Open>Resolved I had to terminate the test for disk space reasons. I reduced the test corpus size .. and long story short .. i was able to run truncate successfully.
[22:58:25] <wikibugs_>	 DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3111341 (Tbayer) I have been using `db1047` quite frequently for EventLogging queries as an alternative to `dbstore1002`, either because it was (at times) much faster, or in...