[05:23:03] DBA, Data-Services: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#1677380 (Marostegui) They are not being replicated because they are marked as private table: https://github.com/wikimedia/puppet/blob/production/manifests/realm.pp#L201 Why? That I don't kn...
[06:54:47] DBA, Operations, ops-eqiad: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3572381 (Marostegui)
[06:55:06] DBA, Operations, ops-eqiad: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3572394 (Marostegui) p:Triage>Normal
[06:56:35] db2044 down?
[06:59:07] DBA: db2044 HW issues - https://phabricator.wikimedia.org/T174764#3572397 (Marostegui)
[07:02:32] DBA: db2044 HW issues - https://phabricator.wikimedia.org/T174764#3572409 (Marostegui) From the logs: ``` /system1/log1/record16 Targets Properties number=16 severity=Critical date=09/01/2017 time=06:18 description=Drive Array Controller Failure (Slot 0) ```
[07:10:56] DBA: db2044 HW issues - https://phabricator.wikimedia.org/T174764#3572418 (Marostegui) I have rebooted the server because it was basically unresponsive and it has come back fine apparently. I will do some more checks before starting MySQL
[07:16:33] DBA, Operations, ops-codfw: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3572438 (jcrespo)
[07:17:03] I saw it but as it was passive wanted to have a look at it later
[08:32:26] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572591 (jcrespo) Security seems to agree on it being possibly public: https://phabricator.wikimedia.org/T103011#3536648 I will ask him to +1 that patch.
[08:38:54] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572605 (jcrespo) Once that is merged, it may take some time to deploy, as we have to create the tables copying it from production- apparently it is only used on wikis...
[09:41:59] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572734 (Marostegui) Yeah - I thought about creating it on enwikisource to unblock this ticket first and then slowly create it on the big wikis and so forth
[09:46:45] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572744 (jcrespo) > I thought about creating it on enwikisource to unblock this ticket first Sadly, it is all or none, because replication will break otherwise (break...
[10:08:10] Looks like mariadb is working to fix the multisource+gtid things!
[10:12:58] DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3572774 (jcrespo) image is progressing slowly- it is a large table and it has lots of differences (rows existing on certain hosts that should be deleted images).
[10:18:31] DBA, Operations, ops-codfw: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3572787 (Marostegui) Open>Resolved a:Marostegui After rebooting the server again, everything looks good again and I see no more HW errors. I have started mysql and replication and everything is...
[10:19:23] DBA, Operations, ops-codfw: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3572790 (Marostegui) p:Triage>Normal
[10:27:33] do you have a ticket?
[10:27:44] for what?
[10:28:10] Looks like mariadb is working to fix the multisource+gtid things!
[10:28:19] ah
[10:28:20] yes
[10:28:32] https://jira.mariadb.org/browse/MDEV-12012
[10:28:37] thanks
[10:28:42] i need to check the comments, just saw the update on my mail
[10:49:20] ping me if you remember when db1081 is back up - I think I will have to do some purging of some rows or replication could break
[10:49:39] (I made the same change to all other hosts)
[10:51:52] yeah, it is about to get up
[10:52:15] it is not a rush, but I wanted to get access before you start slave
[10:52:21] to avoid problems
[10:52:36] ah
[10:52:39] it is now up
[10:52:41] with replication stopped
[10:52:55] ok, let me check it, get blocked on me
[10:53:00] sounds good!
[10:53:59] ok, done
[10:54:03] you can start slave
[10:54:16] ok!
[10:54:17] I will share the list of deletions with you/publicly when I have done all
[10:54:19] that was fast
[10:54:22] thanks
[11:10:33] This was https://commons.wikimedia.org/w/index.php?title=File:Air_Traders_Vickers_Vanguard_at_Ibiza_Airport.jpg&redirect=no
[11:10:56] <3
[11:11:04] the page was moved but it showed an image on some hosts
[11:15:37] image is much slower but reliable to fix, because with the wiki logs you know what is the right state
[11:16:52] it is a lot of time but it will unlock so many things, and it will only have to be done once
[11:18:53] yeah, i imagine you like: https://bristolenos.files.wordpress.com/2016/10/dia_arqueologo.jpg?w=640
[11:21:57] I actually was fixing most of the time 2005 records
[11:22:24] 2005??
[11:22:25] I think most of these issues date back to INSERT ... SELECT days
[11:22:25] wow
[11:22:44] which is nice it is corrected on code
[11:22:53] but the older records stay
[11:41:57] Blocked-on-schema-change, DBA, Patch-For-Review: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3572877 (Marostegui) All s4 hosts have been upgraded to 10.0.32. Obviously not the master (see below) So the remaining steps are: - DBAs: to alter s1 mas...
[11:48:01] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572881 (Reedy) Wow... It's on enwiki https://github.com/wikimedia/mediawiki-extensions-ProofreadPage/blob/master/sql/ProofreadIndex.sql I'd propose dropping it from...
[11:50:09] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572883 (Marostegui) >>! In T113842#3572881, @Reedy wrote: > Wow... It's on enwiki > > https://github.com/wikimedia/mediawiki-extensions-ProofreadPage/blob/master/sql...
[11:51:07] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572887 (jcrespo) > I'd propose dropping it from anywhere that isn't in this list... I would support such an action, if it is an optional extension, it should not crea...
[11:54:46] DBA, Data-Services, Patch-For-Review: `pr_index`to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#3572890 (Reedy) I'm guessing someone messed up onetime running it over all wikis, rather than a subset. Or at some point, maybe we didn't have a way of making it easie...
[11:56:03] DBA: Drop pr_index from wikis where ProofreadPage isn't enabled - https://phabricator.wikimedia.org/T174782#3572891 (Reedy)
[12:00:40] DBA: Drop pr_index from wikis where ProofreadPage isn't enabled - https://phabricator.wikimedia.org/T174782#3572923 (jcrespo)
[12:03:13] DBA: Drop pr_index from wikis where ProofreadPage isn't enabled - https://phabricator.wikimedia.org/T174782#3572924 (jcrespo)
[12:03:17] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3572925 (jcrespo)
[12:07:16] DBA, Operations, ops-eqiad: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3572931 (Marostegui)
[12:23:00] DBA, Operations, ops-eqiad, Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3572948 (Marostegui)
[12:23:14] DBA, Operations, ops-eqiad, Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3572381 (Marostegui)
[13:10:28] DBA, Operations, ops-eqiad, Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3573026 (Marostegui)
[13:13:35] DBA, Operations, ops-eqiad, Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3573040 (Marostegui) a:Cmjohnson db1026 is now ready to be decommissioned and all the pending steps are DC Ops ones, so I am handing this over to @Cmjohnson
[13:39:31] DBA: Drop pr_index from wikis where ProofreadPage isn't enabled - https://phabricator.wikimedia.org/T174782#3573115 (Marostegui) These empty tables currently exist on the following wikis, and should be dropped from there: s1 (empty) - nothing to backup ``` enwiki ``` s2 (all empty) - nothing to backup ``` bg...
[13:44:40] DBA: Drop pr_index from wikis where ProofreadPage isn't enabled - https://phabricator.wikimedia.org/T174782#3573126 (Marostegui) I have renamed the table on enwiki, on db1089 just to make sure nothing breaks ``` root@db1089[enwiki]> show tables like 'T17%'; +-------------------------+ | Tables_in_enwiki (T17...
[15:03:45] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3573336 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1101.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reim...
[15:25:33] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3573403 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1101.eqiad.wmnet'] ``` and were **ALL** successful.
[15:28:46] DBA: Drop flaggedrevs tables on wikis where it is not enabled - https://phabricator.wikimedia.org/T174801#3573416 (demon)
[15:29:42] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3573431 (Marostegui)
[15:30:13] DBA, Epic: Drop education program (ep_*) tables on wikis where it is not enabled - https://phabricator.wikimedia.org/T174802#3573433 (demon)
[15:30:22] DBA: Drop education program (ep_*) tables on wikis where it is not enabled - https://phabricator.wikimedia.org/T174802#3573449 (demon)
[15:36:50] marostegui: Curious...can there be any kind of replication issues if one drops tables that are already empty?
[15:37:02] re: T174802
[15:37:03] T174802: Drop education program (ep_*) tables on wikis where it is not enabled - https://phabricator.wikimedia.org/T174802
[15:37:33] Still prefer to file tasks and let y'all handle it, but wondering :)
[15:37:45] Presumably it replicating onto slaves with no rows is quick (ish)
[15:38:47] That's what I assumed, but figured couldn't hurt to ask
[15:39:58] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3573469 (Marostegui)
[15:41:36] no_justification: If dropped with "if exists", then no
[15:41:45] But yeah, better to open a task so it is tracked and we can handle it
[15:42:25] no_justification: If dropped without: if exists, if for whatever reason it doesn't exist on any or some slaves, replication will break
[15:44:45] no_justification: yeah, I would keep the monopoly of drops for one reason: potential ongoing maintenance
[15:47:03] let me give you an example: Database A is backed up, table is dropped, Database is recovered- your table has appeared magically
[15:47:55] there is nothing to avoid that except lots of coordination
[15:48:28] Yeah totally. Not trying to rake over you both do a great job as is on these cleanup tasks. 
Was mostly asking for my own knowledge :)
[15:48:49] sure
[15:48:52] *take over
[15:49:00] I actually would love to get some help there
[15:49:13] DROP ALL OF THE TABLES
[15:49:29] but that normally would take a huge burden in keeping with sync with not one but other 2 people
[15:50:08] even going 1 -> 2 was quite a change, and we both have lots of experience working on the same dbs with other people
[15:50:09] Reedy: that would fix any disk space issue we might have
[15:50:29] resolved wfm
[15:50:32] I think the best way to help is to help classifying and getting things prepared
[15:50:45] which you are already doing, and I am thankful for that
[15:50:57] Our goals are aligned here :)
[15:51:32] for example, sometimes the tasks are lacking in details, so we have to see what has to be dropped and where
[15:51:45] Just training us to tag/categorise things properly so you can find the stuff... :D
[15:52:09] Yeah. The two I just filed need investigation, which I plan to do
[15:52:20] Gotta find the diff of what wikis have the tables and what wikis *need* the tables
[15:52:36] and I am sure you both have more mediawiki experience than both of us combined
[15:53:17] sometimes we get "drop abracadabra tables"
[15:53:34] I am like WTF are the abracadabra tables?
[15:53:35] is it an extension
[15:53:43] no_justification: If you ask nicely, the DBAs seem to be able to check for table existence across all dbs pretty quickly
[15:53:45] is it on enwiki, x1, wikitech?
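The "diff of what wikis have the tables and what wikis *need* the tables" mentioned above, combined with the earlier point that a drop with "if exists" does not break replication, could be sketched roughly as follows. This is only an illustration: the function names and the wiki lists are hypothetical, and nothing here reflects the DBAs' actual scripts. Running the drops would still go through a task, as discussed.

```python
def wikis_to_clean(wikis_with_table, wikis_needing_table):
    """Wikis where the table exists but is not needed (sorted for stable output)."""
    return sorted(set(wikis_with_table) - set(wikis_needing_table))


def drop_statements(table, wikis):
    """Build one DROP per wiki; IF EXISTS avoids an error where the table is already gone."""
    return [f"DROP TABLE IF EXISTS `{wiki}`.`{table}`;" for wiki in wikis]


# Illustrative input only (assumption), not a real inventory of the fleet.
have = ["enwiki", "enwikisource", "examplewiki"]
need = ["enwikisource"]

for stmt in drop_statements("pr_index", wikis_to_clean(have, need)):
    print(stmt)
```

The sketch only produces statements for review; it deliberately does not connect to any database, since the coordination concerns raised above (backups, ongoing maintenance) are about who executes the drop and when.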
[15:53:59] yep, we have some scripts
[15:54:20] https://gerrit.wikimedia.org/r/#/c/354206/
[15:54:31] Wonder if either of you could re-run the searches that got us the initial table on T54921
[15:54:32] T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921
[15:54:39] although the goal is to have a better inventory system at some point
[15:54:57] https://phabricator.wikimedia.org/T104459
[15:54:57] incremental improvements ftw
[15:55:42] a database of databases
[15:56:23] so meta
[15:56:25] much data
[15:56:25] which may sound silly, but we probably have billions of objects (indexes, columns, tables) on all fleet
[15:56:25] wow
[15:56:52] Certainly knowing what tables are where, why, and from what source etc would be nice
[15:56:58] yep
[15:57:05] also for schema changes
[15:57:22] 90% of the time is to know where to do them, 10% actually do them
[15:57:27] heh
[15:57:45] I'm still bemused how pr_index got everywhere
[15:57:49] like, how many servers are pending/failed in the middle
[15:57:56] Actually.. Could someone look when the files on disk were created on a wiki it's not used?
[15:58:53] Mar 12 2013
[15:59:07] although the ibd was created on Mar 22 2013
[15:59:24] So it's not a recent thing at least
[15:59:29] 2013...? I would've expected it to have been older
[15:59:31] tbh
[15:59:35] it could be older
[15:59:41] it could have been backed up
[15:59:44] and restored
[16:00:00] or rebuilt as part of regular maintenance
[16:00:05] Ah true true
[16:00:36] You know what I imagine happened? Someone used foreachwiki instead of foreachwikiindblist
[16:00:42] So created...everywhere?
[16:00:46] yeah
[16:00:50] Indeed
[16:01:01] I would like to have some kind of dashboard
[16:01:08] that you people can also access
[16:01:17] for non-alert worthy
[16:01:22] but strange stuff
[16:01:32] tables growing too large
[16:01:39] tables that shouldn't be there
[16:01:41] etc.
[16:03:35] EducationProgram is a good target right now. There's a (sorta) clear migration path from using it so we're slowly winding down what wikis use it. Cuz it adds 10 tables....
[16:04:41] I do not mind tables that are rarely used, but I bet we have a lot that shouldn't be there, and when multiplied by 900 wikis, that ends up adding up
[16:04:50] Indeed!
[16:05:06] querying all tables on s3 now takes at least 10 minutes
[16:05:53] Yeah, rarely used is better than not used :P
[16:06:04] I'm still surprised how long some of the tmp and old tables have hung around
[16:06:11] information_schema got some indexes on 8.0, but we are far away from there
[16:06:18] and pre_mwversion
[16:06:38] yeah, as a dba, you do a test, then you forget because you have 100.000 other things to do
[16:06:51] and more before if there was even not a dedicated dba
[16:07:45] Do we have a cron/something that can regularly truncate certain tables? For example, updatelog we don't really use in production, but sometimes a maintenance script might inject something there.
[16:07:47] Just 100,000?
[16:07:50] You're slacking
[16:08:01] I know, it's summer!
[16:08:05] Like, we've already cleared X table before, but we really don't want stuff cluttering it again
[16:08:11] at least on the north hemisphere
[16:09:36] it's raining?
[16:09:38] Yup, that's summer
[16:09:50] no_justification: ufff that brings some bad memories back
[16:09:56] crons going wrong...
[16:10:01] and truncating actual production data
[16:10:14] Fair 'nuff :)
[16:11:17] DBA, Epic, Tracking: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#3573512 (demon)
[16:11:29] DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#3573526 (demon)
[16:11:37] Could've sworn we'd done that one
[16:11:43] But no task, and I'm seeing data all over the place for it
[16:11:54] I guess any logged maintenance subclass will add to it
[16:12:15] Reedy: On enwiki, lol. 
(ul_key, ul_value): 1.17wmf1-final, NULL
[16:12:30] woop
[16:12:54] There's some funny shit here.
[16:13:02] Also: updatelog is dumb, I never should've expanded its use
[16:13:16] Scripts should /detect/ if they need to do work, not rely on some silly table
[16:13:18] :)
[16:13:27] heh
[16:30:59] DBA, Operations, ops-eqiad: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3573553 (Marostegui)
[16:31:16] DBA, Operations, ops-eqiad: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3573566 (Marostegui) p:Triage>Normal
[16:40:46] DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3573581 (jcrespo) image is done until letter C- This may sound like a small portion, but according to the records, it is around 2/3rds of the way.
[16:42:51] ^ nice one, great tedious work
[16:45:09] image/oldimage and page/revision/archive are the ones which actually caused outages in the past
[16:45:16] on s4
[16:46:25] so why do we have image and oldimage and oldimage is still in use?
[16:46:31] do you know the history of that?
[16:46:49] Same as revision/archive
[16:47:03] Basically: deleted stuff gets moved to the oldimage/archive table
[16:47:23] ah right!
[16:47:28] thanks
[16:47:52] Now, those tables have _deleted columns now, so we could--in theory--move to where we just flag those and never copy/delete the entries
[16:48:03] But right now those columns are only used for selective individual revision/image deletion
[16:48:16] the plan, probably
[16:48:32] That was always an end-goal when we moved Oversight into core as RevDeletion
[16:48:37] is to make image/oldimage behave more like revision/page
[16:49:26] https://phabricator.wikimedia.org/T28741
[16:49:41] but you still need to get them fixed anyway
[16:55:20] Authored By
[16:55:21] Catrope, Jan 15 2011
[16:55:24] hehehe
[16:55:36] nah, there was a more recent meeting about it
[16:56:14] https://phabricator.wikimedia.org/E228
[16:58:54] ah right
[17:08:27] this confirms load issues is not only a tokudb thing: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=dbstore1001&var-port=9104&from=now-7d&to=now
[17:08:48] *innodb, also affects tokudb
[17:15:16] yeah i think we are really now reaching all the dbstore limits
[17:15:19] replication-wise
[17:18:23] Could I bug one of y'all for a quick root action on tin? I just need /srv/deployment/STALE removed. This is per T170881, specifically the last comment
[17:18:24] T170881: Cleanup /srv/deployment - https://phabricator.wikimedia.org/T170881
[17:33:16] Nvm actually :)