[01:54:25] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10Papaul)
[01:54:34] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10Papaul) p:05Triage→03Medium
[02:02:01] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Andrew) ` mysql:root@localhost [labtestwiki]> alter table page change page_restrictions page_restrictions varbinary(255) DEFAULT ''; Query OK, 27971 rows affected (3.80 sec)...
[05:09:54] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Marostegui) Yes! Thank you :)
[05:10:02] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Marostegui)
[05:46:22] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui)
[05:49:16] 10DBA: Decommission dbproxy1010.eqiad.wmnet - https://phabricator.wikimedia.org/T248944 (10Marostegui)
[05:49:59] 10DBA: Decommission dbproxy1010.eqiad.wmnet - https://phabricator.wikimedia.org/T248944 (10Marostegui)
[05:50:02] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui)
[05:50:05] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui)
[05:53:43] 10DBA, 10Patch-For-Review: Decommission dbproxy1010.eqiad.wmnet - https://phabricator.wikimedia.org/T248944 (10Marostegui)
[05:55:23] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) >>! In T231520#6011841, @bd808 wrote: >>>! In T231520#6009325, @Marostegui wrote: >> I have also run some queries via Quarry and I have seen th...
[05:59:11] 10DBA, 10cloud-services-team (Kanban): Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313 (10Marostegui) 05Open→03Resolved a:03Marostegui `nova` and `nova_api` dropped (they had their tables renamed)
[06:50:28] there's a warning for backup1001 about disk space on /srv/databases
[07:34:01] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Marostegui)
[08:27:59] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui)
[08:28:14] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) p:05Triage→03Medium
[08:36:32] for whenever you have some time: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/584886/
[08:48:35] thanks!
[08:53:24] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2093.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003310853_marostegui_...
[09:09:57] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo)
[09:25:17] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2093.codfw.wmnet'] ` and were **ALL** successful.
[09:29:13] godog: will IRC suffice, or do you prefer a video meeting?
[09:29:44] jynus: irc sounds good to me! I think it'll work
[09:30:55] So I was checking to see if and how we could back up media files, for some meaning of that
[09:31:11] well, I was asked to have a look at it
[09:32:05] but I really have little knowledge of what is there; I want some high-level understanding of how that could work, sizes, and especially purchases for the next fiscal year
[09:32:46] I saw https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&fullscreen&panelId=9
[09:33:08] is that after or before redundancy? with or without compression?
[09:33:56] that is before redundancy, i.e. the size of containers from swift's point of view
[09:34:15] no compression in swift itself, so yeah nothing like it
[09:34:28] but I am guessing compression would help very little?
[09:34:43] if videos and jpgs mostly?
[09:34:45] yeah I think so too, mostly it is already compressed file formats anyways
[09:35:10] and growth rate has been mostly linear at the current rate, does it accelerate?
[09:35:13] the other consideration I think would be which containers to back up, e.g. do we want thumbnails?
[09:35:26] ah, is there a percentage in bytes?
[09:35:35] there are 2 views of that
[09:35:40] "backup of service"
[09:35:47] which could include thumbs
[09:35:52] and "backup of data"
[09:36:19] thinking of long-term survivability
[09:36:51] there are no concrete plans right now
[09:37:14] regarding dcs, swift is only on eqiad and codfw?
[09:37:15] I see, yeah you'll see on that dashboard above that in terms of number of files, backing up thumbs is a significant effort
[09:37:48] like two orders of magnitude more files
[09:37:55] both codfw and eqiad, active/active
[09:37:55] well, the how is something I do not want to engage with right now
[09:38:05] that is a project in itself
[09:38:08] :-D
[09:38:25] haha fair
[09:38:35] for now I want a better high-level understanding of where we are
[09:38:42] mostly for hw needs
[09:38:48] also if you have thoughts on it
[09:39:08] e.g. "no reason to perform backups because media is fully online"
[09:39:21] or "backups would only make sense in a new dc"
[09:39:45] maybe the how could be coordinated later with dumps generation
[09:40:19] is there a saner way to perform backups without the http api? would it make sense to do raw file backups?
[09:40:48] or would it make sense to have an additional, offline swift cluster?
[09:41:06] do you have any existing thoughts on that?
[09:41:30] if it helps, here is an overview of database backups:
[09:42:07] dumps in a separate format that is highly compatible, for long-term survivability
[09:42:29] and snapshots in the original format for "quick" service recovery
[09:42:46] the difference vs redundancy is that they are offline: the application cannot access the backups
[09:43:51] e.g. backups are not useful if they cannot be recovered properly :-D
[09:44:13] ok! thanks for the overview, in terms of my thoughts on it: we have quite a bit of redundancy, and of course backups of at least originals would be nice to have
[09:44:50] to save us from other disasters like application-level troubles that start deleting things
[09:44:52] but you think those should be in "dump" format, not a filesystem copy of openstack?
[09:45:07] aka the original files stored somehow?
[09:45:43] yeah I'm thinking the files pulled from the api, not from the filesystem
[09:45:44] maybe with a "magical method" we would move daily copies offline somewhere else?
[09:45:57] magic method to be developed :-D
[09:46:16] but those would take a long time to recover fully, right?
[09:46:53] yeah for sure, I don't have exact numbers ATM
[09:47:02] on how long that is
[09:47:12] sure, just to be clear, I am not asking for that, just in case you already knew it
[09:47:39] anything you could tell me would be useful for trying to provision such work (broadly)
[09:47:47] what about location?
[09:48:05] would you think we should go to a 3rd site, outside of eqiad and codfw?
[09:48:32] or an offline copy, e.g. close to the dumps, would already be very useful
[09:49:00] what about privacy, would those dumps be public, or would we have to do some separation between public and private media (does that exist?)
[09:49:27] yeah on the 3rd location personally I don't think so, codfw/eqiad seem diverse enough to me
[09:49:30] e.g. private wiki media, or deleted media
[09:49:49] is that easy with swift or would it require mw metadata?
[09:50:19] e.g. officewiki is private of course, so yeah mw has the best view on all of that
[09:51:18] ok, so what we are really thinking about here, correct me if I am wrong, is really a media dumping process
[09:51:48] what about resources, would current swift nodes support such extra reads of originals?
[09:52:09] or would the cluster need expansion/extra resources?
[09:52:14] are they on 1G networks?
[09:53:15] nope, all 10G, I think reading out the files shouldn't be a big deal, I mean we can do it as slow/fast as we want
[09:53:30] ok, that facilitates things
[09:54:14] I will do further research, but I am guessing originals will take around 400TB total?
[09:54:59] and +100GB every day
[09:55:18] more or less, we're at ~275TB for originals now
[09:55:19] does that number sound ok? I don't need exact figures
[09:55:34] ah, so thumbs take a lot more than I thought
[09:56:02] yeah so that's another whole story, chapter "we don't clean up thumbs ever (yet)"
[09:56:14] that's ok, not judging!
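The "backup of data" (originals) vs "backup of service" (originals plus thumbs) split discussed above could be estimated from a swift account listing, which reports per-container object counts and byte totals. A minimal sketch, assuming a listing in the shape swift's account API returns and made-up container names (the `-thumb` naming convention and all numbers below are illustrative, not real cluster data):

```python
# Sketch: split a swift account listing into "originals" vs "thumbs" totals,
# to size "backup of data" vs "backup of service" separately.
# Each entry mirrors what a swift GET-account listing contains
# (name/count/bytes per container); names and figures are hypothetical.

def summarize(containers):
    """Aggregate bytes and object counts per container class."""
    totals = {"originals": {"bytes": 0, "count": 0},
              "thumbs": {"bytes": 0, "count": 0}}
    for c in containers:
        klass = "thumbs" if "-thumb" in c["name"] else "originals"
        totals[klass]["bytes"] += c["bytes"]
        totals[klass]["count"] += c["count"]
    return totals

listing = [  # hypothetical numbers, roughly matching the conversation's scale
    {"name": "wiki-local-public", "count": 60_000_000, "bytes": 275 * 10**12},
    {"name": "wiki-local-thumb", "count": 4_000_000_000, "bytes": 120 * 10**12},
]

totals = summarize(listing)
for klass, t in totals.items():
    print(f"{klass}: {t['bytes'] / 1e12:.0f} TB in {t['count']:,} objects")
```

With real data the listing would come from the cluster (e.g. via python-swiftclient's account listing) rather than a literal; the point is that the two orders of magnitude more thumb files show up directly in the `count` column.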
[09:56:34] but anyways, daily growth for originals is in the 250GB range
[09:57:34] ok, so I think 550TB would be enough for 3 years
[09:57:50] originals only
[09:58:13] that's actually doable
[09:58:24] not sure how, but at least in theory
[09:59:05] we could partition by wiki (which would be mostly useless)
[09:59:13] and then by date or something
[09:59:40] yeah I'm not sure how either, agreed by wiki would be mostly commons anyways
[09:59:58] and for the initial dump we could raw-access metadata on the db for speed
[10:00:16] let's assume we can dump at 5Gbit/s
[10:00:36] one other consideration I think is whether, say, the internet archive would be interested in hosting a copy, of course of archival files, not individual files
[10:00:56] yeah, the public part yes
[10:01:06] we would have to solve how to send it
[10:01:25] so we still want to host it
[10:01:37] maybe restricted first so it can be mirrored
[10:01:44] then we open it more
[10:01:53] don't know, that is more "dumps" than backups
[10:02:22] will have to talk with ariel to see how that could be done
[10:02:57] what's "that" in this context?
[10:03:12] the sharing
[10:03:26] for backups I am interested in the taking and storing :-D
[10:03:33] and the recovery, of course
[10:03:51] if it can be shared at the same time with 3rd parties, cool too
[10:04:21] but my initial scope would be just disaster recovery
[10:05:03] ah ok, got it
[10:05:26] are all operations with swift done logically at the moment?
[10:05:41] e.g. when a host fails for any reason, or a new one is added
[10:06:01] does it join the cluster using swift only?
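The 550TB-for-3-years figure above checks out against the numbers given in the conversation (~275TB of originals today, ~250GB/day growth). A quick sanity check, assuming the growth stays linear as the log suggests:

```python
# Back-of-the-envelope check of the 3-year capacity estimate for originals.
# Inputs from the conversation: ~275 TB today, ~250 GB/day growth,
# assumed constant ("growth rate has been mostly linear").

current_tb = 275
daily_growth_tb = 0.250
years = 3

projected_tb = current_tb + daily_growth_tb * 365 * years
print(f"projected originals after {years} years: ~{projected_tb:.0f} TB")
# 275 + 0.25 * 1095 ≈ 549 TB, i.e. the quoted 550TB with essentially no headroom
```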
[10:07:20] if I understand the question correctly yes, a host doesn't exist from swift's POV unless it is in its ring files
[10:07:36] ok
[10:07:42] and we add/remove hosts from the ring files manually
[10:10:12] ok, with a constant download of 5Gbit/s, it would take only a few days (6) to do a "full copy"
[10:10:23] which probably would have to be slower
[10:10:32] but it is not 1 year, which would be impossible
[10:10:56] I think all this is doable, just very complicated :-D
[10:12:02] we would also need recovery functionality
[10:12:33] probably the most common scenario: 1 file was lost (for any reason), recover from backup
[10:13:21] "insert it back into swift, update metadata"
[10:13:39] but yes, this looks more like dumping than backups
[10:14:47] I think that is all, godog, thanks
[10:15:02] I just needed some extra info, none of this is planned yet
[10:15:18] but it is planning for planning :-D
[10:17:31] haha! indeed, like Office Space teaches
[10:17:34] yw jynus
[10:46:01] 10DBA: Create WikibaseQualityConstraints table on commons - https://phabricator.wikimedia.org/T248967 (10Cparle)
[10:48:39] 10DBA: Create WikibaseQualityConstraints table on commons - https://phabricator.wikimedia.org/T248967 (10Marostegui) @Cparle table creations are part of normal deployments, they require no DBA intervention: https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change However, we do need to kno...
[11:54:22] I just purged the x1 backups on dbprov[12]001
[11:56:42] cool
[11:56:43] thanks
[14:13:29] 10DBA: Create WikibaseQualityConstraints table on commons - https://phabricator.wikimedia.org/T248967 (10Marostegui) As per our chat on IRC - there will be no private data there, it is safe to let it replicate to the wiki replicas. Please comment here once the table is created so I can compress it. Thank you!
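The "few days (6)" estimate for a full copy at 5Gbit/s is easy to reproduce. Using the confirmed ~275TB of originals (decimal units, 1 TB = 10**12 bytes, are an assumption here) it comes out to roughly 5 days of sustained transfer; the earlier ~400TB guess lands closer to 7:

```python
# How long would a full copy of the originals take at a sustained 5 Gbit/s?
# Uses the ~275 TB figure quoted earlier in the conversation.

size_bytes = 275 * 10**12
rate_bits_per_s = 5 * 10**9

seconds = size_bytes * 8 / rate_bits_per_s
days = seconds / 86400
print(f"full copy at 5 Gbit/s: ~{days:.1f} days")
```

In practice the copy "probably would have to be slower", as noted above, so this is a lower bound rather than a schedule.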
[15:07:58] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) So, this wasn't as easy as it looked :-) db2093 (and tendril) uses tokudb, and the latest packages we are building are not compiled with tokudb. I have built a new package for 10.4.12 with tokudb enabled: ` cmake . -D...
[15:16:11] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) Definitely related: https://jira.mariadb.org/browse/MDEV-15034
[15:20:03] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) This comment is interesting ` Permalink mschorm Michal Schorm added a comment - 2019-05-10 03:54 Can be closed. I finally found a solution. The "WITH_JEMALLOC" is like the only option, that does not accept uppercase v...
[15:35:27] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) So it looks like adding this to the systemd unit works: ` [Service] Environment="LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2" ` ` mysql:root@localhost [(none)]> INSTALL SONAME 'ha_tokudb'; Query OK, 0 rows affected...
[16:01:01] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) From 10.2 (https://mariadb.com/kb/en/changes-improvements-in-mariadb-102/): ` MariaDB is no longer compiled with jemalloc TokuDB is now a separate package, not part of the server RPM (because TokuD...
[16:21:12] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) I have also tried to pass it via `[mysqld_safe]` using `malloc-lib=jemalloc` instead of using the systemd unit, but it doesn't make any difference. There's a bug (not exactly the same) filed with ma...
[16:37:27] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) Ok, so this works: ` root@db2093:/opt/wmf-mariadb104# ./bin/mysqld_safe --malloc-lib=/lib/x86_64-linux-gnu/libjemalloc.so.2 200331 16:35:36 mysqld_safe Adding '/lib/x86_64-linux-gnu/libjemalloc.so....
[16:39:06] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) It looks like for the hosts that will require TokuDB (tendril and the analytics dbstore) we need to find a way to use jemalloc on startup if we want to keep using TokuDB
[16:47:12] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) Related upstream links: https://github.com/jemalloc/jemalloc/issues/937 https://jira.percona.com/browse/PS-4393 https://jira.mariadb.org/browse/MDEV-15034