[01:54:25] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10Papaul)
[01:54:34] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10Papaul) p:05Triage→03Medium
[02:02:01] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Andrew) ` mysql:root@localhost [labtestwiki]> alter table page change page_restrictions page_restrictions varbinary(255) DEFAULT ''; Query OK, 27971 rows affected (3.80 sec)...
[05:09:54] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Marostegui) Yes! Thank you :)
[05:10:02] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Marostegui)
[05:46:22] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui)
[05:49:16] 10DBA: Decommission dbproxy1010.eqiad.wmnet - https://phabricator.wikimedia.org/T248944 (10Marostegui)
[05:49:59] 10DBA: Decommission dbproxy1010.eqiad.wmnet - https://phabricator.wikimedia.org/T248944 (10Marostegui)
[05:50:02] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui)
[05:50:05] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui)
[05:53:43] 10DBA, 10Patch-For-Review: Decommission dbproxy1010.eqiad.wmnet - https://phabricator.wikimedia.org/T248944 (10Marostegui)
[05:55:23] 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) >>! In T231520#6011841, @bd808 wrote: >>>! In T231520#6009325, @Marostegui wrote: >> I have also run some queries via Quarry and I have seen th...
[05:59:11] 10DBA, 10cloud-services-team (Kanban): Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313 (10Marostegui) 05Open→03Resolved a:03Marostegui `nova` and `nova_api` dropped (they had their tables renamed)
[06:50:28] there's a warning for backup1001 about disk space on /srv/databases
[07:34:01] 10Blocked-on-schema-change, 10DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (10Marostegui)
[08:27:59] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui)
[08:28:14] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) p:05Triage→03Medium
[08:36:32] for whenever you have some time: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/584886/
[08:48:35] thanks!
[08:53:24] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2093.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003310853_marostegui_...
[09:09:57] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo)
[09:25:17] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2093.codfw.wmnet'] ` and were **ALL** successful.
[09:29:13] godog: will IRC suffice, or do you prefer a video meeting?
[09:29:44] jynus: irc sounds good to me! I think it'll work
[09:30:55] So I was checking to see if and how we could back up media files, for some meaning of that
[09:31:11] well, I was asked to have a look at it
[09:32:05] but I really have little knowledge of what is there; I want some high-level understanding of how that could work, sizes, and especially purchases for the next fiscal year
[09:32:46] I saw https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&fullscreen&panelId=9
[09:33:08] is that after or before redundancy? with or without compression?
[09:33:56] that is before redundancy, i.e. the size of containers from swift's point of view
[09:34:15] no compression in swift itself, so yeah nothing like it
[09:34:28] but I am guessing compression would help very little?
[09:34:43] if videos and jpgs mostly?
[09:34:45] yeah I think so too, mostly it is already compressed file formats anyways
[09:35:10] and growth rate has been mostly linear at the current rate, does it accelerate?
[09:35:13] the other consideration I think would be which containers to back up, e.g. do we want thumbnails?
[09:35:26] ah, is there a percentage in bytes?
[09:35:35] there are 2 views of that
[09:35:40] "backup of service"
[09:35:47] which could include thumbs
[09:35:52] and "backup of data"
[09:36:19] thinking of long-term survivability
[09:36:51] there are no concrete plans right now
[09:37:14] regarding dcs, swift is only on eqiad and codfw?
[09:37:15] I see, yeah you'll see on that dashboard above that in terms of number of files, backing up thumbs is a significant effort
[09:37:48] like two orders of magnitude more files
[09:37:55] both codfw and eqiad, active/active
[09:37:55] well, the how is something I do not want to engage with right now
[09:38:05] that is a project in itself
[09:38:08] :-D
[09:38:25] haha fair
[09:38:35] for now I want a better high-level understanding of where we are
[09:38:42] mostly for hw needs
[09:38:48] also if you have thoughts on it
[09:39:08] e.g. "no reason to perform backups because media is fully online"
[09:39:21] or "backups would only make sense in a new dc"
[09:39:45] maybe the how could be coordinated later with dumps generation
[09:40:19] is there a saner way to perform backups without the http api? would it make sense to do raw file backups?
[09:40:48] or would it make sense to have an additional, offline swift cluster?
[09:41:06] do you have any existing thoughts on that?
[09:41:30] if it helps, here is an overview of database backups:
[09:42:07] dumps in a separate format that is highly compatible, for long-term survivability
[09:42:29] and snapshots in the original format for "quick" service recovery
[09:42:46] the difference vs redundancy is that they are offline: the application cannot access the backups
[09:43:51] e.g. backups are not useful if they cannot be recovered properly :-D
[09:44:13] ok! thanks for the overview, in terms of my thoughts on it: we have quite a bit of redundancy, and of course backups of at least originals would be nice to have
[09:44:50] to save us from other disasters like application-level troubles that start deleting things
[09:44:52] but you think those should be in "dump" format, not a filesystem copy of openstack?
[09:45:07] aka the original files stored somehow?
[09:45:43] yeah I'm thinking the files pulled from the api, not from the filesystem
[09:45:44] maybe with a "magical method" we would move daily copies offline somewhere else?
[09:45:57] magic method to be developed :-D
[09:46:16] but those would take a long time to recover fully, right?
[09:46:53] yeah for sure, I don't have exact numbers ATM
[09:47:02] on how long that is
[09:47:12] sure, just to be clear, I am not asking for that, just in case you already knew it
[09:47:39] anything you could tell me would be useful for trying to provision such work (broadly)
[09:47:47] what about location?
[09:48:05] would you think we should go to a 3rd site, outside of eqiad and codfw?
[09:48:32] or an offline copy, e.g. close to the dumps, would already be very useful
[09:49:00] what about privacy, would those dumps be public, or would we have to do some separation between public and private media (does that exist?)
[09:49:27] yeah on the 3rd location personally I don't think so, codfw/eqiad seem diverse enough to me
[09:49:30] e.g. private wiki media, or deleted media
[09:49:49] is that easy with swift or would it require mw metadata?
[09:50:19] e.g. officewiki is private of course, so yeah mw has the best view on all of that
[09:51:18] ok, so what we are really thinking about here, correct me if I am wrong, is really a media dumping process
[09:51:48] what about resources, would current swift nodes support such extra reads of originals?
[09:52:09] or would the cluster need expansion/extra resources?
[09:52:14] are they on 1G networks?
[09:53:15] nope, all 10G, I think reading out the files shouldn't be a big deal, I mean we can do it as slow/fast as we want
[09:53:30] ok, that facilitates things
[09:54:14] I will do further research, but I am guessing originals will take around 400TB total?
[09:54:59] and +100GB every day
[09:55:18] more or less, we're at ~275TB for originals now
[09:55:19] does that number sound ok? I don't need exact figures
[09:55:34] ah, so thumbs take a lot more than I thought
[09:56:02] yeah so that's another whole story, chapter "we don't clean up thumbs ever (yet)"
[09:56:14] that's ok, not judging!
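The "backup of data" (originals) vs "backup of service" (originals plus thumbs) split discussed above could be estimated from a swift account listing, which reports per-container object counts and byte totals. A minimal sketch, assuming a listing in the shape swift's account API returns and made-up container names (the `-thumb` naming convention and all numbers below are illustrative, not real cluster data):

```python
# Sketch: split a swift account listing into "originals" vs "thumbs" totals,
# to size "backup of data" vs "backup of service" separately.
# Each entry mirrors what a swift GET-account listing contains
# (name/count/bytes per container); names and figures are hypothetical.

def summarize(containers):
    """Aggregate bytes and object counts per container class."""
    totals = {"originals": {"bytes": 0, "count": 0},
              "thumbs": {"bytes": 0, "count": 0}}
    for c in containers:
        klass = "thumbs" if "-thumb" in c["name"] else "originals"
        totals[klass]["bytes"] += c["bytes"]
        totals[klass]["count"] += c["count"]
    return totals

listing = [  # hypothetical numbers, roughly matching the conversation's scale
    {"name": "wiki-local-public", "count": 60_000_000, "bytes": 275 * 10**12},
    {"name": "wiki-local-thumb", "count": 4_000_000_000, "bytes": 120 * 10**12},
]

totals = summarize(listing)
for klass, t in totals.items():
    print(f"{klass}: {t['bytes'] / 1e12:.0f} TB in {t['count']:,} objects")
```

With real data the listing would come from the cluster (e.g. via python-swiftclient's account listing) rather than a literal; the point is that the two orders of magnitude more thumb files show up directly in the `count` column.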
[09:56:34] but anyways, daily growth for originals is in the 250GB range
[09:57:34] ok, so I think 550TB would be enough for 3 years
[09:57:50] originals only
[09:58:13] that's actually doable
[09:58:24] not sure how, but at least in theory
[09:59:05] we could partition by wiki (which would be mostly useless)
[09:59:13] and then by date or something
[09:59:40] yeah I'm not sure how either, agreed by wiki would be mostly commons anyways
[09:59:58] and for the initial dump we could raw-access metadata on the db for speed
[10:00:16] let's assume we can dump at 5Gbit/s
[10:00:36] one other consideration I think is whether, say, the internet archive would be interested in hosting a copy, of course of archival files, not individual files
[10:00:56] yeah, the public part yes
[10:01:06] we would have to solve how to send it
[10:01:25] so we still want to host it
[10:01:37] maybe restricted first so it can be mirrored
[10:01:44] then we open it more
[10:01:53] don't know, that is more "dumps" than backups
[10:02:22] will have to talk with ariel to see how that could be done
[10:02:57] what's "that" in this context?
[10:03:12] the sharing
[10:03:26] for backups I am interested in the taking and storing :-D
[10:03:33] and the recovery, of course
[10:03:51] if it can be shared at the same time with 3rd parties, cool too
[10:04:21] but my initial scope would be just disaster recovery
[10:05:03] ah ok, got it
[10:05:26] are all operations with swift done logically at the moment?
[10:05:41] e.g. when a host fails for any reason, or a new one is added
[10:06:01] does it join the cluster using swift only?
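The 550TB-for-3-years figure above checks out against the numbers given in the conversation (~275TB of originals today, ~250GB/day growth). A quick sanity check, assuming the growth stays linear as the log suggests:

```python
# Back-of-the-envelope check of the 3-year capacity estimate for originals.
# Inputs from the conversation: ~275 TB today, ~250 GB/day growth,
# assumed constant ("growth rate has been mostly linear").

current_tb = 275
daily_growth_tb = 0.250
years = 3

projected_tb = current_tb + daily_growth_tb * 365 * years
print(f"projected originals after {years} years: ~{projected_tb:.0f} TB")
# 275 + 0.25 * 1095 ≈ 549 TB, i.e. the quoted 550TB with essentially no headroom
```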
[10:07:20] if I understand the question correctly yes, a host doesn't exist from swift's POV unless it is in its ring files
[10:07:36] ok
[10:07:42] and we add/remove hosts from the ring files manually
[10:10:12] ok, with a constant download of 5Gbit/s, it would take only a few days (6) to do a "full copy"
[10:10:23] which probably would have to be slower
[10:10:32] but it is not 1 year, which would be impossible
[10:10:56] I think all this is doable, just very complicated :-D
[10:12:02] we would also need recovery functionality
[10:12:33] probably the most common scenario: 1 file was lost (for any reason), recover from backup
[10:13:21] "insert it back into swift, update metadata"
[10:13:39] but yes, this looks more like dumping than backups
[10:14:47] I think that is all, godog, thanks
[10:15:02] I just needed some extra info, none of this is planned yet
[10:15:18] but it is planning for planning :-D
[10:17:31] haha! indeed, like Office Space teaches
[10:17:34] yw jynus
[10:46:01] 10DBA: Create WikibaseQualityConstraints table on commons - https://phabricator.wikimedia.org/T248967 (10Cparle)
[10:48:39] 10DBA: Create WikibaseQualityConstraints table on commons - https://phabricator.wikimedia.org/T248967 (10Marostegui) @Cparle table creations are part of normal deployments, they require no DBA intervention: https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change However, we do need to kno...
[11:54:22] I just purged the x1 backups on dbprov[12]001
[11:56:42] cool
[11:56:43] thanks
[14:13:29] 10DBA: Create WikibaseQualityConstraints table on commons - https://phabricator.wikimedia.org/T248967 (10Marostegui) As per our chat on IRC - there will be no private data there, it is safe to let it replicate to the wiki replicas. Please comment here once the table is created so I can compress it. Thank you!
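The "few days (6)" estimate for a full copy at 5Gbit/s is easy to reproduce. Using the confirmed ~275TB of originals (decimal units, 1 TB = 10**12 bytes, are an assumption here) it comes out to roughly 5 days of sustained transfer; the earlier ~400TB guess lands closer to 7:

```python
# How long would a full copy of the originals take at a sustained 5 Gbit/s?
# Uses the ~275 TB figure quoted earlier in the conversation.

size_bytes = 275 * 10**12
rate_bits_per_s = 5 * 10**9

seconds = size_bytes * 8 / rate_bits_per_s
days = seconds / 86400
print(f"full copy at 5 Gbit/s: ~{days:.1f} days")
```

In practice the copy "probably would have to be slower", as noted above, so this is a lower bound rather than a schedule.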
[15:07:58] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) So, this wasn't as easy as it looked :-) db2093 (and tendril) uses tokudb, and the latest packages we are building are not compiled with tokudb. I have built a new package for 10.4.12 with tokudb enabled: ` cmake . -D...
[15:16:11] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) Definitely related: https://jira.mariadb.org/browse/MDEV-15034
[15:20:03] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) This comment is interesting ` Permalink mschorm Michal Schorm added a comment - 2019-05-10 03:54 Can be closed. I finally found a solution. The "WITH_JEMALLOC" is like the only option, that does not accept uppercase v...
[15:35:27] 10DBA: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) So it looks like adding this to the systemd unit works: ` [Service] Environment="LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2" ` ` mysql:root@localhost [(none)]> INSTALL SONAME 'ha_tokudb'; Query OK, 0 rows affected...
[16:01:01] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) From 10.2 (https://mariadb.com/kb/en/changes-improvements-in-mariadb-102/): ` MariaDB is no longer compiled with jemalloc TokuDB is now a separate package, not part of the server RPM (because TokuD...
[16:21:12] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) I have also tried to pass it via `[mysqld_safe]` using `malloc-lib=jemalloc` instead of using the systemd unit, but it doesn't make any difference. There's a bug (not exactly the same) filed with ma...
[16:37:27] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) Ok, so this works: ` root@db2093:/opt/wmf-mariadb104# ./bin/mysqld_safe --malloc-lib=/lib/x86_64-linux-gnu/libjemalloc.so.2 200331 16:35:36 mysqld_safe Adding '/lib/x86_64-linux-gnu/libjemalloc.so....
[16:39:06] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) It looks like for the hosts that will require TokuDB (tendril and the analytics dbstore) we need to find a way to use jemalloc on startup if we want to keep using TokuDB
[16:47:12] 10DBA, 10Patch-For-Review: Test tendril events on 10.4 - https://phabricator.wikimedia.org/T248957 (10Marostegui) Related upstream links: https://github.com/jemalloc/jemalloc/issues/937 https://jira.percona.com/browse/PS-4393 https://jira.mariadb.org/browse/MDEV-15034