[02:59:22] 10DBA, 10MediaWiki-API, 10Patch-For-Review: API problem with usercontribs using `rev_user_text` rather than `rev_user`: Only use 'contributions' replica if querying by user ID - https://phabricator.wikimedia.org/T216656 (10Aklapper)
[06:03:30] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T216670 (10Marostegui) 05Open→03Resolved All good now, thank you! ` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) `
[06:34:45] 10Blocked-on-schema-change, 10MediaWiki-Database, 10MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), 10Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (10Marostegui) No, those are all the logged queries involving the logging table that were logged on sys, that's w...
[06:56:14] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[07:01:50] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (10Marostegui)
[08:36:16] 10Blocked-on-schema-change, 10DBA, 10AbuseFilter: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (10Marostegui)
[08:37:24] 10Blocked-on-schema-change, 10DBA, 10AbuseFilter: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (10Marostegui) I have run the ALTER on codfw DC for the s5 section: ` cebwiki dewiki enwikivoyage mgwiktionary shwiki srwiki ` Next week I am going to start with eqiad, I will go slow...
[08:52:34] <_joe_> hey dbas
[08:53:14] <_joe_> how would you feel if the parsercache were to be used as a more general k/v "semi-persistent" storage for other things by mediawiki?
[08:53:30] <_joe_> specifically, are the parser caches replicated cross-dc?
[08:54:18] what do you mean by replicated cross-dc?
[08:54:22] well, I would say we haven't provisioned for more things than the current pc
[08:54:39] but if you mean a separate set of pc servers, that wouldn't be an issue
[08:55:18] yes, pc can work write-write between multiple dcs, but with an eventual consistency model
[08:58:36] I have also created a more general key-value store prototype, however: https://github.com/jynus/pysessions
[09:03:08] <_joe_> the context of my question is https://phabricator.wikimedia.org/T214362
[09:03:48] <_joe_> jynus: we have a generic k-v interface already, which supports cassandra as a backend
[09:04:03] _joe_: but I think we'd need some other servers and not the current parsercache ones
[09:04:23] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10jcrespo) I am creating a snapshot right now for testing purposes, will run a dumping process next.
[09:04:28] <_joe_> marostegui: uhm, ok. So it's a logical problem more than a capacity problem
[09:04:49] <_joe_> as in, not doing multitenancy on those servers.
[09:08:35] so the parsercache is persisted memcached; if the goal is to avoid memcached, the parsercache is probably not the right place
[09:09:41] the other thing that doesn't match is the purge process: the parsercache, except in special circumstances, doesn't change, ever
[09:10:14] (e.g. a parsing of a revision never changes, only new revisions are created on top)
[09:10:46] here, however, they only want to store the latest version
[09:11:24] which means that usually a single item will be stored and invalidated multiple times, quickly
[09:11:58] So can we store it on mysql? Sure, but I would avoid the current parsercache servers
[09:12:33] it is difficult to make a recommendation without better, more accurate metrics
[09:13:27] (what is the "parsing time", the average ttl, throughput, invalidation, etc.)
[09:15:34] if storing 52 million rows, that could merit its own couple of servers for its own "wikidata quality constraints parsercaches"
[09:16:13] but again, difficult to say without knowing what latencies are expected
[09:16:34] or behaviour on mass-eviction
[09:20:35] <_joe_> I don't think mass-eviction is an issue
[09:21:20] <_joe_> I would've even suggested adding a couple of simple tables to the wikidata db (on s8?), but apparently the data in the value can be quite large
[09:21:39] <_joe_> I think they're overcomplicating the problem, frankly
[09:22:03] <_joe_> but, daniel kinzler made an interesting observation
[09:22:21] <_joe_> we're seeing the need for storing binary blobs linked to pages, not revisions
[09:22:21] wikidata cannot be underestimated
[09:22:25] <_joe_> multiple times
[09:22:38] parsercache is linked to revisions, for the most part
[09:23:07] blobs linked to pages are usually stored in the page_properties table
[09:23:12] <_joe_> so, we might want a k/v storage for pages. One obvious solution is cassandra+kask
[09:23:15] but of course, only small ones
[09:23:25] <_joe_> no I mean large-ish blobs
[09:25:56] I don't know what kask is
[09:27:32] <_joe_> the new service for sessions storage
[09:27:55] <_joe_> backed by cassandra, it's basically a lean k-v interface
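A minimal sketch of the "couple of simple tables" alternative being weighed here against cassandra+kask: a page-keyed (not revision-keyed) blob store on MariaDB. The table name, columns and the use of pymysql are illustrative assumptions, not an actual MediaWiki or pysessions schema.

import pymysql

# Hypothetical table: keyed by page (not revision), holding a large-ish blob.
DDL = """
CREATE TABLE IF NOT EXISTS page_blob_store (
    pbs_page_id INT UNSIGNED NOT NULL,
    pbs_key VARBINARY(64) NOT NULL,
    pbs_value MEDIUMBLOB NOT NULL,
    pbs_touched BINARY(14) NOT NULL,
    PRIMARY KEY (pbs_page_id, pbs_key)
) ENGINE=InnoDB
"""

def put(conn, page_id, key, value, touched):
    # Unlike a parsercache entry (immutable per revision), this row is
    # overwritten every time the latest value for the page is recomputed.
    with conn.cursor() as cur:
        cur.execute(
            "REPLACE INTO page_blob_store"
            " (pbs_page_id, pbs_key, pbs_value, pbs_touched)"
            " VALUES (%s, %s, %s, %s)",
            (page_id, key, value, touched),
        )
    conn.commit()

def get(conn, page_id, key):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT pbs_value FROM page_blob_store"
            " WHERE pbs_page_id = %s AND pbs_key = %s",
            (page_id, key),
        )
        row = cur.fetchone()
    return row[0] if row else None

if __name__ == '__main__':
    # Placeholder credentials and database name, for illustration only.
    conn = pymysql.connect(host='localhost', user='test', password='test', database='kvstore')
    with conn.cursor() as cur:
        cur.execute(DDL)
    put(conn, 123, b'wbqc.constraint-check', b'{"...": "..."}', b'20190222113338')
    print(get(conn, 123, b'wbqc.constraint-check'))

The point of contrast with the parsercache is in the REPLACE: each page's row is overwritten in place whenever the latest value is recomputed, which is exactly the store-and-invalidate-quickly pattern described above, and one reason to keep such a store off the current pc hosts.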
[10:12:16] taking a break while the backup is ongoing
[11:02:27] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[11:03:50] 2:30 hours of backup and going strong...
[11:05:24] on db1114?
[11:08:03] yes
[11:08:18] we may have to stop replication, as without it it takes only 1:30
[11:08:21] did you see my suggestion? maybe we can just leave it running all day long
[11:08:26] to see if it crashes
[11:08:41] plus, with replication, prepare takes a lot of time
[11:08:51] yes, I intended to do that later
[11:08:58] great
[11:09:04] and of course…it won't crash
[11:09:07] but I needed to test snapshotting first "quickly"
[11:10:30] of course!
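A minimal sketch, assuming a hypothetical Python wrapper, of the snapshot-and-prepare cycle being timed in this exchange, including the option of stopping replication for the duration as suggested above. The mariabackup path matches the wmf-mariadb101 package seen in a later debug line; the defaults file is a placeholder, and this is not the actual transfer.py / dump_section.py code.

import subprocess
import pymysql

MARIABACKUP = '/opt/wmf-mariadb101/bin/mariabackup'

def take_snapshot(target_dir, defaults_file, stop_replication=False):
    """Copy the datadir with mariabackup, then prepare it so it is consistent."""
    conn = pymysql.connect(read_default_file=defaults_file)
    try:
        if stop_replication:
            # Without replication applying writes during the copy, the
            # backup above took ~1:30 instead of ~2:40.
            with conn.cursor() as cur:
                cur.execute('STOP SLAVE')
        subprocess.check_call([
            MARIABACKUP,
            '--defaults-file=' + defaults_file,  # must be the first option
            '--backup',
            '--target-dir=' + target_dir,
        ])
    finally:
        if stop_replication:
            with conn.cursor() as cur:
                cur.execute('START SLAVE')
        conn.close()
    # Apply the redo log so the copy is usable; roughly 1:30 in the tests above.
    subprocess.check_call([MARIABACKUP, '--prepare', '--target-dir=' + target_dir])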
[11:12:52] going for a coffee, because there is no reason to keep looking at the counter
[11:13:05] :)
[11:13:27] when I return, if it keeps going, I will stop replication
[11:13:47] oh, and as I type that, it is finishing now
[11:13:57] 2:40, last steps
[11:14:17] done, will prepare and run mydumper later
[11:14:38] the good thing is that even if it takes a lot of time, it is consistent with the end of the backup
[11:55:08] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[11:58:32] DEBUG:backup:['/opt/wmf-mariadb101/bin/mariabackup', '--prepare'
[11:58:54] I will now set up the mydumper on its own host, I guess
[11:59:26] yeah, I thought just dumping it + deleting the directory of the dumps + mydumper + deleting etc…all locally of course
[11:59:48] let me do it, as I need to test the changes to the logical dumps
[12:00:06] we need to talk, because I have changed some things and a)
[12:00:28] they are not intuitive, but they make sense if the details are known
[12:00:44] and b) you may give me more ideas on how to do those better
[12:01:42] yep
[12:01:42] feel free to do it yourself, it was a suggestion :)
[12:01:42] I am going to go for lunch
[12:08:06] I wonder if we should make our "test-s1" database a sort of "CI" / production testing for backups
[12:08:32] it is a bit of a silly question, as it was already set up for backup testing
[12:49:27] 10DBA, 10MediaWiki-API, 10Core Platform Team Kanban (Done with CPT), 10MW-1.33-notes (1.33.0-wmf.19; 2019-02-26): API problem with usercontribs using `rev_user_text` rather than `rev_user`: Only use 'contributions' replica if querying by user ID - https://phabricator.wikimedia.org/T216656 (10Anomie) 05Ope...
[13:48:54] jynus: yeah, it was bought for that, sort of :)
[13:50:18] So apparently the TLS configuration breaks mydumper https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492309/
[13:52:06] what?
[13:52:26] ah, locally?
[13:52:51] yeah
[13:53:14] and I cannot make mydumper ignore it; it reads my.cnf without any way to disable that
[13:53:36] at least with percona toolkit I can do F=/dev/null
[13:54:11] so I will just resign myself to having to do mysql --skip-ssl
[13:54:49] but we can enforce it on accounts (replication, remote administration, etc.)
[14:18:08] so the good news is that backups seem to be, for the most part, working
[14:18:34] it needs some extra testing to check all combinations with production-like sizes
[14:18:47] what do you mean?
[14:29:12] "things are looking good (light at the end of the tunnel), more work needed"
[14:29:35] I still need a single script from cumin to do all the stuff automatically
[14:29:45] and testing testing testing
[14:30:06] do you want to know the details, or better wait until monday?
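A rough outline, under stated assumptions, of the "single script to do all the stuff automatically" mentioned just above: chain copy+prepare, compression and rotation into a latest directory. The directory layout and the snapshot.<section>.<timestamp>.tar.gz naming follow files mentioned later in the log; the function itself is a guess, not the real WMF backup automation.

import os
import subprocess
from datetime import datetime

BACKUP_ROOT = '/srv/backups/snapshots'

def run_snapshot_pipeline(section, snapshot_step):
    """Chain the manual steps (copy+prepare, compress, rotate) so that one
    invocation, e.g. triggered from cumin, does everything."""
    timestamp = datetime.now().strftime('%Y-%m-%d--%H-%M-%S')
    name = 'snapshot.{}.{}'.format(section, timestamp)
    ongoing_dir = os.path.join(BACKUP_ROOT, 'ongoing')
    target_dir = os.path.join(ongoing_dir, name)
    os.makedirs(target_dir)

    # 1. Copy and prepare the data (see the take_snapshot sketch earlier).
    snapshot_step(target_dir)

    # 2. Compress everything into a single tarball (~1:30 in the tests above).
    tarball = os.path.join(ongoing_dir, name + '.tar.gz')
    subprocess.check_call(['tar', '-czf', tarball, '-C', ongoing_dir, name])

    # 3. Rotate the result into "latest", ready to be transferred or recovered.
    latest_dir = os.path.join(BACKUP_ROOT, 'latest')
    os.makedirs(latest_dir, exist_ok=True)
    final_path = os.path.join(latest_dir, name + '.tar.gz')
    os.rename(tarball, final_path)
    return final_path

# Example, binding the earlier take_snapshot sketch (placeholder defaults file):
# run_snapshot_pipeline('s1', lambda d: take_snapshot(d, '/etc/mysql/backup.cnf', stop_replication=True))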
[14:30:37] I've left an infinite loop of backups running on db1114
[14:31:03] to avoid forgetting it, I disabled puppet, which will alert after some time
[14:31:29] nice, let's see if it crashes
[14:31:39] we know it won't
[14:32:40] heh
[14:33:23] I will leave it over the weekend with such a "realistic" load
[14:33:32] we can do a mem stress test afterwards
[14:34:13] sure, or even leave it for a week XD
[14:41:02] oh, I caught a bug
[14:41:14] gathering statistics works
[14:41:46] but because dumps only generate files and not subdirs, snapshots only got the depth=0 objects
[14:42:09] e.g.: | mysql | 4096 | 2019-02-22 11:13:38 | NULL |
[14:42:50] I may need to reimplement gathering statistics to do recursive gathering, and it may need a new path-based structure on the database
[14:43:04] on the database?
[14:43:31] the backup_files table has a file_name
[14:43:37] it may need a path or something
[14:44:09] e.g. to differentiate eswiki/revision.ibd from metawiki/revision.ibd
[14:44:18] aaah I see
[14:44:28] but the basic idea works
[14:44:55] it is now finishing compressing, if it rotates properly
[14:45:01] I will call it for this week
[14:45:13] :)
[14:45:34] 40 more days of work left
[14:45:59] also I will need a way for transfer to decompress a tar.gz
[18:36:34] we finally have a "snapshot.s1.2019-02-22--08-31-30.tar.gz" on dbstore1001:/srv/backups/snapshots/latest
[18:36:38] and it only took 12 hours!
[18:37:15] we may have to stop replication to lower it to a manageable state
[18:37:24] and purchase an ssd for the compression step
[18:39:16] nice!!!
[18:39:28] That is good proof that we do need that pair of SSDs!
[18:40:08] on the other hand, all the work we do
[18:40:20] is to reduce the TTR to probably less than 1 hour
[18:40:37] yeah, pretty much the transfer somewhere else
[18:40:47] so the backup took 2:40
[18:40:55] but it was to an HD on dbstore1001
[18:41:11] and replication may double it
[18:41:38] prepare took 1:20
[18:41:41] *1:30
[18:41:55] and compression 1:30
[18:42:48] so what took the 12h then?
[18:43:05] I was doing other stuff in between, it is not fully scripted
[18:43:08] :-)
[18:44:03] transfer.py --type=xtrabackup and dump_section.py --only_posprocess are the only fully automated ones
[18:44:11] plus I was fixing bugs on the fly
[18:45:00] aaaaaah :)
[18:45:02] sorry!
[18:45:06] Misunderstood you
[18:45:19] for example, I used dump_section.py for the db1114 stress test
[18:45:37] and that meant fixing bugs I introduced
[18:46:20] also I met with bd808, who told me we are doing such a bad job he is going to hire an extra dba
[18:46:38] * bd808 grins
[18:46:45] yeah, that's the story
[18:46:56] Sorry, I wasn't asking you to tell me why it took so long; I understood it took 12h because some of the processes took long, and I was curious about which one
[18:46:59] that is what I got from the meeting :-P
[18:47:03] my fault
[18:47:25] although it was not fast
[18:47:43] 5 hours for a backup is not really that good
[18:48:08] replication + maybe increasing the buffer pool may help
[18:48:18] if only it was configurable!
[18:48:55] :p
[18:49:10] would you want to try it?
[18:49:25] so I force you to read my ugly code?
[18:50:55] yeah, I would like to try
[18:51:01] I am reviewing your code too! :p
[18:51:04] Let's chat next week
[18:51:06] I am off now!
[18:51:11] Have a nice weekend :)
[18:51:12] bye!
[18:51:14] same
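A hedged sketch of the recursive, path-based statistics gathering discussed above at 14:41-14:44, so that eswiki/revision.ibd and metawiki/revision.ibd become distinct entries instead of only the depth=0 directories. The function name and the (path, size, mtime) tuples are illustrative; the real backup_files schema may differ.

import os
from datetime import datetime

def gather_file_statistics(backup_dir):
    """Walk the whole backup tree and return one entry per file, keyed by
    its path relative to the backup root (e.g. 'eswiki/revision.ibd')."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(backup_dir):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, backup_dir)
            stat = os.stat(full_path)
            rows.append((rel_path, stat.st_size,
                         datetime.fromtimestamp(stat.st_mtime)))
    return rows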
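And a small sketch, using Python's standard tarfile module, of the "way for transfer to decompress a tar.gz" mentioned above; one possible approach, not how transfer.py actually behaves.

import tarfile

def decompress_snapshot(tarball, dest_dir):
    # Extract the gzip-compressed snapshot into dest_dir on the destination host.
    with tarfile.open(tarball, 'r:gz') as archive:
        archive.extractall(path=dest_dir)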