[02:59:22] 10DBA, 10MediaWiki-API, 10Patch-For-Review: API problem with usercontribs using `rev_user_text` rather than `rev_user`: Only use 'contributions' replica if querying by user ID - https://phabricator.wikimedia.org/T216656 (10Aklapper)
[06:03:30] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T216670 (10Marostegui) 05Open→03Resolved All good now, thank you! ` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) `
[06:34:45] 10Blocked-on-schema-change, 10MediaWiki-Database, 10MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), 10Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (10Marostegui) No, those are all the logged queries involving the logging table that were logged on sys, that's w...
[06:56:14] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[07:01:50] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (10Marostegui)
[08:36:16] 10Blocked-on-schema-change, 10DBA, 10AbuseFilter: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (10Marostegui)
[08:37:24] 10Blocked-on-schema-change, 10DBA, 10AbuseFilter: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295 (10Marostegui) I have run the ALTER on codfw DC for the s5 section: ` cebwiki dewiki enwikivoyage mgwiktionary shwiki srwiki ` Next week I am going to start with eqiad, I will go slow...
[08:52:34] <_joe_> hey dbas
[08:53:14] <_joe_> how would you feel if the parsercache were to be used as a more general k/v "semi-persistent" storage for other things by mediawiki?
[08:53:30] <_joe_> specifically, are the parser caches replicated cross-dc?
[08:54:18] what do you mean by replicated cross-dc?
[08:54:22] well, I would say we haven't provisioned for more things than the current pc
[08:54:39] but if you mean a separate set of pc servers, that wouldn't be an issue
[08:55:18] yes, pc can work write-write between multiple dcs, but with an eventual consistency model
[08:58:36] I have also created a more general key-value store prototype, however: https://github.com/jynus/pysessions
[09:03:08] <_joe_> the context of my question is https://phabricator.wikimedia.org/T214362
[09:03:48] <_joe_> jynus: we have a generic k-v interface already, which supports cassandra as a backend
[09:04:03] _joe_: but I think we'd need some other servers and not the current parsercache ones
[09:04:23] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10jcrespo) I am creating a snapshot right now for testing purposes, will run a dumping process next.
[09:04:28] <_joe_> marostegui: uhm, ok. So it's a logical problem more than a capacity problem
[09:04:49] <_joe_> as in, not doing multitenancy on those servers.
[09:08:35] so the parsercache is persisted memcached; if the goal is to avoid memcached, the parsercache is probably not the right place
[09:09:41] the other thing that doesn't match is the purge process: the parsercache, except in special circumstances, doesn't change, ever
[09:10:14] (e.g. a parsing of a revision never changes, only new revisions are created on top)
[09:10:46] here, however, they only want to store the latest version
[09:11:24] which means that usually a single item will be stored and invalidated multiple times, quickly
[09:11:58] So can we store it on mysql? Sure, but I would avoid the current parsercache servers
[09:12:33] it is difficult to make a recommendation without better, more accurate metrics
[09:13:27] (what is the "parsing time", the average ttl, throughput, invalidation, etc.)
[09:15:34] if storing 52 million rows, that could merit its own couple of servers for its own "wikidata quality constraints parsercaches"
[09:16:13] but again, difficult to say without knowing what latencies are expected
[09:16:34] or behaviour on mass-eviction
[09:20:35] <_joe_> I don't think mass-eviction is an issue
[09:21:20] <_joe_> I would've even suggested adding a couple of simple tables to the wikidata db (on s8?), but apparently the data in the value can be quite large
[09:21:39] <_joe_> I think they're overcomplicating the problem, frankly
[09:22:03] <_joe_> but, daniel kinzler made an interesting observation
[09:22:21] <_joe_> we're seeing the need for storing binary blobs linked to pages, not revisions
[09:22:21] wikidata cannot be underestimated
[09:22:25] <_joe_> multiple times
[09:22:38] parsercache is linked to revisions, for the most part
[09:23:07] blobs linked to pages are usually stored in the page_properties table
[09:23:12] <_joe_> so, we might want a k/v storage for pages. One obvious solution is cassandra+kask
[09:23:15] but of course, only small ones
[09:23:25] <_joe_> no I mean large-ish blobs
[09:25:56] I don't know what kask is
[09:27:32] <_joe_> the new service for sessions storage
[09:27:55] <_joe_> backed by cassandra, it's basically a lean k-v interface
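A minimal sketch of the "couple of simple tables" alternative being weighed here against cassandra+kask: a page-keyed (not revision-keyed) blob store on MariaDB. The table name, columns and the use of pymysql are illustrative assumptions, not an actual MediaWiki or pysessions schema.

import pymysql

# Hypothetical table: keyed by page (not revision), holding a large-ish blob.
DDL = """
CREATE TABLE IF NOT EXISTS page_blob_store (
    pbs_page_id INT UNSIGNED NOT NULL,
    pbs_key VARBINARY(64) NOT NULL,
    pbs_value MEDIUMBLOB NOT NULL,
    pbs_touched BINARY(14) NOT NULL,
    PRIMARY KEY (pbs_page_id, pbs_key)
) ENGINE=InnoDB
"""

def put(conn, page_id, key, value, touched):
    # Unlike a parsercache entry (immutable per revision), this row is
    # overwritten every time the latest value for the page is recomputed.
    with conn.cursor() as cur:
        cur.execute(
            "REPLACE INTO page_blob_store"
            " (pbs_page_id, pbs_key, pbs_value, pbs_touched)"
            " VALUES (%s, %s, %s, %s)",
            (page_id, key, value, touched),
        )
    conn.commit()

def get(conn, page_id, key):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT pbs_value FROM page_blob_store"
            " WHERE pbs_page_id = %s AND pbs_key = %s",
            (page_id, key),
        )
        row = cur.fetchone()
    return row[0] if row else None

if __name__ == '__main__':
    # Placeholder credentials and database name, for illustration only.
    conn = pymysql.connect(host='localhost', user='test', password='test', database='kvstore')
    with conn.cursor() as cur:
        cur.execute(DDL)
    put(conn, 123, b'wbqc.constraint-check', b'{"...": "..."}', b'20190222113338')
    print(get(conn, 123, b'wbqc.constraint-check'))

The point of contrast with the parsercache is in the REPLACE: each page's row is overwritten in place whenever the latest value is recomputed, which is exactly the store-and-invalidate-quickly pattern described above, and one reason to keep such a store off the current pc hosts.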
[10:12:16] taking a break while the backup is ongoing
[11:02:27] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[11:03:50] 2:30 hours of backup and going strong...
[11:05:24] on db1114?
[11:08:03] yes
[11:08:18] we may have to stop replication, as without it it takes only 1:30
[11:08:21] did you see my suggestion? maybe we can just leave it running all day long
[11:08:26] to see if it crashes
[11:08:41] plus, with replication, prepare takes a lot of time
[11:08:51] yes, I intended to do that later
[11:08:58] great
[11:09:04] and of course…it won't crash
[11:09:07] but I needed to test snapshotting first "quickly"
[11:10:30] of course!
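A minimal sketch, assuming a hypothetical Python wrapper, of the snapshot-and-prepare cycle being timed in this exchange, including the option of stopping replication for the duration as suggested above. The mariabackup path matches the wmf-mariadb101 package seen in a later debug line; the defaults file is a placeholder, and this is not the actual transfer.py / dump_section.py code.

import subprocess
import pymysql

MARIABACKUP = '/opt/wmf-mariadb101/bin/mariabackup'

def take_snapshot(target_dir, defaults_file, stop_replication=False):
    """Copy the datadir with mariabackup, then prepare it so it is consistent."""
    conn = pymysql.connect(read_default_file=defaults_file)
    try:
        if stop_replication:
            # Without replication applying writes during the copy, the
            # backup above took ~1:30 instead of ~2:40.
            with conn.cursor() as cur:
                cur.execute('STOP SLAVE')
        subprocess.check_call([
            MARIABACKUP,
            '--defaults-file=' + defaults_file,  # must be the first option
            '--backup',
            '--target-dir=' + target_dir,
        ])
    finally:
        if stop_replication:
            with conn.cursor() as cur:
                cur.execute('START SLAVE')
        conn.close()
    # Apply the redo log so the copy is usable; roughly 1:30 in the tests above.
    subprocess.check_call([MARIABACKUP, '--prepare', '--target-dir=' + target_dir])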
[11:12:52] going for a coffee, because there is no reason to keep looking at the counter
[11:13:05] :)
[11:13:27] when I return, if it keeps going, I will stop replication
[11:13:47] oh, and as I type that, it is finishing now
[11:13:57] 2:40, last steps
[11:14:17] done, will prepare and run mydumper later
[11:14:38] the good thing is that even if it takes a lot of time, it is consistent with the end of the backup
[11:55:08] 10DBA, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (10Marostegui)
[11:58:32] DEBUG:backup:['/opt/wmf-mariadb101/bin/mariabackup', '--prepare'
[11:58:54] I will now set up the mydumper on its own host, I guess
[11:59:26] yeah, I thought just dumping it + deleting the directory of the dumps + mydumper + deleting etc…all locally of course
[11:59:48] let me do it, as I need to test the changes to the logical dumps
[12:00:06] we need to talk, because I have changed some things and a)
[12:00:28] they are not intuitive, but they make sense if the details are known
[12:00:44] and b) you may give me more ideas on how to do those better
[12:01:42] yep
[12:01:42] feel free to do it yourself, it was a suggestion :)
[12:01:42] I am going to go for lunch
[12:08:06] I wonder if we should make our "test-s1" database a sort of "CI" / production testing for backups
[12:08:32] it is a bit of a silly question, as it was already set up for backup testing
[12:49:27] 10DBA, 10MediaWiki-API, 10Core Platform Team Kanban (Done with CPT), 10MW-1.33-notes (1.33.0-wmf.19; 2019-02-26): API problem with usercontribs using `rev_user_text` rather than `rev_user`: Only use 'contributions' replica if querying by user ID - https://phabricator.wikimedia.org/T216656 (10Anomie) 05Ope...
[13:48:54] jynus: yeah, it was bought for that, sort of :)
[13:50:18] So apparently the TLS configuration breaks mydumper https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492309/
[13:52:06] what?
[13:52:26] ah, locally?
[13:52:51] yeah
[13:53:14] and I cannot make mydumper ignore it; it reads my.cnf without any way to disable that
[13:53:36] at least with percona toolkit I can do F=/dev/null
[13:54:11] so I will just resign myself to having to do mysql --skip-ssl
[13:54:49] but we can enforce it on accounts (replication, remote administration, etc.)
[14:18:08] so the good news is that backups seem to be, for the most part, working
[14:18:34] it needs some extra testing to check all combinations with production-like sizes
[14:18:47] what do you mean?
[14:29:12] "things are looking good (light at the end of the tunnel), more work needed"
[14:29:35] I still need a single script from cumin to do all the stuff automatically
[14:29:45] and testing testing testing
[14:30:06] do you want to know the details, or better wait until monday?
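A rough outline, under stated assumptions, of the "single script to do all the stuff automatically" mentioned just above: chain copy+prepare, compression and rotation into a latest directory. The directory layout and the snapshot.<section>.<timestamp>.tar.gz naming follow files mentioned later in the log; the function itself is a guess, not the real WMF backup automation.

import os
import subprocess
from datetime import datetime

BACKUP_ROOT = '/srv/backups/snapshots'

def run_snapshot_pipeline(section, snapshot_step):
    """Chain the manual steps (copy+prepare, compress, rotate) so that one
    invocation, e.g. triggered from cumin, does everything."""
    timestamp = datetime.now().strftime('%Y-%m-%d--%H-%M-%S')
    name = 'snapshot.{}.{}'.format(section, timestamp)
    ongoing_dir = os.path.join(BACKUP_ROOT, 'ongoing')
    target_dir = os.path.join(ongoing_dir, name)
    os.makedirs(target_dir)

    # 1. Copy and prepare the data (see the take_snapshot sketch earlier).
    snapshot_step(target_dir)

    # 2. Compress everything into a single tarball (~1:30 in the tests above).
    tarball = os.path.join(ongoing_dir, name + '.tar.gz')
    subprocess.check_call(['tar', '-czf', tarball, '-C', ongoing_dir, name])

    # 3. Rotate the result into "latest", ready to be transferred or recovered.
    latest_dir = os.path.join(BACKUP_ROOT, 'latest')
    os.makedirs(latest_dir, exist_ok=True)
    final_path = os.path.join(latest_dir, name + '.tar.gz')
    os.rename(tarball, final_path)
    return final_path

# Example, binding the earlier take_snapshot sketch (placeholder defaults file):
# run_snapshot_pipeline('s1', lambda d: take_snapshot(d, '/etc/mysql/backup.cnf', stop_replication=True))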
[14:30:37] I've left an infinite loop of backups running on db1114
[14:31:03] to avoid forgetting it, I disabled puppet, which will alert after some time
[14:31:29] nice, let's see if it crashes
[14:31:39] we know it won't
[14:32:40] heh
[14:33:23] I will leave it over the weekend with such a "realistic" load
[14:33:32] we can do a mem stress test afterwards
[14:34:13] sure, or even leave it for a week XD
[14:41:02] oh, I caught a bug
[14:41:14] gathering statistics works
[14:41:46] but because dumps only generate files and not subdirs, snapshots only got the depth=0 objects
[14:42:09] e.g.: | mysql | 4096 | 2019-02-22 11:13:38 | NULL |
[14:42:50] I may need to reimplement gathering statistics to do recursive gathering, and it may need a new path-based structure on the database
[14:43:04] on the database?
[14:43:31] the backup_files table has a file_name
[14:43:37] it may need a path or something
[14:44:09] e.g. to differentiate eswiki/revision.ibd from metawiki/revision.ibd
[14:44:18] aaah I see
[14:44:28] but the basic idea works
[14:44:55] it is now finishing compressing, if it rotates properly
[14:45:01] I will call it for this week
[14:45:13] :)
[14:45:34] 40 more days of work left
[14:45:59] also I will need a way for transfer to decompress a tar.gz
[18:36:34] we finally have a "snapshot.s1.2019-02-22--08-31-30.tar.gz" on dbstore1001:/srv/backups/snapshots/latest
[18:36:38] and it only took 12 hours!
[18:37:15] we may have to stop replication to lower it to a manageable state
[18:37:24] and purchase an ssd for the compression step
[18:39:16] nice!!!
[18:39:28] That is good proof that we do need that pair of SSDs!
[18:40:08] on the other hand, all the work we do
[18:40:20] is to reduce the TTR to probably less than 1 hour
[18:40:37] yeah, pretty much the transfer somewhere else
[18:40:47] so the backup took 2:40
[18:40:55] but it was to an HD on dbstore1001
[18:41:11] and replication may double it
[18:41:38] prepare took 1:20
[18:41:41] *1:30
[18:41:55] and compression 1:30
[18:42:48] so what took the 12h then?
[18:43:05] I was doing other stuff in between, it is not fully scripted
[18:43:08] :-)
[18:44:03] transfer.py --type=xtrabackup and dump_section.py --only_posprocess are the only fully automated ones
[18:44:11] plus I was fixing bugs on the fly
[18:45:00] aaaaaah :)
[18:45:02] sorry!
[18:45:06] Misunderstood you
[18:45:19] for example, I used dump_section.py for the db1114 stress test
[18:45:37] and that meant fixing bugs I introduced
[18:46:20] also I met with bd808, who told me we are doing such a bad job he is going to hire an extra dba
[18:46:38] * bd808 grins
[18:46:45] yeah, that's the story
[18:46:56] Sorry, I wasn't asking you to tell me why it took so long; I understood it took 12h because some of the processes took long, and I was curious about which one
[18:46:59] that is what I got from the meeting :-P
[18:47:03] my fault
[18:47:25] although it was not fast
[18:47:43] 5 hours for a backup is not really that good
[18:48:08] replication + maybe increasing the buffer pool may help
[18:48:18] if only it was configurable!
[18:48:55] :p
[18:49:10] would you want to try it?
[18:49:25] so I force you to read my ugly code?
[18:50:55] yeah, I would like to try
[18:51:01] I am reviewing your code too! :p
[18:51:04] Let's chat next week
[18:51:06] I am off now!
[18:51:11] Have a nice weekend :)
[18:51:12] bye!
[18:51:14] same
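A hedged sketch of the recursive, path-based statistics gathering discussed above at 14:41-14:44, so that eswiki/revision.ibd and metawiki/revision.ibd become distinct entries instead of only the depth=0 directories. The function name and the (path, size, mtime) tuples are illustrative; the real backup_files schema may differ.

import os
from datetime import datetime

def gather_file_statistics(backup_dir):
    """Walk the whole backup tree and return one entry per file, keyed by
    its path relative to the backup root (e.g. 'eswiki/revision.ibd')."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(backup_dir):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, backup_dir)
            stat = os.stat(full_path)
            rows.append((rel_path, stat.st_size,
                         datetime.fromtimestamp(stat.st_mtime)))
    return rows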
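And a small sketch, using Python's standard tarfile module, of the "way for transfer to decompress a tar.gz" mentioned above; one possible approach, not how transfer.py actually behaves.

import tarfile

def decompress_snapshot(tarball, dest_dir):
    # Extract the gzip-compressed snapshot into dest_dir on the destination host.
    with tarfile.open(tarball, 'r:gz') as archive:
        archive.extractall(path=dest_dir)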