[03:19:27] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230641 (10Papaul) [03:30:20] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230643 (10Papaul) [03:35:48] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230644 (10Papaul) a:05Papaul>03Marostegui @Marostegui it is all yours. The only thing left to do is to add both servers into racktables. I a... [05:06:30] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10MW-1.32-release-notes (WMF-deploy-2018-05-22 (1.32.0-wmf.5)), 10Patch-For-Review: Clean up indexes of wb_terms table - https://phabricator.wikimedia.org/T194273#4230677 (10Marostegui) Progress of the re-additon of the `tmp1` index: [] codf... [05:47:16] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230744 (10Marostegui) a:05Marostegui>03Papaul The RAID isn't done apparently: ``` root@db2095:~# megacli -LDPDInfo -aAll Adapter #0 Numbe... [08:16:12] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4227657 (10jcrespo) db1065 storage has been rebuilt and data cloned to it again. However, there is a smart error on the second disk (I think it is #1, as it starts from 0). We need a repla... [08:49:48] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4230856 (10Marostegui) [09:09:51] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414#946067 (10mark) [09:22:21] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414#4230912 (10daniel) This being out of whack eventually caused a site outage {T195520}. [10:51:34] o/ [10:51:52] o/ [10:55:07] you were talking earlier about some monitoring stuff? "it gives per-file, per query, per user, etc metrics" [10:55:15] is that just tendril stuff? or is there more stuff in other placeS? [10:55:46] No, it is performance_schema [10:56:00] But it is better to use sys schema, as it is easier to query [10:56:15] so on the s8 master I'll find all of these things? [10:56:28] *slaps self* heh, doesn't need to be the master... [10:56:35] yeah, better to use a slave [10:56:40] yup [10:56:52] On any slave, go to the sys schema [10:56:53] Don't know why for some reason my brain thought it would only be on the master for a second [10:57:04] And do a show tables [10:57:21] For instance: select * from statements_with_full_table_scans [10:58:57] hmm, the syshmm, the sys schema? [10:59:02] yeah [10:59:04] "use sys" [10:59:20] Not sure if you have access though [10:59:25] I guess "wikiadmin" can access this [10:59:31] ack, okay, yes I guess I don't [10:59:42] not sure if there are only other db creds I have [11:00:02] You probably don't have access I think [11:00:07] I will check later [11:00:21] okay! [11:05:52] marostegui: would data in there then for example be able to list me every query in the past 7 days hitting wb_terms? or not? [11:06:26] It aggregates per digest [11:06:34] "per digest" meaning? [11:06:44] let me give you an example [11:06:47] :) [11:09:46] https://phabricator.wikimedia.org/P7154 [11:10:15] the aggregations happen since the server got started, or the stats resetted [11:10:36] the default there is order by total latency [11:10:51] and you can see the whole queries on the non-human friendly tables [11:17:33] okay [11:18:55] and things to look out for there are the latencies [11:19:42] I notice that the top result there is the changes table not the term table? with max latency of 34.40 s ? [11:19:43] well, that depends, that is a tool- how to use it depends on what you need [11:19:50] okay [11:20:00] yes, we kill queries longer than 60 seconds [11:20:28] note that is aggregated by total latency [11:20:43] some things, like heartbeat happen very frequently [11:20:48] but take microseconds [11:20:54] okay [11:21:17] you can order by max/average, etc., check for full table scans, etc. [11:21:26] those are just views of performance_schema [11:21:33] okay [11:21:40] later you can view by io [11:21:43] by file [11:21:45] by query [11:21:51] by user, etc. [11:21:56] what are the chances of us getting some sort of access to this? [11:22:06] you should have access to that [11:22:20] the problem is we were told that was private data [11:22:32] and couldn't export it to public grafana [11:22:42] as that could expose both passwords/private data [11:22:46] The only way I know how to access the dbs currently doesn't reveal that schema to me [11:22:46] but also user usages [11:22:50] for small wikis [11:23:11] which is just for example addshore@tin:~$ sql wikidatawiki [11:24:39] "to retrieve from performance_schema tables, you must have the SELECT privilege." [11:25:08] one sec [11:25:13] mysql:wikiadmin@db1104 [wikidatawiki]> use sys; [11:25:13] ERROR 1044 (42000): Access denied for user 'wikiadmin'@'10.%' to database 'sys' [11:25:40] if you have deployment access it means you signed a NDA, right? [11:25:45] yup [11:26:02] access db2083 in a second [11:26:13] (not yet, or you will have to enter and log out) [11:26:23] but maybe you can check first you can go there? [11:26:26] ssh or? [11:26:41] sql command should be ok? [11:26:46] ack *tries* [11:27:01] we have ladsgroup access there, I think [11:27:06] gave [11:27:10] for testing [11:27:12] with which creds? [11:27:23] so maybe it was different account [11:27:31] let me try somwhere else [11:27:49] login into db2082 [11:27:53] can you? [11:28:04] I can sql to it [11:28:12] addshore@tin:~$ sql wikidatawiki -h db2083 [11:28:58] one sec [11:29:04] :) [11:29:37] it is not easy, there are several accounts different [11:29:42] depending on where you connect from [11:30:38] ooh, i see it now [11:30:48] log out and log in [11:31:05] tell me what you see, but do not paste it anywhere [11:31:26] so I can see both the sys and perf_schema schemas [11:31:32] as I said, that is considered as private as hashed passwords [11:31:40] yup [11:31:51] play around a bit with it, and if everyhing is ok [11:31:55] I will deploy to all hosts [11:32:11] querying performance schema is mostly lockless [11:32:11] while trying to run your example query I get: [11:32:19] ERROR 1356 (HY000): View 'sys.statement_analysis' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them [11:32:20] so it should be ok in most cases [11:32:26] oh [11:32:29] it may need more things [11:32:50] as sys also queries processlist, which requries process [11:33:03] I am not sure about that [11:33:14] but you should be able to query performance schema directl [11:33:16] let me see [11:33:58] can you do SELECT * FROM events_statements_summary_by_digest; [11:34:04] on performance_schema ? [11:34:09] yes [11:34:09] it is essentially the same thing [11:34:14] but uglier [11:34:30] there is probably not much ongoing there [11:34:54] because it didn't receive production traffic since it started [11:35:12] all those check are done on upgrade [11:35:37] the idea is to put all those things on a private grafana instance [11:35:42] to aggregate them [11:35:48] sound glorious [11:35:50] *sounds [11:36:00] but private data requires a separate installation [11:36:06] and it is real time [11:36:08] okay, so 52 tables, which one essentially gives me the same data as in statement_analysis? [11:36:17] yeah [11:36:24] sys source is public [11:36:32] sometimes it may join with processlist [11:36:40] which would require more grants [11:36:45] but I am not sure those are ok [11:36:51] for the admin user [11:37:10] we could try making an unprivileged view [11:37:11] try [11:37:47] the problem is that it joins with information_schema [11:37:49] for metadata [11:38:06] and you only have access for the objects you can read [11:38:17] so it requires a slightly different query [11:38:55] you can see the views at https://github.com/jynus/mysql-sys/tree/master/views/p_s [11:39:15] schema_index_statistics may be nice [11:39:28] it is mostly performance_schema.table_io_waits_summary_by_index_usage [11:40:02] and statements_with_errors_or_warnings [11:40:22] which is events_statements_summary_by_digest WHERE SUM_ERRORS > 0 OR SUM_WARNINGS > 0 [11:40:40] play a bit with it, you can run queries and see them appearing [11:41:15] I think most of sys that doesn't touch information_schema should work [11:41:22] okay [11:41:27] and those that don't work [11:41:38] we can do regular dumps to tendril or something [11:42:05] or if it is a one-time thing, I can get those for you [11:42:14] but let's be confident with the tool [11:42:17] for example [11:42:23] the unused indexes [11:42:30] may give a lot of false positives [11:42:38] on a depooled replica, does it make sense? [11:42:50] so, schema_index_statistics also doesnt directly work for me [11:42:58] jynus: aaah yes, that makes sense [11:43:11] or slow vs api, etc. [11:43:31] we dont have the aggregations tools yet [11:43:32] so performance_schema and sys are per db? not 1 schema replicated on the shard? [11:43:37] yeah [11:43:39] okay [11:43:47] that is indeed important to know :) [11:43:52] it it is both [11:43:58] per instance and per db [11:44:04] here db == schema, wiki [11:44:13] ack! [11:44:22] that is why grafana is nice [11:44:37] but we had to turn it off for privacy [11:44:38] so looking at https://github.com/jynus/mysql-sys/blob/master/views/p_s/schema_index_statistics.sql it looks like it should work for me? [11:44:55] what error do you get? [11:44:58] it doesnt touch information_schema or processlist [11:45:01] same error as before [11:45:28] could be some issue with sys.format_time ? [11:45:40] and what if you copy and paste from line 52? [11:45:51] aka running the select directly? [11:46:02] ERROR 1370 (42000): execute command denied to user 'wikiadmin'@'10.%' for routine 'sys.format_time' [11:46:06] :) [11:46:08] ah [11:46:16] ok, this could be easier [11:46:20] I can give you exec [11:46:22] but also [11:46:31] try selecting the x$ of the same name [11:46:51] x$ ? [11:46:53] SELECT * FROM x$schema_index_statistics [11:47:01] it is the machine-readable version [11:47:08] ack, *tries* [11:47:09] for non-humans [11:47:16] but i am human ;) [11:47:16] I think I can fix it [11:47:20] but try that [11:47:33] yup x$ works [11:47:42] it is exactly the same [11:47:43] i guess that just doesnt have pretty dates? [11:47:46] but without the formatting [11:47:48] yep [11:47:56] I will try to make the other work, however [11:48:05] okay, is there somewhere on wikitech that should have docs for this? I can help to populate them [11:48:06] let's create a ticket [11:48:24] and you can tell everthing that doesn't work at once [11:48:34] instead of just pinging me here every time [11:48:46] ack! [11:48:54] the problem is that some stored procedures/functions may insert [11:49:03] so I need to check each one [11:49:06] okay [11:49:45] there is also some tunning needed [11:49:59] p_s take 1 GB of memory with our config [11:50:09] but may need more memory on s3, with so many objects [11:50:14] that stats start to be discarded [11:50:41] I put the more interesting queries on my slides [11:51:27] unused indexes, query digests and duplicate indexes [11:51:43] did you saw my presentation? [11:53:23] addshore: log out and log in and try schema_index_statistics, without the x$ [11:53:38] ack [11:53:41] and the other one [11:53:46] it should work now, I think [11:54:12] yup, they both work [11:54:19] awesome! [11:54:22] so for human usage [11:54:25] the bare ones [11:54:33] for machines, the x$ ones [11:54:37] e.g. a script [11:54:54] amazing, this has been a very productive 20 mins! [11:54:56] play with it and when you are happy [11:55:07] ask on a ticket to deploy everywhere [11:55:07] is there somewhere we should document this on wikitech? [11:55:11] ack [11:55:26] i'll have a dig into this today then, and file a ticket for you later [11:55:27] I will do it when we deploy everywhere [11:55:39] but the usage is really not something wikimedia-related [11:55:43] it is bare mysql [11:55:58] like default mysql / mariadb [11:59:12] yup, okay [11:59:26] did you see my slides on query optimization? [11:59:28] they are long [11:59:41] but the section about profiling is just 5 slides long [12:02:19] https://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/158 [12:03:32] in particular https://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/166 [12:07:28] I did not, but I will look! [12:18:31] 10DBA, 10Performance: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers) - https://phabricator.wikimedia.org/T195578#4231280 (10jcrespo) [12:19:03] 10DBA, 10Performance: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers) - https://phabricator.wikimedia.org/T195578#4231290 (10jcrespo) p:05Triage>03Normal [12:52:13] I don't remember exactly how the DB groups in mediawiki work in relation to the actuall instances [12:52:51] would moving all queries for the wb_terms table to a single instance / pair of instances be a good or bad idea? [12:53:05] I guess if they would then overload them with connections, the rest of s8 would remain up? [12:53:23] but perhaps that would just lead them to be more likely overloaded? [12:54:23] in general, mediawiki works by sending queries to other hosts if no specific group are available [12:54:35] or at least it used to [13:27:12] so much data in those tables it is hard to understand it all :D [13:28:17] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231426 (10Papaul) [13:29:36] you don't need all for simple usage [13:29:43] https://youtu.be/Kr3B7wNJi0g?t=2m12s [13:30:06] oooh a video! :) [13:32:16] interesting part starts at https://youtu.be/Kr3B7wNJi0g?t=29m42s [13:45:10] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231455 (10Papaul) [13:51:03] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231464 (10Marostegui) db2094 looking good! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level... [13:57:44] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231513 (10Papaul) a:05Papaul>03Marostegui @Marostegui All done [14:00:28] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui) [14:00:30] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231526 (10Marostegui) 05Open>03Resolved db2095 looks good now! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name... [14:02:38] jynus: just one note, term_search_key has not been used at all and if you look into the offending queries it's not being read, but obviously replacing it with empty string can bring down database by *just* changing execution plan [14:02:52] of certain queries [14:03:12] does it make sense to you? [14:05:05] I've been saying that it could be not related to the index [14:05:20] and just a change in strantegy because table changes [14:05:46] or that the index (which may force an analyze) could be the trigger but not the cause [14:06:36] it makes way more sense than the index change because it explains the timeline [14:07:07] I don't know, but people has been focusing too much on the index [14:07:12] the index has been replaced for a while and if it supposed to bring down the db, it would've done way sooner [14:07:13] when there were many changes to the table [14:07:33] and any of them could have cause a sudden query plan change [14:08:21] yeah [14:13:58] https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20180524-wikidata&diff=1792770&oldid=1792738 [14:14:01] Added this [15:11:38] 10DBA, 10Operations: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231733 (10Marostegui) [15:13:05] 10DBA, 10Operations: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231749 (10Marostegui) p:05Triage>03Normal [15:31:14] 10DBA, 10Operations, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231814 (10Marostegui) [15:36:51] 10DBA, 10Operations, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231822 (10Marostegui) [15:47:19] 10DBA, 10Operations, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231733 (10jcrespo) [15:47:52] 10DBA: Differing database server ips and server_id numbers - https://phabricator.wikimedia.org/T195598#4231846 (10jcrespo) [15:48:32] jynus: db2042 or db2045? or both? [15:49:40] mmm [15:50:20] one is sure [15:50:24] the other wasn't checked [15:50:49] I think you said db2045 originally [15:50:54] let me check it [15:51:00] 45 is different [15:51:05] 42 I checked it now [15:51:09] and it is not [15:51:09] ok [15:51:27] 10DBA: Differing database server ips and server_id numbers - https://phabricator.wikimedia.org/T195598#4231887 (10Marostegui) [15:51:33] db2045 is an easy one, can be done next week [15:51:56] so it may actually be useful to setup such a check [15:52:11] also to test the lists are working [15:52:14] but needs more work [15:52:29] I created the parent task mostly to document it [15:52:32] yeah [15:56:33] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231916 (10Marostegui) [15:57:41] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui) [16:20:21] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#4231991 (10Marostegui) [16:20:26] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231989 (10Marostegui) 05stalled>03Open The definitive hardware for eqiad is now in place and replicating: db1124: s1, s3,... [16:21:13] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231995 (10Marostegui) [16:28:02] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4232018 (10Marostegui) Is the scope of this task finished? [16:28:42] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4232019 (10jcrespo) See T195444#4230827 [16:29:18] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4232025 (10Marostegui) Ah, thanks! missed it :) [18:51:31] 10Blocked-on-schema-change, 10DBA, 10ContentTranslation, 10ContentTranslation-Deployments, and 2 others: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407#4232426 (10Petar.petkovic) [20:45:35] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4232792 (10Marostegui) [20:45:50] 10DBA, 10Operations, 10ops-codfw: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4232804 (10Marostegui) p:05Triage>03Normal [21:18:06] hey [22:40:59] 10DBA, 10Operations, 10Regression: MySQL prompt missing trailing space on terbium - https://phabricator.wikimedia.org/T195636#4233115 (10Reedy) [22:41:03] 10DBA, 10Operations, 10Regression: MySQL prompt missing trailing space on terbium - https://phabricator.wikimedia.org/T195636#4233125 (10Reedy) p:05Triage>03Low