[01:55:49] 10DBA, 06Operations, 10Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3200523 (10mmodell)
[06:12:46] 10DBA, 07Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190#3200704 (10jcrespo) So we go for full production?
[07:33:48] so what are you into now?
[07:34:03] I am going to start recloning db1063
[07:34:05] so it is ready for next week
[07:34:13] there is an alter going on on db1065 (enwiki)
[07:34:13] that is s5?
[07:34:16] yes
[07:34:27] revision?
[07:34:31] yes
[07:34:39] Started it before I saw all the discussion
[07:34:42] about removing the index
[07:34:53] it is ok
[07:34:57] btw, I am tempted to decom db2062, give more weight to db2071 and start analyze over the weekend
[07:35:07] db2062 ?
[07:35:16] api s1 overloaded slaves
[07:35:16] you mean depool?
[07:35:22] sorry yes
[07:35:23] depool
[07:35:24] :)
[07:35:33] yeah, I was going to do one of those
[07:35:37] ah
[07:35:39] not sure which one
[07:35:39] good :)
[07:35:48] either db2062 or 69
[07:35:50] i would say
[07:36:21] anything big we should do before the weekend?
[07:36:39] analyze on eqiad? upgrades?
[07:36:49] Have you seen this
[07:36:51] (give me a sec)
[07:37:16] https://phabricator.wikimedia.org/T163351#3200731
[07:37:34] yeah, I commented it with tim
[07:37:40] ah sorry, missed it
[07:37:46] I think it is happening on reboot
[07:38:00] did you see my latest comment about it?
[07:38:25] https://phabricator.wikimedia.org/T163495#3200807
[07:39:21] 10DBA, 06Operations, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3200825 (10jcrespo)
[07:39:29] I would say we close T163351
[07:39:30] T163351: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351
[07:39:40] because the title technically is not true
[07:39:43] anymore
[07:39:45] oh
[07:39:45] i didn't see that one
[07:39:45] that is interesting
[07:40:16] and focus on T163495 and its children
[07:40:17] T163495: Mediawiki revision-related queries are failing with high rate for enwiki on codfw - https://phabricator.wikimedia.org/T163495
[07:40:35] creating a ticket per query if necessary
[07:41:15] yeah
[07:41:16] are you ok with that? (and copy all relevant info that is not yet fixed)
[07:41:20] sure sure
[07:41:31] makes sense
[07:41:35] basically leaving that for the incident
[07:41:44] and this new one for the fallout
[07:41:54] yes, sounds good to me :)
[07:41:58] better organized
[07:42:42] basically, adding those last comments to the specific ticket
[07:43:08] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190#3200827 (10jcrespo)
[07:43:32] 07Blocked-on-schema-change, 10DBA, 07Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190#553002 (10jcrespo) Tim confirmed ok to deploy.
[07:43:59] that is T116557
[07:44:00] T116557: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557
[07:44:31] nice, oct 2016
[07:44:34] 2015!
[07:44:42] there was another issue
[07:46:11] maybe I should resolve that one and create a new one to avoid confusion
[07:46:35] yeah, because it looks pretty similar
[07:46:38] at a first glance
[07:46:47] at least to someone like me without the context of what happened there
[07:47:42] I will leave it there
[07:47:50] It is the same scope, and probably related
[07:47:59] ok
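To make the problem being split off into T163495 concrete: the theory above is that s1 API slaves pick a bad plan (a filesort) for revision-related queries after a reboot, until their index statistics are rebuilt. The sketch below is only illustrative, not the exact query from the ticket; the rev_page value is the sample one that comes up later in this log, and the chosen plan varies per value.

```
-- On a depooled API slave (e.g. db2062): does the optimizer still use the index,
-- or does it fall back to a filesort after the restart?
EXPLAIN SELECT rev_id, rev_timestamp
FROM revision
WHERE rev_page = 17437940   -- sample value; the plan is value-dependent
ORDER BY rev_timestamp DESC
LIMIT 10;

-- If the plan shows "Using filesort", rebuild the statistics and re-check:
ANALYZE TABLE revision;
```

This is also why leaving an ANALYZE running over the weekend on the restarted hosts is attractive: it is cheap to schedule and directly refreshes the statistics the optimizer is misusing.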
[08:31:48] so I think I am going to focus on 3 things
[08:31:56] the ANALYZE
[08:32:07] the harder query killer
[08:32:19] and I cannot remember the third one
[08:32:34] yes, moving dbstore1001 topology
[08:32:59] I will do all at once
[08:33:12] oh, the harder query killer +10000
[08:33:22] I guess only testing and not deploying on friday no?
[08:33:34] no full deploy, of course
[08:33:40] :)
[08:33:41] but start with some advances
[08:34:24] should we leave an analyze running on db1080 (i think it was that one?) and db1065? from s1, the hosts that got restarted and now do filesort?
[08:34:39] ok
[08:34:54] I can do it, not asking you to!
[08:35:10] I need to see if https://gerrit.wikimedia.org/r/338996 is up to date
[08:35:23] let me see
[08:35:26] to move dbstore1001 to the new masters
[08:35:50] s4 is the only one we changed so far and it is correct, db1068 is the new s4 master
[08:36:10] yeah, the plan
[08:36:13] not the current state
[08:36:30] you are cloning 63 now, right?
[08:36:32] yep
[08:36:42] so we have it ready next week
[08:36:49] I mean, for next week
[08:37:14] so I need to check 61, 62 and 54
[08:37:29] to see if they are the planned ones and still the right ones
[08:38:22] db1054 for s2 is the planned one
[08:38:28] 62 for s7
[08:38:32] and 61 for s6
[08:38:40] 07Blocked-on-schema-change, 10DBA, 13Patch-For-Review: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563#3200904 (10jcrespo)
[08:38:42] 10DBA, 06Operations, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3200902 (10jcrespo)
[08:38:43] is the last plan we had on: https://phabricator.wikimedia.org/T162133
[08:38:47] 07Blocked-on-schema-change, 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, and 3 others: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067#3200903 (10jcrespo)
[08:38:50] 07Blocked-on-schema-change, 10DBA, 10Wikimedia-Site-requests, 06Wikisource, and 4 others: Schema change for page content language - https://phabricator.wikimedia.org/T69223#3200905 (10jcrespo)
[08:39:02] I will re-review it
[08:39:08] cool
[08:39:38] good news is that s4 we solved it the hard way- recloning
[08:39:47] and s2 and s6 are almost fully checked
[08:40:02] yeah, and we even did a switchover before next week, one task less
[08:40:38] I think I will change dbstore1001 master in advance
[08:40:47] that sounds like a great idea
[08:40:51] so I only have to deal with that once
[08:41:01] instead of the tedious process every time
[08:41:15] none of those new masters need recloning?
[08:41:36] s7 is the one that worries me
[08:41:52] I am fairly confident about s2 and s6
[08:42:17] if we want to be 100% sure we can clone s7 from the current master?
[08:42:18] or
[08:42:23] run pt-table-checksum
[08:42:31] I do not know
[08:42:44] why do you have doubts about s7?
[08:42:50] something rings a bell?
[08:42:52] it is more
[08:43:09] we have checked s2 and s6
[08:43:13] and cloned s4
[08:43:29] and I was mostly worried about s4 integrity
[08:43:54] s7 is just unknown
[08:43:58] we can always mysqldump from the master yes
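As an aside on the s7 verification options mentioned above (recloning, pt-table-checksum, or a mysqldump from the master), a lighter-weight spot check is to compare per-table checksums on the current master and the candidate once replication is caught up. This is only a sketch of the idea, not what was actually run: the table name is an example, and CHECKSUM TABLE scans the whole table, so on large tables pt-table-checksum remains the more practical tool.

```
-- Run the same statement on the current s7 master and on the candidate (db1062,
-- per the plan above), then compare the values. Small example table only.
CHECKSUM TABLE site_stats;
```

A mismatch would point at drift and justify the more expensive options (a full pt-table-checksum run or a reclone).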
[08:46:02] we can also consider doing master upgrades
[08:46:09] of kernel, I mean
[08:46:31] ah, yes
[08:46:38] that would make moritzm happy :)
[08:46:53] while they are SPOFs, in reality it takes less time to reboot them if they had issues
[09:00:50] are you touching db1051?
[09:00:56] nop
[09:01:10] something wrong with it?
[09:01:20] I am cleaning up old downtimes/disables I may have setup
[09:02:01] ah
[09:02:06] I should probably do the same
[09:02:33] I think I did all
[09:03:34] thanks!
[09:04:49] db1040 free space: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=db1040&var-network=eth0
[09:05:29] still around 100G
[09:05:42] yeah, no problem storage-wise
[09:05:49] more for the alerts
[09:06:01] yeah, I would downtime it for the weekend
[09:06:07] I think we can change the threshold in hiera
[09:06:14] if you are moving dbstore out of it it will have no slaves
[09:06:23] so nothing to worry much about other than the data itself
[09:06:33] and after checking it we can throw it away and one box less
[09:06:36] and 2u more!
[09:08:24] you can give a look at puppet/hieradata/hosts/es2001.yaml for other cases
[09:09:22] ah
[09:09:27] I didn't know that :)
[09:10:07] so that would be the place to also change the threshold for dbstore1001 and the timeouts for the slave checks?
[09:10:53] well, that works because it is programmed like that
[09:11:14] remember the socket thing I always talk about but never do
[09:11:28] it is because I had to add that functionality to puppet!
[09:11:38] haha
[09:13:00] See an example on: https://wikitech.wikimedia.org/wiki/Puppet_coding#A_working_example:_deployment_host
[09:14:58] will take a look
[09:16:17] I will upgrade kernel and mysql on db2062 so we have a baseline to start with
[09:16:28] cool, 10.0.30?
[09:16:31] yes
[09:16:39] ok!
[09:16:46] will you run the analyze too?
[09:16:49] if it is a problem with the latest version, we want to know
[09:16:55] yes
[09:16:57] I do not care if it is a problem with older versions
[09:17:06] or how to fix it there
[09:17:19] yes, upgrade, reboot and analyze
[09:17:22] great!
[09:17:33] so we start from a common state
[09:18:07] BTW
[09:18:19] when you upgrade mariadb, remember to run puppet afterwards
[09:18:38] as sometimes puppet overwrites package stuff
[09:18:48] ok!
[09:18:50] right now it is init.d, in the future it could be systemd
[09:19:13] I am not going to upgrade db1063, will leave it running 10.0.29
[09:19:18] if we did it right, not writing to the main config, we would not need it
[09:19:19] too soon to have a master with 10.0.30
[09:19:48] but not running puppet caused issues in the past (not on dbs)
[09:20:21] ok, will do then
[10:07:43] loooooooooooooooooooool
[10:09:01] http://s2.quickmeme.com/img/9b/9b3ec2629194a2642105cfc73c1d3abfc7e7f4468871d26725b9dbb53786ca94.jpg
[10:09:56] 10DBA, 10AbuseFilter, 06Performance-Team, 13Patch-For-Review: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557#3201018 (10jcrespo) This is madness: ``` root@db2062[enwiki]> EXPLAIN SELECT /* AFComputedVariable::{closure} */ rev_user_text FROM `revision` WHERE rev_pa...
[10:12:03] now what do I do, do I just repool db1063 and wait to see if it fixed itself automagically?
[10:12:18] *db2062
[10:20:00] mmmm
[10:20:09] can you restart mysql on db2062 and see if it happens?
[10:20:24] do you have more use cases?
[10:20:33] to test it was not just 1 case
[10:20:52] not really, but when I tested db1065 it wasn't filesorting and after a restart it did
[10:20:58] let's try the same thing with db2062 maybe
[10:21:32] it does for rev_page = '17437940'
[10:21:52] so it is highly dependent on the value and the internal state
[10:22:55] I will run the ANALYZE as expected
[10:23:30] https://phabricator.wikimedia.org/T163351#3200731
[10:23:36] this one filesorts too on db2062
[10:23:40] yes
[10:23:43] yeah, I would say let's run it
[10:29:23] 10DBA, 13Patch-For-Review, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3201055 (10jcrespo)
[11:33:42] 10DBA, 13Patch-For-Review, 07Performance: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns - https://phabricator.wikimedia.org/T160984#3201116 (10jcrespo) Testing https://gerrit.wikimedia.org/r/346559 on db2062.
[11:36:05] self-reminder: killer events and semisync setup on the new masters
[11:49:50] I am going to do some fake queries on db2062 for testing purposes
[11:49:55] ok
[11:49:59] with wikiuser / wikiadmin
[11:50:01] I am going to run ANALYZE TABLE on db1081
[11:50:14] to leave it running over the weekend
[11:50:15] just a heads up, it is me, but it may show up on the long-running queries monitoring
[11:51:17] ok :)
[11:52:03] it doesn't work :-(
[11:53:52] the killer?
[11:55:15] yeah, will be debugging
[11:55:59] 10DBA, 13Patch-For-Review: Reclone db1063 to become a slave in s5 - https://phabricator.wikimedia.org/T163109#3201137 (10Marostegui) The scope of this ticket is done. db1063 is now replicating as slave **(that means binlog type = MIXED) so that needs to be changed before making it the master.** SSL and GTID ar...
[11:56:07] 10DBA, 13Patch-For-Review: Reclone db1063 to become a slave in s5 - https://phabricator.wikimedia.org/T163109#3201138 (10Marostegui) 05Open>03Resolved a:03Marostegui
[11:56:09] 10DBA, 13Patch-For-Review, 05codfw-rollout: Analyze if we want to replace some masters in eqiad while it is not active - https://phabricator.wikimedia.org/T162133#3201140 (10Marostegui)
[11:56:39] 10DBA, 13Patch-For-Review, 05codfw-rollout: Analyze if we want to replace some masters in eqiad while it is not active - https://phabricator.wikimedia.org/T162133#3153482 (10Marostegui) db1063 is ready to be the master in s5. **binlog needs to be changed to STATEMENT**
[11:56:48] for some reason the event scheduler was disabled there
[11:57:13] on db2062? how come? we haven't touched it for months
[11:57:32] I know
[11:57:53] we need to check with cumin what is the state of that variable
[11:58:14] * volans at your service :)
[11:58:22] xddd
[11:58:27] what do you need?
[11:58:40] it works perfectly now
[11:58:47] volans: do you have a highlight for the word cumin? XD
[11:59:14] marostegui: you'll never know ;)
[12:05:43] it was on everywhere except on db2062 (already fixed)
[12:05:48] and on db1041
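The "it doesn't work" above turned out to be the event scheduler being off on db2062, which is why the killer event never fired. Checking and fixing that per host is small enough to run fleet-wide (with cumin or by hand); a minimal sketch, keeping in mind that if a server was started with event_scheduler=DISABLED in its config the variable cannot be flipped at runtime:

```
-- Is the scheduler running, and which events exist and are enabled?
SELECT @@GLOBAL.event_scheduler;
SELECT EVENT_SCHEMA, EVENT_NAME, STATUS FROM information_schema.EVENTS;

-- Turn it back on at runtime if it was found OFF (as on db2062 and db1041):
SET GLOBAL event_scheduler = ON;
```

The runtime change still has to be reflected in the puppet-managed config, otherwise the next restart reintroduces the problem.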
[12:08:04] jynus: Can you tell how much an index is used?
[12:08:23] Because we have at least one on wb_terms (yes, that table) which is probably totally useless
[12:08:40] It's only there in production, not in the code itself
[12:08:54] tell me the index and the server
[12:09:23] wb_terms_entity_type and wb_terms_type (two even)
[12:09:29] All s5 servers, wikidatawiki
[12:09:34] table?
[12:09:38] wb_terms
[12:11:18] Speaking of wb_terms, I was thinking about starting with: https://phabricator.wikimedia.org/T162539
[12:11:38] in eqiad so we can get it done by next week on production
[12:12:14] That would be really appreciated
[12:13:18] https://phabricator.wikimedia.org/P5306
[12:15:05] jynus: tl;dr they're entirely unused?
[12:16:01] https://phabricator.wikimedia.org/P5306#28415
[12:16:06] as far as I can see
[12:18:26] There are two values in wb_terms_entity_type and three in wb_terms_type
[12:18:41] Just to show /how/ useless these are
[12:18:50] I mean distinct values
[12:19:04] yeah
[12:19:15] I also see an unused wikidatawiki | wb_terms | tmp1
[12:19:25] The whole table is just a big mess honestly
[12:19:54] tmp1 should really be used on servers that get "regular" traffic
[12:21:26] Shall I open a ticket for dropping wb_terms_entity_type and wb_terms_type?
[12:21:44] They are not in our table definitions and I have no idea where they come from
[12:21:56] That would be nice, so we can alter it along with https://phabricator.wikimedia.org/T162539
[12:25:46] https://phabricator.wikimedia.org/T163548
[12:25:58] I'll try to see whether this was in Wikibase at some point
[12:26:06] I really wonder how we screwed up so badly
[12:26:47] no reason to ask that
[12:26:53] we can only fix things
[12:27:04] and always look forward
[12:27:05] 07Blocked-on-schema-change, 10Wikidata, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3201218 (10Marostegui) Let's do this alter along with dropping the indexes on: T163548
[12:27:23] no need to blame or ask who is guilty
[12:27:34] talking about guilty
[12:27:37] O:-)
[12:27:51] I filed
[12:28:26] can you be more specific on T163548?
[12:28:27] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" - https://phabricator.wikimedia.org/T163548
[12:28:49] which tables and servers (I do not need the names, just "all production wikidata servers")
[12:28:59] or something like that
[12:29:17] oh, I assume it has to be deployed first?
[12:29:52] anyway, I filed T163544, can you bump that in priority?
[12:29:53] T163544: Wikibase\TermSqlIndex::getMatchingTerms , Wikibase\TermSqlIndex::fetchTerms have bad performance after codfw failover - https://phabricator.wikimedia.org/T163544
[12:30:10] to have something done for the 3 May
[12:30:20] so we minimize wikidata disruption
[12:33:37] jynus: I looked into it… and it's not that easy(tm)
[12:33:43] I know
[12:33:49] i am not asking for a fix
[12:33:51] that's actually how I stumbled upon the indexes
[12:33:53] I am asking for a hack
[12:33:59] hoo: can you complete a bit the description of T163548 with jynus's suggestions?
[12:34:00] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" - https://phabricator.wikimedia.org/T163548
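Since both tickets are meant to be handled in one pass ("do this alter along with dropping the indexes"), the resulting per-wiki schema change would look roughly like the sketch below. It is only a sketch: the term_full_entity_id column definition is an assumption rather than the exact DDL from the Wikibase patch, and in practice this runs host by host with depools rather than as one statement on a master.

```
-- Combined change on wikidatawiki.wb_terms: add the new column from T162539
-- and drop the two unused keys from T163548 in the same alter.
ALTER TABLE wb_terms
  ADD COLUMN term_full_entity_id VARBINARY(32) DEFAULT NULL,  -- definition assumed
  DROP INDEX wb_terms_entity_type,
  DROP INDEX wb_terms_type;
```

Combining them means a table of this size only has to be altered once rather than in two separate long-running operations.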
[12:34:06] I said some options like
[12:34:13] reducing the limit size
[12:34:23] or reducing the amount of queries done
[12:34:45] I am not expecting miracles- but the only query that did worse than those 2
[12:34:58] marostegui: Updated it some more
[12:35:00] brought down flow, echo and translations :-)
[12:35:22] so I do not expect a fix, just something so we avoid the overload
[12:35:40] 10DBA, 10Wikidata, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" - https://phabricator.wikimedia.org/T163548#3201243 (10hoo)
[12:35:41] even if it is "we disable X for 1 day", whatever it takes
[12:36:00] Ugh… this is key Wikidata functionality
[12:36:17] It won't hit on a warmed up server
[12:36:27] but if the wb_terms indexes are not warm…
[12:37:56] hoo: to start dropping those indexes, do we need to wait for anything from your side? deploy stuff, change tables.sql or something else?
[12:38:14] marostegui: No, they've been removed from Wikibase since 2014
[12:38:19] and not used in any queries
[12:38:34] I am not saying to disable wikidata, it was just an example
[12:38:46] I am asking you, can you do something about it?
[12:38:48] hoo: great!
[12:39:01] technically, I see 300-second queries
[12:39:04] jynus: I know, no worries
[12:39:08] those should not be there
[12:39:09] but I can't think of anything
[12:39:12] true
[12:39:29] I see clear brute force options
[12:39:36] which are reducing batch sizes
[12:39:47] If that helps, we can sure do that
[12:39:55] that is the only thing I am asking
[12:40:03] *he says w/o checking the code first*
[12:40:06] to minimize the issues on the next failover
[12:40:13] just give it a look please
[12:40:20] Yeah, looking right now
[12:40:25] thanks for all your support
[12:40:26] but it is time-constrained (next failover is 3 May)
[12:41:35] I need to talk to our PM to see how dirty I can go here… but I'll find a way
[12:44:12] jynus: For the first query in https://phabricator.wikimedia.org/T163544 how many term_entity_id would be ok?
[12:44:22] Right now we don't batch there at all, but I can add that rather easily
[12:46:38] Would 10 be ok?
[12:49:08] I would say 9
[12:49:18] I think 10 is the limit where in some cases we start getting bad queries
[12:49:29] (approximate cardinality)
[12:49:38] Ok, that's fine
[12:49:41] in most cases
[12:49:58] I would assume it is a question of the total number of rows selected
[12:50:15] a similar thing to page- some pages have thousands of revisions
[12:50:24] and that slows down some queries
[13:00:26] jynus: Would it be ok for you if we were to only temporarily (for the DC switch) apply the fix for the second query in the ticket?
[13:00:41] These queries should be replaced with elastic soon-ish anyway
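What "batching with at most 9 term_entity_id values" means in practice is just splitting today's single large IN () list into several small statements. The query below is a simplified illustration with made-up ids, not the literal query from T163544; the real one built by Wikibase\TermSqlIndex selects more columns and conditions.

```
-- One batch of at most 9 entity ids instead of one huge IN () list:
SELECT term_entity_id, term_type, term_language, term_text
FROM wb_terms
WHERE term_entity_type = 'item'
  AND term_entity_id IN (101, 102, 103, 104, 105, 106, 107, 108, 109)
  AND term_type IN ('label', 'description');
-- The application then issues the same statement for the next 9 ids, so a cold
-- or badly-planned execution hurts one small batch instead of the whole request.
```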
[13:02:23] 10DBA, 13Patch-For-Review: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079#3201321 (10Marostegui) This was done before the switchover ``` root@db2019.codfw.wmnet[commonswiki]> show create table templatelinks; +---------------+--------------------------------------...
[13:02:32] 10DBA, 13Patch-For-Review: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079#3201323 (10Marostegui) 05Open>03Resolved a:03Marostegui
[13:08:18] i am doing something j*nus will love… cleaning up screen sessions on neodymium, I think I removed like 8 or 9
[13:12:45] 10DBA, 10Wikidata, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3201342 (10Marostegui) a:03Marostegui
[14:28:07] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3201504 (10Marostegui) db1067 is done: ``` root@neodymium:~# mysql -hdb1067 -A enwiki -e "show create table revision\G" --skip-ssl *************************** 1. row **...
[14:29:21] 10DBA, 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3201510 (10Papaul) @Marostegui Anytime Monday at 9:30am works for me.
[14:30:45] 10DBA, 06Operations, 10ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3201514 (10Marostegui) >>! In T163339#3201510, @Papaul wrote: > @Marostegui Anytime Monday at 9:30am works for me. Let's do it Monday at 9:30AM then.
[14:38:30] jynus: were you only playing with db2062 for the query testing, or with db2055 too?
[14:38:44] only 62 because it is depooled
[14:38:48] ok: https://tendril.wikimedia.org/activity?research=0&labsusers=0
[14:38:50] check the first query
[14:38:53] :|
[14:39:09] use https://tendril.wikimedia.org/activity?wikiadmin=0&research=0&labsusers=0
[14:39:26] wikiadmin queries only happen on vslow hosts
[14:39:32] (mostly)
[14:39:45] ah :)
[14:39:45] pheeew
[14:39:53] we let those run because only one runs per week
[14:39:57] and only on those hosts
[14:40:09] * marostegui breathes normally again
[14:40:11] and they have a hard limit of 14 hours
[14:40:14] *24
[14:40:27] they cannot be run by users
[14:41:11] it is not always like that, but simplifying, think of wikiuser as web requests and wikiadmin as internal requests
[14:41:32] got it
[14:41:34] thanks :)
[14:41:57] all those limits are on wikiuser because wikiadmin normally doesn't create problems
[14:42:10] however, sometimes someone runs problematic queries manually
[14:42:17] and we bark at them :-)
[14:42:20] haha
[14:42:30] from terbium or tin or equivalent
[14:43:41] I am now going to execute 5000 queries against db2062, don't freak out
[14:43:47] xdddddd
[14:44:10] I think I am going to log off in a bit to be honest
[14:44:16] you do well
[14:44:47] jynus: Is it ok for you if I quote parts of our conversation from ~1h ago in a ticket?
[14:44:54] sure
[14:45:03] Thanks
[14:45:11] as long as I do not sound too idiotic :-)
[14:45:45] for context- high contention on failover
[14:45:46] You sure don't
[14:46:27] worst offenders: language (temp. disabled), core (working on it) and wikidata
[14:46:51] so more people are working on their respective fields
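The limits discussed above (and the "harder query killer" being tested on db2062 earlier in the day) boil down to watching for interactive wikiuser queries that exceed a time budget and killing them. The deployed version is the event from the gerrit change linked earlier; the snippet below is only a hand-written approximation of the idea, not that code.

```
-- Find interactive (wikiuser) queries running longer than 300 seconds:
SELECT ID, USER, TIME, LEFT(INFO, 80) AS query_start
FROM information_schema.PROCESSLIST
WHERE USER = 'wikiuser'
  AND COMMAND = 'Query'
  AND TIME > 300;
-- A killer event then issues KILL QUERY <id> for each offending thread;
-- wikiadmin is left alone since its long jobs run only on the vslow hosts.
```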
[16:57:35] jynus: https://www.percona.com/blog/2017/04/17/the-mysqlpump-utility/
[16:58:03] the parallel processing is a nice one
[16:59:07] it requires stopping the slave
[16:59:24] I think it does nothing that mydumper or our scripts don't do
[17:00:41] what we need is Facebook's incremental backups strategy
[17:01:16] https://code.facebook.com/posts/1007323976059780/continuous-mysql-backup-validation-restoring-backups/
[17:03:20] yeah, that is the ideal situation
[17:04:02] it is not a lack of technology
[17:04:08] it is a matter of proper automation
[17:04:52] https://www.youtube.com/watch?v=Fe2oLZ4CWD4
[17:08:58] as dependencies: primary keys everywhere and a smaller revision table
[19:54:54] 10DBA, 06Operations, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3202432 (10jcrespo) > (I volunteer) Please don't.