[01:26:19] 10DBA, 10Wikidata: Wikidata: Possibility to add comments with more than 255 bytes, potential db issue - https://phabricator.wikimedia.org/T117272#3233942 (10Krinkle) [07:53:48] jynus: yesterday you said "queries can be seen on tendril filtering by db1031" could you walk me through how to get there? :) [07:54:01] got the the main page [07:54:04] *go [07:54:08] yup! [07:54:13] click reports [07:54:28] yup [07:54:35] Slow Queries [07:54:53] host:db1031 [07:55:06] delete user and schema [07:55:45] Great, and I guess set hours to 24? [07:55:51] yes [07:56:16] And db1031 is the master for that shard I assume? [07:56:17] you can click on hits to see the top [07:56:23] of individual queries [07:56:27] yes [07:56:32] that is another issue [07:56:40] db1029 is barely used [07:57:03] and it has more read weight [07:59:45] well, this is a bit off, selectLinkDetailsForPage which appears at the top of db1031's slow query list is marked in mediawiki to use a replica [08:00:33] assuming that is working [08:00:40] then you have a blockage on the master [08:00:57] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234220 (10Marostegui) Position where tempdb2001 slave was stopped at: https://phabricator.wikimedia.org/P5374 Backu... [08:01:06] and this confirms it: https://phabricator.wikimedia.org/T164407#3232480 [08:02:03] Hmm, but I'm confused here as to why a query made using a connection to DB_REPLICA ends up running on the master? [08:02:18] we have the master with read traffic [08:02:35] master is always part of the read pool [08:03:02] Interesting, now that I did not know. [08:03:35] in fact it is the only read replica we cannot depool [08:03:54] and if there is lag, it is the only replica that will be used (not really, but you get the idea) [08:04:20] could that lead to issues? [08:04:31] because if it is that, it would be great [08:05:14] is there only 1 master and 1 slave in this shard? [08:05:26] yes, on this datacenter [08:05:48] but this contradicts you: https://gerrit.wikimedia.org/r/#/c/351682/2/maintenance/recalculateCognateNormalizedHashes.php [08:06:27] I mean, I guess if the slave got lagged, and then the selects ended up hitting the master I guess the selects would get in the way of the writes and then eventually lock everything up [08:06:38] no [08:06:45] selects do not block writes on innodb [08:06:53] unless they are blocking selects [08:07:04] only writes block other writes [08:07:46] Okay, so there are only 3 selects that run in production [08:07:54] that is excluding the 2 maint secripts [08:07:57] *scripts [08:08:41] no, it is very likely that there are some selects on the master, it has weight 10 [08:09:00] the problem is that there was no overload on the slave, which has weight 20 [08:09:10] so you have a master-only problem [08:09:26] you are blocking your tables or rows [08:09:57] https://phabricator.wikimedia.org/T164407#3232488 [08:10:13] ^that shows queries that were blocked waiting [08:10:20] that means contention [08:10:24] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3234243 (10Marostegui) 05Open>03Resolved Thanks @Papaul all the hosts are looking good! I will mark this ticket as resolved then. [08:10:41] maybe your selects got serialized [08:10:49] do you do a lot on the same transaction?
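[Editor's note: a minimal sketch of how the contention described above can be inspected on the master itself, assuming the MariaDB 10.0 / MySQL 5.6-era information_schema lock tables (on MySQL 8.0 these moved to performance_schema); it is not the exact query the DBAs used.]
```
-- Show which transaction is waiting on which, and the statements involved
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;
```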
[08:10:55] Yup, just trying to figure out which one of the few queries is blocking everything [08:11:07] that is a good question, I cannot answer [08:11:30] do you think I would not tell you "fix this part" if I knew :-D [08:12:00] hehe, I wish it were that easy [08:12:00] and that is why I asked you to investigate when I saw bad patterns before the outage [08:12:26] No, no transactions / big transactions [08:13:38] I wrote to you all the info I had on the comments, maybe with the connection ids + timestamps you can create some patern [08:13:47] plus the code, of course [08:14:15] the table lock looks realy really bad [08:14:22] I do not know where that comes from [08:14:36] innodb never escalates row locks to table locks [08:16:49] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234251 (10jcrespo) "Decommissioned" in the software sense, I think they will want it to return to spares. Just clar... [08:17:07] so I would really start for what could be adquiring a table lock [08:19:05] I am not very sure about this connection manager [08:19:33] I have not looked at it, but it would seem as a recipe to maintain connections open for a long time [08:23:15] So, it shouldn't maintain connections any more than the mw LoadBalancer already does [08:23:30] which is already quite bad :-) [08:23:40] but anyway, offtopic [08:23:47] That's more AaronSchulz's domain ;) [08:28:21] note also the times of the selects [08:28:29] those were all happening at the same second [08:29:25] and they could be within a transaction [08:31:18] oh, I have more info for you [08:32:58] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234269 (10Marostegui) Thanks for the clarification :-) [08:34:20] 10DBA, 10Beta-Cluster-Infrastructure, 06Operations, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3234285 (10hashar) On deployment-tin I created: ``` lang=ini,name=/etc/mysql/conf.d/prompt.cnf [mysql] prompt = "\u@\h[... [08:36:35] I think I know the issue [08:38:20] oooh [08:38:42] see all those inserts and deletes taking 60 seconds? [08:38:48] I kill those [08:39:03] but killing takes some time- it has to revert changes to a propers state [08:39:24] when there are more writes to kill than the script can handle, it gets overloaded [08:39:28] also blocking selects [08:39:35] so forget about the selects [08:39:50] the cause are the writes taking 60 seconds [08:39:58] most likely blocking each other [08:40:04] okay! [08:40:21] there is an infrastructure way to fix those [08:40:41] which is, if you know when contention can happen [08:40:58] is set innodb_lock_wait_time (not sure abut the name) [08:41:05] so, why on earth are these writes taking more than 60 seconds? 
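[Editor's note: the "infrastructure way" described above is the innodb_lock_wait_timeout variable linked a few lines further down; a hedged sketch follows, with illustrative values rather than anything production actually uses.]
```
-- Make lock waits fail fast instead of stacking up behind a 60-second write
SET SESSION innodb_lock_wait_timeout = 5;   -- this connection only
SET GLOBAL  innodb_lock_wait_timeout = 10;  -- new connections; value is illustrative
-- By default only the statement that hit the timeout is rolled back;
-- innodb_rollback_on_timeout = ON rolls back the whole transaction instead.
-- The application still has to handle error 1205 (ER_LOCK_WAIT_TIMEOUT) and retry or abort.
```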
[08:41:05] so they fail in X seconds [08:41:18] blocking each other, most likely [08:41:36] of course the above is a "bad" fix [08:41:49] the idea is to avoid contention in the first place [08:42:15] but if you do not know how to- just [08:43:31] https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout [08:43:41] and of course, you have to handle the failure [08:43:58] so you were probably having some concurrency issues on writes [08:44:29] either by bad performance (indexes) or by trying to write the same row set at the same time [08:45:30] innodb_lock_wait_timeout is set to 50 seconds [08:46:17] however, it only kills the current statemt [08:46:27] doesn't do a full rollback [08:48:02] however, that doesn't match the metrics [08:48:12] So, there are only 2 single row inserts done in prod code, one on the cognate_titles table and one of the cognate_pages table and one single row delete from the cognate_pages table [08:48:12] the cause seems still to be a full table lock [08:48:36] see: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=19&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&from=1493816153771&to=1493845145411 [08:48:52] table locks waited are gone after disabling [08:49:24] lock waits are probably caused by other extensions (we know for sure cxtransaction does that, that is what caused the other issue) [08:49:59] addshore it doesn't matter how small the inserts are [08:50:11] what matters is how many rows you lock on doing that [08:50:20] (bad indexes) [08:50:35] and if you lock the same set many times [08:52:15] or do you write and keep the transaction open after that, while you do other stuff? [08:53:31] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3234342 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done: - Removed it from prometheus, added it as a spare on site.pp and removed it from dhcp list:... [08:55:00] So, there are no explicit transactions in the cognate code. [08:55:33] at most 2 inserts happen in a row on a connection (one to the cognate_titles table and one to the cognate_pages table) before the Connection is handed back to the Loadbalancer [08:56:32] I'm wondering if mediawiki has any sort of odd implicit transaction handling built in but I doubt it [08:59:46] oh, it does [09:00:12] not sure if implicit [09:00:17] but it has the mechanism [09:00:42] but the same write on the same transaction to 2 tables is how it should be done [09:03:08] So prior to https://gerrit.wikimedia.org/r/#/c/351682/2 there will also have been a select bundled in on the master prior to the 2 writes (https://gerrit.wikimedia.org/r/#/c/351682/2/src/CognateStore.php) [09:12:46] so I can do some things [09:12:57] I can put the master with 0 load [09:13:06] and I can stop the killing on the master [09:13:10] *s [09:13:25] that will not solve the original problem [09:13:31] but will allow to debug? [09:14:04] and things should be compatible with the previous configuration anyway [09:14:28] addshore you should EXPLAIN the writes [09:14:37] and see if they block a lot of rows [09:14:47] ack! 
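[Editor's note: a sketch of what "EXPLAIN the writes" could look like; MariaDB 10.0 / MySQL 5.6 and later accept EXPLAIN on data-modifying statements. The WHERE values are placeholders, not real Cognate data, and the column types are assumed.]
```
EXPLAIN DELETE FROM cognate_pages
WHERE cgpa_site = 123          -- placeholder key values
  AND cgpa_namespace = 0
  AND cgpa_title = 456;
-- The plan should use the unique (cgpa_title, cgpa_namespace, cgpa_site) index
-- with rows ~ 1; type=ALL or a large rows estimate means the statement scans,
-- and therefore locks, far more than the single row it changes.
```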
[09:14:52] and also see if a bot can cause large blockages [09:15:00] by writing to similar rows [09:15:32] I would bet it is something like inserting new rows blocks more than it should because it reads other rows [09:16:27] I think I'll also add a readonly mode to cognate, which would enable us to turn off all writes to the db from it instead of the whole extension next time [09:16:40] that would be lovely indeed [09:17:01] or some kind of rate-limiting [09:17:19] these things happen when someone does mass-edits [09:17:25] that way interwikilinks can remain on wiktionaries, even though they will slowly go out of date, but then readonly can be turned off and the maint script run to repopulate the table, might need 1 more maint script to cope with deletes that happened during read only times too [09:17:41] normal flows have been refined to allow for that [09:18:01] but you can be surprised how difficult it is for new workflows to scale [09:18:27] it is multiple clients doing multiple edits on multiple, distributed servers [09:19:59] the thing is- it is a chicken and egg situation [09:20:14] if I kill too much, it is bad, but if I kill too little, it is, too [09:20:31] the only way to solve that is to not use the master at all [09:20:35] Yup! [09:20:59] I am going to pool it at 0 weight [09:21:11] and then I will report all SELECTs I see as bugs [09:21:16] Right, I'm going to look at making the readonly patch in a minute, and also at some of the queries in more detail. [09:21:29] if you do a read only now [09:21:49] we can deploy it now - where now is when releng allows it, I just mean ok for me [09:21:49] Just have to quickly shovel some food in my mouth [09:24:02] We have already merged 3 patches around this issue on the extension, 2 using replicas for selection instead of masters, and 1 trying to release connections as early as possible [09:24:19] Would it make sense to debug with the old code first to try to ensure that we have caught the original issue? [09:25:10] not on production [09:25:25] it would be cool if it had its own server [09:25:28] but it doesn't [09:26:08] need a x2 ;) [09:26:20] *an [09:32:32] i do not like your writes at all [09:32:50] INSERT IGNORE INTO cognate_pages [09:33:12] unique index: cgpa_title,cgpa_namespace,cgpa_site [09:43:15] 10DBA: db1022 broke while changing topology on s6- evaluate if to fix or directly decomission - https://phabricator.wikimedia.org/T163778#3234542 (10Marostegui) [09:57:12] jynus: how can I go about explaining the queries? I can't use the sql.php maint script to get to the master per https://phabricator.wikimedia.org/T164469 [09:57:18] back in 15, meeting [10:24:18] jynus: as I can't explain, what is it that you don't like about the writes? [10:33:50] jynus: can we get an unclipped version of https://phabricator.wikimedia.org/P5376 too so that we can see the values in the queries? [11:13:41] addshore, no, for performance reasons (this log is on the servers themselves), we only store the first 100 bytes [11:14:10] I can take a look at the binlog for the ones that went through [11:14:18] but not the ones that were killed [11:17:55] We are still very confused as to what has caused the table level locks in the first place, as nothing in our code does that, and you say innodb does not escalate row locks to table locks.
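[Editor's note: a quick way to tell server-level table lock waits apart from InnoDB row lock waits while debugging something like this; these are standard status counters and commands, nothing Wikimedia-specific, and Table_locks_waited is presumably the counter behind the "table locks waited" Grafana panel linked earlier.]
```
SHOW GLOBAL STATUS LIKE 'Table_locks_waited';  -- server-level table lock waits (LOCK TABLES, DDL, non-InnoDB)
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';    -- row lock waits: current, total, and time spent waiting
SHOW ENGINE INNODB STATUS\G                    -- TRANSACTIONS and LATEST DETECTED DEADLOCK sections
```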
[11:18:56] 10DBA: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3234884 (10Marostegui) [11:19:56] do you retry failed inserts? [11:21:09] nope [11:22:12] the table-level locks could be me killing 60-second inserts [11:22:38] and be a red hearring [11:22:49] How long would those locks last? [11:23:05] I only saw those table -level locks before, not during the event [11:23:21] I guess the next question is what could cause those inserts / deletes to take 60 seconds in the first place [11:23:27] yep [11:24:12] thing is once queries are started to be killed, probably the query killer cannot keep up [11:29:52] on the other side, table-locks-waited ends at 17, and didn't appear on codfw [11:32:33] Is it possible something was messed up with the indexes in eqiad? [11:32:40] unlikely [11:33:12] innodb has this propery that if there is corruption, it automatically shuts down the server [11:33:54] I would like to see some load testing on an equaly sized table [11:34:02] with lots of insert and selects [11:34:14] Is it possible the query execution plan between codfw and eqiad masters was different? [11:34:20] yes [11:34:33] if you did lots of inserts recently [11:34:48] analyze is needed to get up-to-date statistics [11:35:36] but that is a quite farfettched assumption [11:35:43] without proof [11:36:42] marostegui, did you restart db1069:3313 or did it crash? [11:37:40] nope, i didn't touch it [11:37:41] it crashed [11:37:46] :( [11:37:49] right now? [11:37:58] 20 minutes ago [11:38:35] should we enable gtid on db1069? [11:38:47] not very important now [11:38:52] more like in the future [11:38:55] after reloads [11:39:11] mmmm, not sure if it would work [11:39:17] with the filtering and all that [11:40:11] there are some OOM on the dmesg, but don't know when they are from [11:40:49] jynus: maybe not that farfetched as the code had never before run on eqiad? [11:40:59] And when it started running on codfw the tables were empty [11:41:12] [Thu May 4 14:24:21 2017] Killed process 25801 (mysqld) total-vm:62602460kB, anon-rss:21584060kB, file-rss:0kB [11:41:30] addshore, ok, but can you point to which queries are causing that? [11:41:37] that is what I meant [11:41:55] you have to say "I think this query was being slow for X reason" [11:41:55] there it is then [11:42:13] it is in the future, but ok [11:42:28] oh shit [11:42:28] yes [11:42:28] XDD [11:42:50] well, at least -T man says so [11:42:55] Print human readable timestamps. The timestamp could be inaccurate! [11:42:56] XD [11:42:59] #justdmesgTthings [11:44:46] it's systemd's minority report [11:46:00] systemd pre-cog log [11:46:31] 10DBA, 06Operations, 10ops-eqiad: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3234912 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done: Removed it from pr... [11:51:07] 10DBA, 06Operations, 10ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3234935 (10Marostegui) @Ottomata remember to: `stop all slaves;` before shutting down MySQL (not a hard requirement, but just in case there is a transaction hanging,... 
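[Editor's note: a sketch of the shutdown sequence the task comment above refers to, assuming a multi-source MariaDB replica such as db1047; on a single-source replica it would simply be STOP SLAVE.]
```
STOP ALL SLAVES;           -- stop the IO and SQL threads of every replication connection
SHOW ALL SLAVES STATUS\G   -- confirm Slave_IO_Running / Slave_SQL_Running show "No"
-- then stop mysqld itself via the host's service manager
```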
[12:31:59] jynus: well it looks like all queries were the issue, 14:20 MW was set to write mode again, and 14:21 the cognate queries start appearing in the log being killed by the slow killer etc. 14:21 is also when the first table level locks appear [12:42:34] codfw even only has 1 servers for x1, hah [12:47:52] addshore, that is now [12:48:13] we just deleted its slave [12:48:23] ack [13:50:58] 10DBA, 06Operations: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235290 (10jcrespo) [13:52:22] 10DBA, 06Operations, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3235306 (10jcrespo) [13:52:24] 10DBA, 06Operations: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#3235308 (10jcrespo) [13:52:26] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3235307 (10jcrespo) [13:52:29] 10DBA, 10Datasets-General-or-Unknown, 06Release-Engineering-Team, 13Patch-For-Review, and 2 others: Automatize the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459#3235305 (10jcrespo) [13:52:53] 10DBA, 10Datasets-General-or-Unknown, 06Release-Engineering-Team, 13Patch-For-Review, and 2 others: Automatize the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459#1417808 (10jcrespo) [13:52:58] 10DBA, 10MediaWiki-Database, 05MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 3 others: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3235311 (10jcrespo) [13:53:17] 10DBA, 06Operations: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235313 (10jcrespo) [13:53:52] 10DBA, 06Operations, 10ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3235314 (10Ottomata) @cmjohnson, I'll be on vacation next week, and then at the analytics offsite the following. Coordinate with @elukey if you want to do it next we... [14:35:09] any idea what to do about db1015 [14:35:28] I was thinking on recovering it except cebwiki [14:49:51] 10DBA, 06Labs, 10Tool-Labs, 07Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#3235540 (10jcrespo) [14:55:58] I would try to recover as much as possible indeed to have some stats about how long it takes and so forth [15:02:57] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3235579 (10Marostegui) db2066 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2066.codfw.wmnet... [15:03:55] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3235584 (10Marostegui) db2066 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2066.codfw.wmnet wikidatawiki -... 
[15:05:56] jynus: ok so if I run 'maintain-views --databases wbwikimedia --replace-all --debug' on labsdb1003 atm it still throws '"CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'")' [15:06:01] but that is working when you run it? [15:06:11] but looks like he deleted the user first [15:06:15] (as per his ticket comment) [15:06:21] no, it worked on labsdb1001 [15:06:24] ah [15:06:30] which is what was broken [15:06:42] labsdb1003 was the outlier noted here https://phabricator.wikimedia.org/T164103#3232673 [15:06:45] IIRC labsdb1003 worked last time [15:06:47] or when I tried it [15:06:48] heh [15:06:57] whackamole my friend [15:07:09] no, I can do the same thing I did ont 1001 [15:07:18] that issue is so mad... [15:07:35] could it be some sort of user cache or something? [15:07:45] maybe we can try flushing the user table with read lock or soemthing? [15:07:49] before running the script? [15:07:54] "Failed to find table aft_article_feedback in database wbwikimedia" [15:07:55] jynus: when I ran through it it was def labsdb1003 that didn't work for me [15:08:05] that is not a permission issue [15:08:18] what causes you to see that and where jynus? [15:08:23] I see a different error so wondering [15:08:33] that is just, "don't ask me to create a view that doesn't exist" [15:08:45] it is debug, so I assume it is not an error [15:08:50] sure, that shouldn't happen, there is logic to skip non-existent tables [15:08:52] but atm [15:08:56] [root@labsdb1003 15:08 /root] [15:08:56] # maintain-views --databases wbwikimedia --replace-all --debug [15:08:58] fails w/ [15:09:03] on 1003? [15:09:06] pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'") [15:09:08] right [15:09:34] ok, looking at that [15:09:38] marostegui: yeah idk, open to trying whatever you think [15:09:50] I did try a flush privileges before I noted https://phabricator.wikimedia.org/T164103#3232673 [15:09:56] which I should have communicated [15:10:15] no worries chasemp i didn't have any hope to be honest XD [15:10:41] yep, now it works [15:10:42] I try not to get creative in teh hopes that something jumps out at you guys when you see the errors or timeline or something [15:10:48] jynus: maybe trying flushing the mysql user table with read lock? [15:10:53] no [15:10:54] jynus: flush something? [15:10:54] ah [15:10:56] no [15:11:06] "just drop the user and recreate it" [15:11:06] what caused it to work? [15:11:10] huh [15:11:16] * chasemp bangs head on wall [15:11:16] and literaly it is caused to work [15:11:22] it is not fix [15:11:30] so if we try the command again, it will fail again? [15:11:32] not because it shouldn't work [15:11:44] but because it should work in the first place [15:11:44] marostegui: seems to work thereafter for at least 4 tries [15:11:49] gosh... 
[15:11:51] maintain-views --databases wbwikimedia --replace-all --debug that is [15:11:59] let me try the rest [15:12:17] my theory [15:12:32] is that some cache gets old for dbs that are not yet created [15:12:51] so the permissions are non existent for new dbs [15:12:52] so on 1003 the view creation now worked for all 4 new wiki's [15:12:53] in other workds [15:13:11] we have to create the user when adding new dbs [15:13:15] that is a bug [15:13:21] * chasemp nods [15:13:27] and non existent on the new servers [15:13:33] because we do not use wildcards anymore [15:13:40] we use user-defined lists [15:13:48] initially for security reasons [15:14:01] but probably there is a bug with wildcards [15:14:14] at least on the 10.0.X series [15:14:36] I noted what fixed it and that I ran things https://phabricator.wikimedia.org/T164103#3235624 [15:14:59] thanks [15:15:06] and again, sorry for that hapepning [15:15:07] so just dropping + recreating the maintainviews user fixes the issue…but at some point some months ago the issue was also fixed with flush privileges if I recall correctly? [15:15:18] maybe [15:15:37] it reached a point of frustration that I was just trying things randomly [15:15:49] next time let's try flushing mysql.user table [15:15:53] and if it is cache related [15:15:53] don't know.. [15:15:56] who knows [15:16:01] if it is that I say [15:16:11] it should be the database_priv [15:16:21] *what [15:16:31] we can try both XDD [15:16:32] we can search the bug database [15:16:58] well we know it's at least reocurring and not a total one-off by this point so worth stepping through more slowly next round [15:17:07] I'll ping one you guys from the beginning I think next new wiki db [15:17:46] last time we were more lost [15:17:54] because there were other real-issues mixed [15:17:59] grants missing, etc. [15:18:04] definitely [15:18:25] even this time there was a "user operator" error [15:18:46] well, more like script/bad error message by mysql [15:19:04] hey it's better than teh days when we just didn't do it as no one knew how :D [15:19:07] but it was a real problem (trying to create a view for a database that didnt exist) [15:19:34] now we know more or less where it is coming from [17:00:28] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140543 (10RobH) 05stalled>03Resolved Yes, this is a new system, so it's reclaimed to spares, not decommissioned... [17:45:26] jynus: am I correct in thinking 100% requests to a 'replica' for x1 are currently going to db1029.eqiad.wmnet ? [17:45:53] see my comment [17:46:30] *looks* [17:47:55] wtf.... [17:48:09] I think it is too bad to be true [17:48:21] but I wanted to share it first [17:48:48] Okay, lets see if any of these selects are taking an unusual amount of time too [17:49:43] the only long running query I see on that db is SELECT * FROM `cognate_pages`, but that was 1 instance run by me [17:50:03] I see 39724 "SELECT `cgsi_interwiki` , `cgp ... 
( `cgpa_site` = `cgsi_key` )" [17:50:35] but none last run in the last 8 hours [17:51:12] I get no queries on db1029 [17:51:32] although I see a bunch of idle connections [17:51:51] not necessarily idle [17:52:01] could be fast queries [17:52:16] but I do not see them registering queries on the db monitoring [17:53:11] I got one SELECT /* Cognate\CognateStore::selectLinkDetailsForPage */ [17:53:29] the monitoring seems to not be working [17:53:35] something is wrong there [17:53:45] so my comment was a false alarm [17:56:26] "I got one SELECT /* Cognate\CognateStore::selectLinkDetailsForPage */" yup, thats essentially the only query running [17:56:39] so jynus queries are not just going to the master? or? [17:56:46] no [17:56:49] those are old queries [17:57:07] just there are not queries being properly monitored on the slave [17:57:16] okay [17:58:03] maybe it doesn't register queries lower than some threshold or something [17:58:27] most queries are taking less than 3ms [17:58:34] Per https://grafana.wikimedia.org/dashboard/db/mediawiki-cognate-repo [17:59:23] enable writes on 2 small wikis, see what happens [18:01:24] ack! [18:03:01] mysql tables in use 3, locked 0 [18:04:37] so in the past 30 mins there have only been 3 writes blocked by being in readonly [18:05:04] as the dashboard links counts the write attempts before the read only check [18:05:09] *linked [18:06:07] there are high rates of ps_check_lost_instrumentation [18:06:23] that would explain the lost intrumentation [18:06:27] * addshore does not know what ps_check_lost_instrumentation is [18:06:37] don't worry about it [18:07:52] with such low write rates (1 every 10 minutes) do you see an issue with turning readonly off for all wiktionaries for a period? or perhaps off for enwiktionary and frwiktionary (the two biggest) to make sure we see some writes? [18:08:14] firs small ones [18:08:41] it took only 2 minutes to create issues [18:08:56] so it wasn't a small ramp-up like cxtranslation [18:08:56] Ahh, I was looking at the average per minute rather than total, so average 3 writes per minute, 170 in the past 40 ish mins [18:10:55] I could turn it on for all 'small' wiktionaries, which would be 87 projects. Or choose a couple of 'medium' projects ? [18:11:29] up to you [18:12:46] the connections seem to be going all to the slave [18:12:52] so that is good [18:17:02] * DanielK_WMDE still doesn't see what caused things to fall over in the first place [18:23:35] I see some deadlocks [18:24:26] it is echo [18:24:28] i only see one write every few minutes on grafana. [18:24:34] ah [18:24:51] I am monitoring locking [18:25:12] do writes happen on every edit? [18:25:31] what is the model? [18:25:32] no. page creation, deletion, redirection, un-redirection. [18:25:47] and only for wiktionaries [18:25:48] synchronously or asunc? [18:25:54] so one every few minutes sounds about right [18:26:10] deletion is async, the others are sync [18:26:18] that's not our choice, that's just how the hooks work [18:26:26] I am just asking [18:26:29] not judging [18:26:33] sure, i'm just saying ;) [18:27:02] note that async stuf works by using the hooks to create jobs- but something that you already know [18:27:05] the time the problems started seems to be exactly the time the job runners were enabled [18:27:19] but according to you [18:27:26] only deletions are done like that [18:27:31] yes. well. it's actually a bit more round-about than that. 
but eventually, jobs are created [18:27:34] while I saw many inserts [18:27:39] not many deletions [18:27:50] yes. strange. [18:28:02] the reason I am asking this [18:28:15] is because it could happen that some massive bot job [18:28:25] could have been started and overloaded the extension [18:28:34] yea, but that would mean a massive number of page creations [18:28:46] during the DC switch [18:28:46] we had a massive number of connections [18:28:57] 10000 connections in 2 minutes? [18:29:06] that is a lot [18:29:08] i have no idea how that would happen [18:29:38] adam and me tried to come up with an explanation for an hour. [18:29:40] no idea [18:30:03] so we thought that it would be best to add more logging, and try carefully [18:30:08] another thing that is special about x1 is that it only has one slave [18:30:27] so semisync has to wait for it to perform the writes [18:30:35] I am seeing that now [18:30:40] but it is getting slow on echo [18:30:56] "slow" (think 100 ms top) [18:31:03] not on cognate [18:31:43] the good news is this: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1029&panelId=19&fullscreen&refresh=1m [18:31:56] but I think that is more of my doing [18:32:00] addshore: btw: having deletions done asynchronously is kind of nasty. if a page gets deleted by accident and then undeleted, we follow up on the delete *after* we record the page (re)creation. that causes corruption. [18:32:11] not related to the load issue. just a thought [18:32:27] DanielK_WMDE: ack, well, especially if they end up in the job queue [18:32:42] addshore: which they do these days, afaik [18:32:43] I see 0 table_lock waited [18:32:47] I guess we can just abuse that hook and do the delete in it and not queue any updates [18:33:03] addshore: can't we just do both asynchronously? [18:33:17] addshore: bah, just file a ticket for now, it's not urgent [18:33:26] DanielK_WMDE: ack! [18:33:32] jynus: i can't imagine what would even cause a table lock [18:34:38] one thing that was different is warmup [18:37:12] SELECT /* Cognate\CognateStore::selectLinkDetailsForPage */ [18:37:17] mysql tables in use 3, locked 0 [18:38:14] I'm not convinced we have actually got any writes through yet [18:38:25] we may have to include more projects [18:38:46] addshore: just create a page :D [18:38:48] otherwise the times of the write methods should start showing up on the new grafana dashboard [18:38:56] DanielK_WMDE: people will probably shout at me :P [18:39:08] then delete it again >:D [18:39:21] I don't think I can delete on any wiktionaries [18:39:28] don't you have a testwiktionary? [18:39:53] i don't think so [18:40:01] that is not a thing [18:40:08] so where exactly was this tested on production? [18:40:20] before enabling it on all wiktionaries? [18:41:10] it was tested on beta. i don't think it was tested on the production cluster. was it? [18:41:20] beta is not a test at all [18:41:27] to test it, we would have needed two test wiktionaries [18:41:31] completely different workflow [18:41:40] well, actually, we set up dedicated test instances on labs [18:41:46] it doesn't matter [18:41:46] but yes, different workflow [18:41:52] that is a bug [18:41:55] whatever [18:41:59] but not an excuse [18:42:16] testwiki and test2wiki are precisely for those things [18:42:58] but they run with wikipedia config, not wiktionary.
but i suppose it would have been possible anyway [18:43:17] yes, it was tested on beta and on labs instances but not in prod prod [18:43:20] I see an actionable there [18:43:24] for next time [18:43:37] and yes, not your fault because beta is horrible [18:43:44] but somethig to do [18:43:48] yes, i agree. [18:43:53] specialy for brand new extensions [18:44:18] it is not about you, consider our setup "broken" and highly aggresive [18:44:24] sounds like we need a testwiktionary and a test2wiktionary [18:44:28] and be very very pesimistic [18:44:51] addshore: testwiki and test2wiki would have worked, i think [18:44:59] people do all of crazy things on production [18:45:03] hehe [18:45:05] I see them all the time [18:45:15] DanielK_WMDE: not so sure as they get interwiki links from wikibase right? [18:45:21] the other day someone purged all pages of a large wiki [18:45:25] maybe a test3wiki and test4wiki ;) [18:45:34] the day before someone asked on api every single page of enwiki [18:45:40] DanielK_WMDE: hm... they may, yes... could be done per namespace?... [18:46:50] I understand that it cannot be done gradually because of the nature of the functionality [18:46:54] DanielK_WMDE: ack, could work, i do highly doubt that would have changed anything here though, perhaps it would have meant deployment would have waiting until the switchovers had happened, and then whatever caused this would have just happened at the next switchover! [18:46:57] it is all of none [18:47:19] but maybe something could have been done more slowly [18:47:32] addshore: i agree that it probably wouldn't have helped. but it's still a good idea to test on the prod cluster [18:47:38] jynus: as I don't think we have seen any writes yet from the small wiktionaries how do you feel about me adding the "medium" sized ones? [18:47:42] that should be part of the extension deployment checklist [18:49:10] addshore, you suppose this is switchover related [18:49:16] It's not in the one on https://www.mediawiki.org/wiki/Review_queue#Preparing_for_deployment [18:49:18] and yes, it happened at the same time [18:49:45] but same for cxtranslation, it showed at that time [18:50:17] but there was a bad logic there of retries, only magnified by the slower performance minutes after the change [18:51:02] addshore: well, that test would happen after deployment, before enabling the extension on all wikis... [18:51:14] may be with a comment on the talk page there [18:53:49] DanielK_WMDE: I have added looking into those docs to my todo list [18:54:02] jynus: how could you feel about trying readonly false for medium wikis? [18:54:53] ok to me [18:56:53] btw: Cognate wasn't enabled on all wiktionaries at once. iirc, it was first enabled on a handful of medium wiktionaries. right, addshore? [18:57:28] I can't remember off the top of my head :/ [18:58:22] "I have informed the WikiWoordenboek community and I will see if there is a fast way to purge all the pages involved." [18:58:31] ^that is where issues happen [18:59:00] you shouldn't have encourage people to purge all pages [18:59:42] now that will create issues on cognate and maybe other unrelated stuff [19:02:50] I wouldn't say I encouraged them to do anything, I merely stated the facts, I havn't seen the " I will see if there is a fast way to purge all the pages involved." 
comment yet but we can simply reply saying don't do that and this is why [19:03:30] Okay, now also on medium wiktionaries, let's see if any writes actually happen [19:05:26] Okay, I see a couple of inserts that should have happened now [19:07:33] 285ms for an insert isn't blazingly fast... [19:07:36] but still ok i guess [19:07:54] so could a bunch of deletes being enqueued [19:07:55] Well, that is 1 select and 1 or possible 2 inserts [19:08:02] grafana is confusing. wtf is an average mean? [19:08:08] and that block writes? [19:08:38] in theory, yes... [19:08:47] DanielK_WMDE: it depends which avg your looking at avg upper, is the average upper value per minute for example [19:08:49] do you know whether automatic transactions are flushed between jobs? [19:09:11] addshore: yea, but an average of means is just... odd ;) [19:09:13] I was expecting you to tell me :-) [19:09:37] jynus: no idea. /me looks at AaronSchulz [19:10:39] jynus: you now might see a couple of selects on master (known bug) fixed in https://gerrit.wikimedia.org/r/#/c/351781/ but this hasn't been backported [19:12:21] $lbFactory->commitMasterChanges( __METHOD__ ); // flush any JobQueueDB writes [19:12:35] That's in JobRunner::run [19:13:00] I *think* this flushes the transactions... oooohhhhh.... hm.... not if it's the wrong cluster... [19:13:02] yes, they are [19:13:14] AaronSchulz: hey aaron! [19:13:33] so, if a job writes to another wiki's DB - do the transactions still get flushed? [19:13:43] Even if that DB is on a different cluster? [19:13:47] That would be another LB, right? [19:14:36] that would mean the job runner could hold a bunch of locks on a table, accross the execution of several jobs [19:16:01] AaronSchulz: ? [19:25:31] they all should, yes [19:30:32] Yeh, lbFactory, not just Lb! [19:33:31] jynus: how would you feel about leaving Cognate in it's current state tonight then? [19:33:54] yeah [19:34:09] I see no table locks [19:34:25] We can leave the 'large' wikis for next week [19:34:29] which were happening when I first ping you [19:34:35] pinged [19:34:56] I do not know if we are going to get something in clear [19:35:09] I overreacted and changed the infrastructure [19:35:39] and maybe it is never going to happen with slave/master separation + your patches [19:36:51] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3236866 (10Ladsgroup) [19:39:22] so the patches we made aren't currently backported, only the readonly and the extra monitoring [19:39:47] so at the start of next week we could try switching the infrastructure changes back (perhaps), along with testing on some larger wiktionaries? [19:40:28] I would also suggest convincing someone to do a batch change on one of the enables wikis [19:40:34] and monitor performance [19:40:51] even that guy with the purges [19:43:58] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3236866 (10jcrespo) Thanks. One thing that I have been suggesting lately is not having unique (or primary keys) as... [19:45:12] I can probably trigger a bunch of controlled reparses
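[Editor's note: a minimal sketch of what "monitor performance" during such a controlled batch of reparses could mean at the database level: sample these standard counters on the x1 master before, during and after the run and compare the deltas. The counter selection is a suggestion, not an established procedure.]
```
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Threads_running', 'Innodb_row_lock_current_waits',
   'Innodb_row_lock_waits', 'Innodb_row_lock_time_max',
   'Table_locks_waited');
```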