[01:26:19] 10DBA, 10Wikidata: Wikidata: Possibility to add comments with more than 255 bytes, potential db issue - https://phabricator.wikimedia.org/T117272#3233942 (10Krinkle) [07:53:48] jynus: yesterday you said "queries can be seen on tendril filtering by db1031" could you walk me through how to get there? :) [07:54:01] got the the main page [07:54:04] *go [07:54:08] yup! [07:54:13] click reports [07:54:28] yup [07:54:35] Slow Queries [07:54:53] host:db1031 [07:55:06] delete user and schema [07:55:45] Great, and I guess set hours to 24? [07:55:51] yes [07:56:16] And db1031 is the master for that shard I assume? [07:56:17] you can click on hits to see the top [07:56:23] of individual queries [07:56:27] yes [07:56:32] that is another issue [07:56:40] db1029 is barely used [07:57:03] and it has more read weight [07:59:45] well, this is a bit off, selectLinkDetailsForPage which appears at the top of db1031's slow query list is marked in mediawiki to use a replica [08:00:33] assuming that is working [08:00:40] then you have a blockage on the master [08:00:57] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234220 (10Marostegui) Position where tempdb2001 slave was stopped at: https://phabricator.wikimedia.org/P5374 Backu... [08:01:06] and this confirms it: https://phabricator.wikimedia.org/T164407#3232480 [08:02:03] Hmm, but I'm confused here as to why a query made using a connection to DB_REPLICA ends up running on the master? [08:02:18] we have the master with read traffic [08:02:35] master is always part of the read pool [08:03:02] Interesting, now that I did not know. [08:03:35] in fact it is the only read replica we cannot depool [08:03:54] and if there is lag, it is the only replica that will be used (not really, but you get the idea) [08:04:20] could that lead to issues? [08:04:31] because if it is that, it would be great [08:05:14] is there only 1 master and 1 slave in this shard? [08:05:26] yes, on this datacenter [08:05:48] but this contradicts you: https://gerrit.wikimedia.org/r/#/c/351682/2/maintenance/recalculateCognateNormalizedHashes.php [08:06:27] I mean, I guess if the slave got lagged, and then the selects ended up hitting the master I guess the selects would get in the way of the writes and then eventually lock everything up [08:06:38] no [08:06:45] selects do not block writes on innodb [08:06:53] unless they are blocking selects [08:07:04] only writes block other writes [08:07:46] Okay, so there are only 3 selects that run in production [08:07:54] that is excluding the 2 maint secripts [08:07:57] *scripts [08:08:41] no, it is very likely that there are some selects on the master, it has weight 10 [08:09:00] the problem is that there was no overload on the slave, which has weight 20 [08:09:10] so you have a master-only problem [08:09:26] you are blocking your tables or rows [08:09:57] https://phabricator.wikimedia.org/T164407#3232488 [08:10:13] ^that shows queries that were blocked waiting [08:10:20] that means contention [08:10:24] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3234243 (10Marostegui) 05Open>03Resolved Thanks @Papaul all the hosts are looking good! I will mark this ticket as resolved then. [08:10:41] maybe your selects got serialized [08:10:49] do you do a lot on the same transaction?
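[Editor's note: a minimal sketch of how the contention described above can be inspected on the master itself, assuming the MariaDB 10.0 / MySQL 5.6-era information_schema lock tables (on MySQL 8.0 these moved to performance_schema); it is not the exact query the DBAs used.]
```
-- Show which transaction is waiting on which, and the statements involved
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;
```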
[08:10:55] Yup, just trying to figure out which one of the few queries is blocking everything [08:11:07] that is a good question, I cannot answer [08:11:30] do you think I would not tell you "fix this part" if I knew :-D [08:12:00] hehe, I wish it were that easy [08:12:00] and that is why I asked you to investigate when I saw bad patterns before the outage [08:12:26] No, no transactions / big transactions [08:13:38] I wrote to you all the info I had on the comments, maybe with the connection ids + timestamps you can create some patern [08:13:47] plus the code, of course [08:14:15] the table lock looks realy really bad [08:14:22] I do not know where that comes from [08:14:36] innodb never escalates row locks to table locks [08:16:49] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234251 (10jcrespo) "Decommissioned" in the software sense, I think they will want it to return to spares. Just clar... [08:17:07] so I would really start for what could be adquiring a table lock [08:19:05] I am not very sure about this connection manager [08:19:33] I have not looked at it, but it would seem as a recipe to maintain connections open for a long time [08:23:15] So, it shouldn't maintain connections any more than the mw LoadBalancer already does [08:23:30] which is already quite bad :-) [08:23:40] but anyway, offtopic [08:23:47] That's more AaronSchulz's domain ;) [08:28:21] note also the times of the selects [08:28:29] those were all happening at the same second [08:29:25] and they could be within a transaction [08:31:18] oh, I have more info for you [08:32:58] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3234269 (10Marostegui) Thanks for the clarification :-) [08:34:20] 10DBA, 10Beta-Cluster-Infrastructure, 06Operations, 06Release-Engineering-Team, 13Patch-For-Review: Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3234285 (10hashar) On deployment-tin I created: ``` lang=ini,name=/etc/mysql/conf.d/prompt.cnf [mysql] prompt = "\u@\h[... [08:36:35] I think I know the issue [08:38:20] oooh [08:38:42] see all those inserts and deletes taking 60 seconds? [08:38:48] I kill those [08:39:03] but killing takes some time- it has to revert changes to a propers state [08:39:24] when there are more writes to kill than the script can handle, it gets overloaded [08:39:28] also blocking selects [08:39:35] so forget about the selects [08:39:50] the cause are the writes taking 60 seconds [08:39:58] most likely blocking each other [08:40:04] okay! [08:40:21] there is an infrastructure way to fix those [08:40:41] which is, if you know when contention can happen [08:40:58] is set innodb_lock_wait_time (not sure abut the name) [08:41:05] so, why on earth are these writes taking more than 60 seconds? 
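[Editor's note: the "infrastructure way" described above is the innodb_lock_wait_timeout variable linked a few lines further down; a hedged sketch follows, with illustrative values rather than anything production actually uses.]
```
-- Make lock waits fail fast instead of stacking up behind a 60-second write
SET SESSION innodb_lock_wait_timeout = 5;   -- this connection only
SET GLOBAL  innodb_lock_wait_timeout = 10;  -- new connections; value is illustrative
-- By default only the statement that hit the timeout is rolled back;
-- innodb_rollback_on_timeout = ON rolls back the whole transaction instead.
-- The application still has to handle error 1205 (ER_LOCK_WAIT_TIMEOUT) and retry or abort.
```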
[08:41:05] so they fail in X seconds [08:41:18] blocking each other, most likely [08:41:36] of course the above is a "bad" fix [08:41:49] the idea is to avoid contention in the first place [08:42:15] but if you do not know how to- just [08:43:31] https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout [08:43:41] and of course, you have to handle the failure [08:43:58] so you were probably having some concurrency issues on writes [08:44:29] either by bad performance (indexes) or by trying to write the same row set at the same time [08:45:30] innodb_lock_wait_timeout is set to 50 seconds [08:46:17] however, it only kills the current statemt [08:46:27] doesn't do a full rollback [08:48:02] however, that doesn't match the metrics [08:48:12] So, there are only 2 single row inserts done in prod code, one on the cognate_titles table and one of the cognate_pages table and one single row delete from the cognate_pages table [08:48:12] the cause seems still to be a full table lock [08:48:36] see: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=19&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&from=1493816153771&to=1493845145411 [08:48:52] table locks waited are gone after disabling [08:49:24] lock waits are probably caused by other extensions (we know for sure cxtransaction does that, that is what caused the other issue) [08:49:59] addshore it doesn't matter how small the inserts are [08:50:11] what matters is how many rows you lock on doing that [08:50:20] (bad indexes) [08:50:35] and if you lock the same set many times [08:52:15] or do you write and keep the transaction open after that, while you do other stuff? [08:53:31] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3234342 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done: - Removed it from prometheus, added it as a spare on site.pp and removed it from dhcp list:... [08:55:00] So, there are no explicit transactions in the cognate code. [08:55:33] at most 2 inserts happen in a row on a connection (one to the cognate_titles table and one to the cognate_pages table) before the Connection is handed back to the Loadbalancer [08:56:32] I'm wondering if mediawiki has any sort of odd implicit transaction handling built in but I doubt it [08:59:46] oh, it does [09:00:12] not sure if implicit [09:00:17] but it has the mechanism [09:00:42] but the same write on the same transaction to 2 tables is how it should be done [09:03:08] So prior to https://gerrit.wikimedia.org/r/#/c/351682/2 there will also have been a select bundled in on the master prior to the 2 writes (https://gerrit.wikimedia.org/r/#/c/351682/2/src/CognateStore.php) [09:12:46] so I can do some things [09:12:57] I can put the master with 0 load [09:13:06] and I can stop the killing on the master [09:13:10] *s [09:13:25] that will not solve the original problem [09:13:31] but will allow to debug? [09:14:04] and things should be compatible with the previous configuration anyway [09:14:28] addshore you should EXPLAIN the writes [09:14:37] and see if they block a lot of rows [09:14:47] ack! 
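[Editor's note: a sketch of what "EXPLAIN the writes" could look like; MariaDB 10.0 / MySQL 5.6 and later accept EXPLAIN on data-modifying statements. The WHERE values are placeholders, not real Cognate data, and the column types are assumed.]
```
EXPLAIN DELETE FROM cognate_pages
WHERE cgpa_site = 123          -- placeholder key values
  AND cgpa_namespace = 0
  AND cgpa_title = 456;
-- The plan should use the unique (cgpa_title, cgpa_namespace, cgpa_site) index
-- with rows ~ 1; type=ALL or a large rows estimate means the statement scans,
-- and therefore locks, far more than the single row it changes.
```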
[09:14:52] and also see if a bot can cause large blockages [09:15:00] by writing to similar rows [09:15:32] I would bet it is something like inserting new rows blocks more than it should because it reads other rows [09:16:27] I think I'll also add a readonly mode to cognate, which would enable us to turn off all writes to the db from it instead of the whole extension next time [09:16:40] that would be lovely indeed [09:17:01] or some kind of rate-limiting [09:17:19] these things happen when someone does mass-edits [09:17:25] that way interwikilinks can remain on wiktionaries, even though they will slowly go out of date, but then readonly can be turned off and the maint script run to repopulate the table, might need 1 more maint script to cope with deletes that happened during read only times too [09:17:41] normal flows have been refined to allow for that [09:18:01] but you can be surprised how difficult it is for new workflows to scale [09:18:27] it is multiple clients doing multiple edits on multiple, distributed servers [09:19:59] the thing is- it is a chicken and egg situation [09:20:14] if I kill too much, it is bad, but if I kill too little, it is, too [09:20:31] the only way to solve that is to not use the master at all [09:20:35] Yup! [09:20:59] I am going to pool it at 0 weight [09:21:11] and then I will report all SELECTs I see as bugs [09:21:16] Right, I'm going to look at making the readonly patch in a minute, and also at some of the queries in more detail. [09:21:29] if you do a read only now [09:21:49] we can deploy it now - where now is when releng allows it, I just mean ok for me [09:21:49] Just have to quickly shovel some food in my mouth [09:24:02] We have already merged 3 patches around this issue on the extension, 2 using replicas for selection instead of masters, and 1 trying to release connections as early as possible [09:24:19] Would it make sense to debug with the old code first to try to ensure that we have caught the original issue? [09:25:10] not on production [09:25:25] it would be cool if it had its own server [09:25:28] but it doesn't [09:26:08] need a x2 ;) [09:26:20] *an [09:32:32] i do not like your writes at all [09:32:50] INSERT IGNORE INTO cognate_pages [09:33:12] unique index: cgpa_title,cgpa_namespace,cgpa_site [09:43:15] 10DBA: db1022 broke while changing topology on s6- evaluate if to fix or directly decomission - https://phabricator.wikimedia.org/T163778#3234542 (10Marostegui) [09:57:12] jynus: how can I go about explaining the queries? I can't use the sql.php maint script to get to the master per https://phabricator.wikimedia.org/T164469 [09:57:18] back in 15, meeting [10:24:18] jynus: as I can't explain, what is it that you don't like about the writes? [10:33:50] jynus: can we get an unclipped version of https://phabricator.wikimedia.org/P5376 too so that we can see the values in the queries? [11:13:41] addshore, no, for performance reasons (this log is on the servers themselves), we only store the first 100 bytes [11:14:10] I can take a look at the binlog for the ones that went through [11:14:18] but not the ones that were killed [11:17:55] We are still very confused as to what has caused the table level locks in the first place, as nothing in our code does that, and you say innodb does not escalate row locks to table locks.
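[Editor's note: a quick way to tell server-level table lock waits apart from InnoDB row lock waits while debugging something like this; these are standard status counters and commands, nothing Wikimedia-specific, and Table_locks_waited is presumably the counter behind the "table locks waited" Grafana panel linked earlier.]
```
SHOW GLOBAL STATUS LIKE 'Table_locks_waited';  -- server-level table lock waits (LOCK TABLES, DDL, non-InnoDB)
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';    -- row lock waits: current, total, and time spent waiting
SHOW ENGINE INNODB STATUS\G                    -- TRANSACTIONS and LATEST DETECTED DEADLOCK sections
```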
[11:18:56] 10DBA: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3234884 (10Marostegui) [11:19:56] do you retry failed inserts? [11:21:09] nope [11:22:12] the table-level locks could be me killing 60-second inserts [11:22:38] and be a red hearring [11:22:49] How long would those locks last? [11:23:05] I only saw those table -level locks before, not during the event [11:23:21] I guess the next question is what could cause those inserts / deletes to take 60 seconds in the first place [11:23:27] yep [11:24:12] thing is once queries are started to be killed, probably the query killer cannot keep up [11:29:52] on the other side, table-locks-waited ends at 17, and didn't appear on codfw [11:32:33] Is it possible something was messed up with the indexes in eqiad? [11:32:40] unlikely [11:33:12] innodb has this propery that if there is corruption, it automatically shuts down the server [11:33:54] I would like to see some load testing on an equaly sized table [11:34:02] with lots of insert and selects [11:34:14] Is it possible the query execution plan between codfw and eqiad masters was different? [11:34:20] yes [11:34:33] if you did lots of inserts recently [11:34:48] analyze is needed to get up-to-date statistics [11:35:36] but that is a quite farfettched assumption [11:35:43] without proof [11:36:42] marostegui, did you restart db1069:3313 or did it crash? [11:37:40] nope, i didn't touch it [11:37:41] it crashed [11:37:46] :( [11:37:49] right now? [11:37:58] 20 minutes ago [11:38:35] should we enable gtid on db1069? [11:38:47] not very important now [11:38:52] more like in the future [11:38:55] after reloads [11:39:11] mmmm, not sure if it would work [11:39:17] with the filtering and all that [11:40:11] there are some OOM on the dmesg, but don't know when they are from [11:40:49] jynus: maybe not that farfetched as the code had never before run on eqiad? [11:40:59] And when it started running on codfw the tables were empty [11:41:12] [Thu May 4 14:24:21 2017] Killed process 25801 (mysqld) total-vm:62602460kB, anon-rss:21584060kB, file-rss:0kB [11:41:30] addshore, ok, but can you point to which queries are causing that? [11:41:37] that is what I meant [11:41:55] you have to say "I think this query was being slow for X reason" [11:41:55] there it is then [11:42:13] it is in the future, but ok [11:42:28] oh shit [11:42:28] yes [11:42:28] XDD [11:42:50] well, at least -T man says so [11:42:55] Print human readable timestamps. The timestamp could be inaccurate! [11:42:56] XD [11:42:59] #justdmesgTthings [11:44:46] it's systemd's minority report [11:46:00] systemd pre-cog log [11:46:31] 10DBA, 06Operations, 10ops-eqiad: Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission) - https://phabricator.wikimedia.org/T163778#3234912 (10Marostegui) a:03Cmjohnson The host is ready to be decommissioned. What I have done: Removed it from pr... [11:51:07] 10DBA, 06Operations, 10ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3234935 (10Marostegui) @Ottomata remember to: `stop all slaves;` before shutting down MySQL (not a hard requirement, but just in case there is a transaction hanging,... 
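[Editor's note: a sketch of the shutdown sequence the task comment above refers to, assuming a multi-source MariaDB replica such as db1047; on a single-source replica it would simply be STOP SLAVE.]
```
STOP ALL SLAVES;           -- stop the IO and SQL threads of every replication connection
SHOW ALL SLAVES STATUS\G   -- confirm Slave_IO_Running / Slave_SQL_Running show "No"
-- then stop mysqld itself via the host's service manager
```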
[12:31:59] jynus: well it looks like all queries were the issue, 14:20 MW was set to write mode again, and 14:21 the cognate queries start appearing in the log being killed by the slow killer etc. 14:21 is also when the first table level locks appear [12:42:34] codfw even only has 1 servers for x1, hah [12:47:52] addshore, that is now [12:48:13] we just deleted its slave [12:48:23] ack [13:50:58] 10DBA, 06Operations: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235290 (10jcrespo) [13:52:22] 10DBA, 06Operations, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3235306 (10jcrespo) [13:52:24] 10DBA, 06Operations: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#3235308 (10jcrespo) [13:52:26] 10DBA, 13Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#3235307 (10jcrespo) [13:52:29] 10DBA, 10Datasets-General-or-Unknown, 06Release-Engineering-Team, 13Patch-For-Review, and 2 others: Automatize the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459#3235305 (10jcrespo) [13:52:53] 10DBA, 10Datasets-General-or-Unknown, 06Release-Engineering-Team, 13Patch-For-Review, and 2 others: Automatize the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459#1417808 (10jcrespo) [13:52:58] 10DBA, 10MediaWiki-Database, 05MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), and 3 others: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3235311 (10jcrespo) [13:53:17] 10DBA, 06Operations: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3235313 (10jcrespo) [13:53:52] 10DBA, 06Operations, 10ops-eqiad: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3235314 (10Ottomata) @cmjohnson, I'll be on vacation next week, and then at the analytics offsite the following. Coordinate with @elukey if you want to do it next we... [14:35:09] any idea what to do about db1015 [14:35:28] I was thinking on recovering it except cebwiki [14:49:51] 10DBA, 06Labs, 10Tool-Labs, 07Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#3235540 (10jcrespo) [14:55:58] I would try to recover as much as possible indeed to have some stats about how long it takes and so forth [15:02:57] 07Blocked-on-schema-change, 10DBA, 10Wikidata, 13Patch-For-Review, 03Wikidata-Sprint: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539#3235579 (10Marostegui) db2066 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2066.codfw.wmnet... [15:03:55] 10DBA, 10Wikidata, 13Patch-For-Review, 07Schema-change: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548#3235584 (10Marostegui) db2066 is done: ``` root@neodymium:~# mysql --skip-ssl -hdb2066.codfw.wmnet wikidatawiki -... 
[15:05:56] jynus: ok so if I run 'maintain-views --databases wbwikimedia --replace-all --debug' on labsdb1003 atm it still throws '"CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'")' [15:06:01] but that is working when you run it? [15:06:11] but looks like he deleted the user first [15:06:15] (as per his ticket comment) [15:06:21] no, it worked on labsdb1001 [15:06:24] ah [15:06:30] which is what was broken [15:06:42] labsdb1003 was the outlier noted here https://phabricator.wikimedia.org/T164103#3232673 [15:06:45] IIRC labsdb1003 worked last time [15:06:47] or when I tried it [15:06:48] heh [15:06:57] whackamole my friend [15:07:09] no, I can do the same thing I did ont 1001 [15:07:18] that issue is so mad... [15:07:35] could it be some sort of user cache or something? [15:07:45] maybe we can try flushing the user table with read lock or soemthing? [15:07:49] before running the script? [15:07:54] "Failed to find table aft_article_feedback in database wbwikimedia" [15:07:55] jynus: when I ran through it it was def labsdb1003 that didn't work for me [15:08:05] that is not a permission issue [15:08:18] what causes you to see that and where jynus? [15:08:23] I see a different error so wondering [15:08:33] that is just, "don't ask me to create a view that doesn't exist" [15:08:45] it is debug, so I assume it is not an error [15:08:50] sure, that shouldn't happen, there is logic to skip non-existent tables [15:08:52] but atm [15:08:56] [root@labsdb1003 15:08 /root] [15:08:56] # maintain-views --databases wbwikimedia --replace-all --debug [15:08:58] fails w/ [15:09:03] on 1003? [15:09:06] pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'") [15:09:08] right [15:09:34] ok, looking at that [15:09:38] marostegui: yeah idk, open to trying whatever you think [15:09:50] I did try a flush privileges before I noted https://phabricator.wikimedia.org/T164103#3232673 [15:09:56] which I should have communicated [15:10:15] no worries chasemp i didn't have any hope to be honest XD [15:10:41] yep, now it works [15:10:42] I try not to get creative in teh hopes that something jumps out at you guys when you see the errors or timeline or something [15:10:48] jynus: maybe trying flushing the mysql user table with read lock? [15:10:53] no [15:10:54] jynus: flush something? [15:10:54] ah [15:10:56] no [15:11:06] "just drop the user and recreate it" [15:11:06] what caused it to work? [15:11:10] huh [15:11:16] * chasemp bangs head on wall [15:11:16] and literaly it is caused to work [15:11:22] it is not fix [15:11:30] so if we try the command again, it will fail again? [15:11:32] not because it shouldn't work [15:11:44] but because it should work in the first place [15:11:44] marostegui: seems to work thereafter for at least 4 tries [15:11:49] gosh... 
[15:11:51] maintain-views --databases wbwikimedia --replace-all --debug that is [15:11:59] let me try the rest [15:12:17] my theory [15:12:32] is that some cache gets old for dbs that are not yet created [15:12:51] so the permissions are non existent for new dbs [15:12:52] so on 1003 the view creation now worked for all 4 new wiki's [15:12:53] in other workds [15:13:11] we have to create the user when adding new dbs [15:13:15] that is a bug [15:13:21] * chasemp nods [15:13:27] and non existent on the new servers [15:13:33] because we do not use wildcards anymore [15:13:40] we use user-defined lists [15:13:48] initially for security reasons [15:14:01] but probably there is a bug with wildcards [15:14:14] at least on the 10.0.X series [15:14:36] I noted what fixed it and that I ran things https://phabricator.wikimedia.org/T164103#3235624 [15:14:59] thanks [15:15:06] and again, sorry for that hapepning [15:15:07] so just dropping + recreating the maintainviews user fixes the issue…but at some point some months ago the issue was also fixed with flush privileges if I recall correctly? [15:15:18] maybe [15:15:37] it reached a point of frustration that I was just trying things randomly [15:15:49] next time let's try flushing mysql.user table [15:15:53] and if it is cache related [15:15:53] don't know.. [15:15:56] who knows [15:16:01] if it is that I say [15:16:11] it should be the database_priv [15:16:21] *what [15:16:31] we can try both XDD [15:16:32] we can search the bug database [15:16:58] well we know it's at least reocurring and not a total one-off by this point so worth stepping through more slowly next round [15:17:07] I'll ping one you guys from the beginning I think next new wiki db [15:17:46] last time we were more lost [15:17:54] because there were other real-issues mixed [15:17:59] grants missing, etc. [15:18:04] definitely [15:18:25] even this time there was a "user operator" error [15:18:46] well, more like script/bad error message by mysql [15:19:04] hey it's better than teh days when we just didn't do it as no one knew how :D [15:19:07] but it was a real problem (trying to create a view for a database that didnt exist) [15:19:34] now we know more or less where it is coming from [17:00:28] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140543 (10RobH) 05stalled>03Resolved Yes, this is a new system, so it's reclaimed to spares, not decommissioned... [17:45:26] jynus: am I correct in thinking 100% requests to a 'replica' for x1 are currently going to db1029.eqiad.wmnet ? [17:45:53] see my comment [17:46:30] *looks* [17:47:55] wtf.... [17:48:09] I think it is too bad to be true [17:48:21] but I wanted to share it first [17:48:48] Okay, lets see if any of these selects are taking an unusual amount of time too [17:49:43] the only long running query I see on that db is SELECT * FROM `cognate_pages`, but that was 1 instance run by me [17:50:03] I see 39724 "SELECT `cgsi_interwiki` , `cgp ... 
( `cgpa_site` = `cgsi_key` )" [17:50:35] but none last run in the last 8 hours [17:51:12] I get no queries on db1029 [17:51:32] although I see a bunch of idle connections [17:51:51] not necessarily idle [17:52:01] could be fast queries [17:52:16] but I do not see them registering queries on the db monitoring [17:53:11] I got one SELECT /* Cognate\CognateStore::selectLinkDetailsForPage */ [17:53:29] the monitoring seems to not be working [17:53:35] something is wrong there [17:53:45] so my comment was a false alarm [17:56:26] "I got one SELECT /* Cognate\CognateStore::selectLinkDetailsForPage */" yup, thats essentially the only query running [17:56:39] so jynus queries are not just going to the master? or? [17:56:46] no [17:56:49] those are old queries [17:57:07] just there are not queries being properly monitored on the slave [17:57:16] okay [17:58:03] maybe it doesn't register queries lower than some threshold or something [17:58:27] most queries are taking less than 3ms [17:58:34] Per https://grafana.wikimedia.org/dashboard/db/mediawiki-cognate-repo [17:59:23] enable writes on 2 small wikis, see what happens [18:01:24] ack! [18:03:01] mysql tables in use 3, locked 0 [18:04:37] so in the past 30 mins there have only been 3 writes blocked by being in readonly [18:05:04] as the dashboard links counts the write attempts before the read only check [18:05:09] *linked [18:06:07] there are high rates of ps_check_lost_instrumentation [18:06:23] that would explain the lost intrumentation [18:06:27] * addshore does not know what ps_check_lost_instrumentation is [18:06:37] don't worry about it [18:07:52] with such low write rates (1 every 10 minutes) do you see an issue with turning readonly off for all wiktionaries for a period? or perhaps off for enwiktionary and frwiktionary (the two biggest) to make sure we see some writes? [18:08:14] firs small ones [18:08:41] it took only 2 minutes to create issues [18:08:56] so it wasn't a small ramp-up like cxtranslation [18:08:56] Ahh, I was looking at the average per minute rather than total, so average 3 writes per minute, 170 in the past 40 ish mins [18:10:55] I could turn it on for all 'small' wiktionaries, which would be 87 projects. Or choose a couple of 'medium' projects ? [18:11:29] up to you [18:12:46] the connections seem to be going all to the slave [18:12:52] so that is good [18:17:02] * DanielK_WMDE still doesn't see what caused things to fall over in the first place [18:23:35] I see some deadlocks [18:24:26] it is echo [18:24:28] i only see one write every few minutes on grafana. [18:24:34] ah [18:24:51] I am monitoring locking [18:25:12] do writes happen on every edit? [18:25:31] what is the model? [18:25:32] no. page creation, deletion, redirection, un-redirection. [18:25:47] and only for wiktionaries [18:25:48] synchronously or asunc? [18:25:54] so one every few minutes sounds about right [18:26:10] deletion is async, the others are sync [18:26:18] that's not our choice, that's just how the hooks work [18:26:26] I am just asking [18:26:29] not judging [18:26:33] sure, i'm just saying ;) [18:27:02] note that async stuf works by using the hooks to create jobs- but something that you already know [18:27:05] the time the problems started seems to be exactly the time the job runners were enabled [18:27:19] but according to you [18:27:26] only deletions are done like that [18:27:31] yes. well. it's actually a bit more round-about than that. 
but eventually, jobs are created [18:27:34] while I saw many inserts [18:27:39] not many deletions [18:27:50] yes. strange. [18:28:02] the reason I am asking this [18:28:15] is because it could happen that some massive bot job [18:28:25] could have been started and overloaded the extension [18:28:34] yea, but that would mean a massive number of page creations [18:28:46] during the DC switch [18:28:46] we had a massive number of connections [18:28:57] 10000 connections in 2 minutes? [18:29:06] that is a lot [18:29:08] i have no idea how that would happen [18:29:38] adam and me tried to come up with an explanation for an hour. [18:29:40] no idea [18:30:03] so we thought that it would be best to add more logging, and try carefully [18:30:08] another thing that is special about x1 is that it only has one slave [18:30:27] so semisync has to wait for it to perform the writes [18:30:35] I am seeing that now [18:30:40] but it is getting slow on echo [18:30:56] "slow" (think 100 ms top) [18:31:03] not on cognate [18:31:43] the good news is this: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1029&panelId=19&fullscreen&refresh=1m [18:31:56] but I think that is more of my doing [18:32:00] addshore: btw: having deletions done asynchronously is kind of nasty. if a page gets deleted by accident and then undeleted, we follow up on the delete *after* we record the page (re)creation. that causes corruption. [18:32:11] not related to the load issue. just a thought [18:32:27] DanielK_WMDE: ack, well, especially if they end up in the job queue [18:32:42] addshore: which they do these days, afaik [18:32:43] I see 0 table_lock waited [18:32:47] I guess we can just abuse that hook and do the delete in it and not queue any updates [18:33:03] addshore: can't we just do both asynchronously? [18:33:17] addshore: bah, just file a ticket for now, it's not urgent [18:33:26] DanielK_WMDE: ack! [18:33:32] jynus: i can't imagine what would even cause a table lock [18:34:38] one thing that was different is warmup [18:37:12] SELECT /* Cognate\CognateStore::selectLinkDetailsForPage */ [18:37:17] mysql tables in use 3, locked 0 [18:38:14] I'm not convinced we have actually got any writes through yet [18:38:25] we may have to include more projects [18:38:46] addshore: just create a page :D [18:38:48] otherwise the times of the write methods should start showing up on the new grafana dashboard [18:38:56] DanielK_WMDE: people will probably shout at me :P [18:39:08] then delete it again >:D [18:39:21] I don't think I can delete on any wiktionaries [18:39:28] don't you have a testwiktionary? [18:39:53] i don't think so [18:40:01] that is not a thing [18:40:08] so where exactly was this tested on production? [18:40:20] before enabling it on all wiktionaries? [18:41:10] it was tested on beta. i don't think it was tested on the production cluster. was it? [18:41:20] beta is not a test at all [18:41:27] to test it, we would have needed two test wiktionaries [18:41:31] completely different workflow [18:41:40] well, actually, we set up dedicated test instances on labs [18:41:46] it doesn't matter [18:41:46] but yes, different workflow [18:41:52] that is a bug [18:41:55] whatever [18:41:59] but not an excuse [18:42:16] testwiki and test2wiki are precisely for those things [18:42:58] but they run with wikipedia config, not wiktionary.
but i suppose it would have been possible anyway [18:43:17] yes, it was tested on beta and on labs instances but not in prod prod [18:43:20] I see an actionable there [18:43:24] for next time [18:43:37] and yes, not your fault because beta is horrible [18:43:44] but somethig to do [18:43:48] yes, i agree. [18:43:53] specialy for brand new extensions [18:44:18] it is not about you, consider our setup "broken" and highly aggresive [18:44:24] sounds like we need a testwiktionary and a test2wiktionary [18:44:28] and be very very pesimistic [18:44:51] addshore: testwiki and test2wiki would have worked, i think [18:44:59] people do all of crazy things on production [18:45:03] hehe [18:45:05] I see them all the time [18:45:15] DanielK_WMDE: not so sure as they get interwiki links from wikibase right? [18:45:21] the other day someone purged all pages of a large wiki [18:45:25] maybe a test3wiki and test4wiki ;) [18:45:34] the day before someone asked on api every single page of enwiki [18:45:40] DanielK_WMDE: hm... they may, yes... could be done per namespace?... [18:46:50] I understand that it cannot be done gradually because of the nature of the functionality [18:46:54] DanielK_WMDE: ack, could work, i do highly doubt that would have changed anything here though, perhaps it would have meant deployment would have waiting until the switchovers had happened, and then whatever caused this would have just happened at the next switchover! [18:46:57] it is all of none [18:47:19] but maybe something could have been done more slowly [18:47:32] addshore: i agree that it probably wouldn't have helped. but it's still a good idea to test on the prod cluster [18:47:38] jynus: as I don't think we have seen any writes yet from the small wiktionaries how do you feel about me adding the "medium" sized ones? [18:47:42] that should be part of the extension deployment checklist [18:49:10] addshore, you suppose this is switchover related [18:49:16] It's not in the one on https://www.mediawiki.org/wiki/Review_queue#Preparing_for_deployment [18:49:18] and yes, it happened at the same time [18:49:45] but same for cxtranslation, it showed at that time [18:50:17] but there was a bad logic there of retries, only magnified by the slower performance minutes after the change [18:51:02] addshore: well, that test would happen after deployment, before enabling the extension on all wikis... [18:51:14] may be with a comment on the talk page there [18:53:49] DanielK_WMDE: I have added looking into those docs to my todo list [18:54:02] jynus: how could you feel about trying readonly false for medium wikis? [18:54:53] ok to me [18:56:53] btw: Cognate wasn't enabled on all wiktionaries at once. iirc, it was first enabled on a handful of medium wiktionaries. right, addshore? [18:57:28] I can't remember off the top of my head :/ [18:58:22] "I have informed the WikiWoordenboek community and I will see if there is a fast way to purge all the pages involved." [18:58:31] ^that is where issues happen [18:59:00] you shouldn't have encourage people to purge all pages [18:59:42] now that will create issues on cognate and maybe other unrelated stuff [19:02:50] I wouldn't say I encouraged them to do anything, I merely stated the facts, I havn't seen the " I will see if there is a fast way to purge all the pages involved." 
comment yet but we can simply reply saying don't do that and this is why [19:03:30] Okay, now also on medium wiktionaries, let's see if any writes actually happen [19:05:26] Okay, I see a couple of inserts that should have happened now [19:07:33] 285ms for an insert isn't blazingly fast... [19:07:36] but still ok i guess [19:07:54] so could a bunch of deletes being enqueued [19:07:55] Well, that is 1 select and 1 or possible 2 inserts [19:08:02] grafana is confusing. wtf is an average mean? [19:08:08] and that block writes? [19:08:38] in theory, yes... [19:08:47] DanielK_WMDE: it depends which avg your looking at avg upper, is the average upper value per minute for example [19:08:49] do you know whether automatic transactions are flushed between jobs? [19:09:11] addshore: yea, but an average of means is just... odd ;) [19:09:13] I was expecting you to tell me :-) [19:09:37] jynus: no idea. /me looks at AaronSchulz [19:10:39] jynus: you now might see a couple of selects on master (known bug) fixed in https://gerrit.wikimedia.org/r/#/c/351781/ but this hasn't been backported [19:12:21] $lbFactory->commitMasterChanges( __METHOD__ ); // flush any JobQueueDB writes [19:12:35] That's in JobRunner::run [19:13:00] I *think* this flushes the transactions... oooohhhhh.... hm.... not if it's the wrong cluster... [19:13:02] yes, they are [19:13:14] AaronSchulz: hey aaron! [19:13:33] so, if a job writes to another wiki's DB - do the transactions still get flushed? [19:13:43] Even if that DB is on a different cluster? [19:13:47] That would be another LB, right? [19:14:36] that would mean the job runner could hold a bunch of locks on a table, accross the execution of several jobs [19:16:01] AaronSchulz: ? [19:25:31] they all should, yes [19:30:32] Yeh, lbFactory, not just Lb! [19:33:31] jynus: how would you feel about leaving Cognate in it's current state tonight then? [19:33:54] yeah [19:34:09] I see no table locks [19:34:25] We can leave the 'large' wikis for next week [19:34:29] which were happening when I first ping you [19:34:35] pinged [19:34:56] I do not know if we are going to get something in clear [19:35:09] I overreacted and changed the infrastructure [19:35:39] and maybe it is never going to happen with slave/master separation + your patches [19:36:51] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3236866 (10Ladsgroup) [19:39:22] so the patches we made aren't currently backported, only the readonly and the extra monitoring [19:39:47] so at the start of next week we could try switching the infrastructure changes back (perhaps), along with testing on some larger wiktionaries? [19:40:28] I would also suggest convincing someone to do a batch change on one of the enables wikis [19:40:34] and monitor performance [19:40:51] even that guy with the purges [19:43:58] 07Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 06Scoring-platform-team: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530#3236866 (10jcrespo) Thanks. One thing that I have been suggesting lately is not having unique (or primary keys) as... [19:45:12] I can probably trigger a bunch of controlled reparses
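[Editor's note: a minimal sketch of what "monitor performance" during such a controlled batch of reparses could mean at the database level: sample these standard counters on the x1 master before, during and after the run and compare the deltas. The counter selection is a suggestion, not an established procedure.]
```
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Threads_running', 'Innodb_row_lock_current_waits',
   'Innodb_row_lock_waits', 'Innodb_row_lock_time_max',
   'Table_locks_waited');
```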