[06:05:36] DBA, Operations: Increase timeout for mariadb replication check - https://phabricator.wikimedia.org/T163303#3192927 (Marostegui)
[07:00:56] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Schema-change: Add flaggedrevs.fr_user index for UserMerge queries - https://phabricator.wikimedia.org/T105398#3193025 (Marostegui) Is this still needed?
[07:03:13] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Schema-change: Add flaggedrevs.fr_user index for UserMerge queries - https://phabricator.wikimedia.org/T105398#3193026 (hoo) >>! In T105398#3193023, @Marostegui wrote: > Is this still needed? We mostly gave up working on Use...
[07:11:49] DBA, MediaWiki-extensions-FlaggedRevs, MediaWiki-extensions-UserMerge, Schema-change: Add flaggedrevs.fr_user index for UserMerge queries - https://phabricator.wikimedia.org/T105398#3193033 (Marostegui) Open>declined Thanks! I will decline it for now, if in the end this is needed, let's f...
[10:54:13] <_joe_> jynus, marostegui: do you think running the warmup task on codfw would be an issue for you?
[10:54:20] <_joe_> the db requests will peak for a bit
[10:54:50] _joe_, we are actually running our own right now :-)
[10:55:06] as we speak, in a meeting :)
[10:55:39] <_joe_> yeah I know, is me running the appservers warmup task now gonna be a problem?
[10:55:47] nope
[10:56:15] <_joe_> we didn't do a full run since we removed the api warmup, so it's a good idea to just do it to test
[10:56:18] <_joe_> ok, thanks!
[10:56:23] <_joe_> starting in 2 minutes or so
[11:01:22] DBA: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3193235 (Marostegui) @ayounsi have you guys thought about when you want to do this? We are trying to organize ourselves around the days eqiad is going to be on standby.
[11:05:51] <_joe_> I'm done btw, thanks a lot
[11:51:02] there's a few mysql servers which are still on Linux 3.19: db1077, es1011, es1014, es1015, es1016, es1018, es1019, pc1004, pc1005, pc1006. Could we use the forthcoming "eqiad maintenance window" to upgrade these to 4.4 (or ideally 4.9)?
[11:52:07] moritzm: Feel free to add it to: https://wikitech.wikimedia.org/wiki/Switch_Datacenter/planned_db_maintenance (we have lots of tasks to do already, but at least if you can, list it there and we will see if we can get it done)
[11:52:47] We have some high priority ones already, so we'll see how we handle the time :)
[11:53:49] ok, will add it there, thanks
[11:54:52] thanks!
[11:58:57] moritzm: those are jessie?
[12:00:33] yes, all of them
[12:02:06] I would do the masters only
[12:02:19] the others can be done easily at another time
[12:04:55] ok, only es1011 and es1014 are masters, I'll trim the list on the wikitech page
[12:05:09] thanks :)
[12:06:06] moritzm, external storage only has a redundancy of 3 per dc
[12:06:27] so we are extremely conservative with those, more than with regular dbs
[12:07:09] DBA: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3193378 (ayounsi) Scheduled date is the 26th (T148506#3171998). I have the communication to be sent to ops drafted.
[12:20:15] DBA: Network maintenance on row D (databases) - https://phabricator.wikimedia.org/T162681#3193389 (Marostegui) Cool, I will talk to Jaime tomorrow in our weekly meeting and we will try to see how to fit our stuff before/after it. I will keep you posted - thanks!!
[13:54:36] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3193571 (jcrespo) watchlist looks right everywhere I looked, except dbstore1002, which seems to have errors or undeleted rows.
[13:55:27] let's not use this channel for the next hour or so
[13:57:17] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3193590 (Marostegui) Only oldimage pending from this shard then?
[14:01:39] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3193601 (jcrespo) dbstore1002:watchlist and oldimage everywhere, for the hosts I could or wanted to do (normally that means non-lagged dbstores, all old dbs, old master eqiad, new master eqiad and master codfw)....
[14:02:38] DBA, Patch-For-Review: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509#3193606 (Marostegui) >>! In T160509#3193601, @jcrespo wrote: > dbstore1002:watchlist and oldimage everywhere, for the hosts I could or wanted to do (normally that means non-lagged dbstores, all old dbs, old mast...
[16:30:53] I'm checking the load on various shards, I'll report the findings here
[16:32:36] - db2064 (s2 normal slave) has load 22, while db2049 (same role) has load 3
[16:34:56] could it be the buffer pool being dirty?
[16:35:19] load is not that important
[16:35:19] still checking all the shards, the others are loaded too
[16:35:25] more load could be better
[16:35:30] but equally
[16:35:38] if the weight is the same
[16:35:41] and they have the same role
[16:36:15] one is on jessie and 10.0.29 and the other trusty and 10.0.22
[16:36:26] the disk utilization on both is pretty much the same (db2064 and db2049)
[16:36:27] similar QPS
[16:36:31] and it has been going down slowly
[16:43:02] re: one is on jessie and 10.0.29 and the other trusty and 10.0.22 -> db2062 and db2069 are both on 10.0.22 and 10.0.23
[16:44:14] marostegui: I guess jaime was referring to db2064 and db2049
[16:44:27] the older being the one with less load :D
[16:44:53] yeah, but I was trying to see if all the overloaded slaves were showing the same pattern :)
[16:44:56] with the versions
[16:45:30] ah ok, in the other shards there is usually only one with weight 400 so it's not comparable with another one in the same shard :D
[16:46:15] ok, the problem is actually db2049
[16:46:51] DBA, Operations, codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194385 (faidon)
[16:46:53] if you do top you can see that the %CPU used by mysql is flapping between 0 and a real value
[16:47:05] I guess that is why the load is shown as low, but it might be "fake"
[16:48:13] that is weird
[16:49:46] I'm checking other metrics
[16:50:14] I am checking the config between db1072 and db2062
[16:50:18] I am checking https://grafana.wikimedia.org/dashboard/db/performance-metrics?refresh=5m&orgId=1
[16:51:24] latencies are slowly going down
[16:52:35] all the mysql dashboard graphs for db2064/db2049 are of the same shape
[16:52:40] and coherent
[16:52:48] many of those hosts have 40 CPUs
[16:52:54] yes, both of them
[16:52:58] checked
[16:53:14] while newer machines only have a handful of them
[16:53:21] so I guess the load is just wrongly reported and the host will probably need a reboot at some point
[16:53:22] load cannot be compared 1:1
[16:53:32] what about disk utilization?
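Load average on the older 40-CPU machines and on newer hosts with far fewer cores is not comparable 1:1, which is why the discussion above falls back to QPS, disk utilization and the dashboards. A minimal sketch of checking the same pressure from inside MariaDB instead (the host names db2064/db2049 come from the chat; which counters matter most here is a judgment call, not something the log confirms):

    -- Run on each replica (e.g. db2064 and db2049) and compare the results.
    SHOW GLOBAL STATUS LIKE 'Threads_running';                -- statements executing right now
    SHOW GLOBAL STATUS LIKE 'Questions';                      -- sample twice ~10s apart to estimate QPS
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; -- the "buffer pool being dirty" hunch
    -- Checkpoint age (raised later in the conversation) is the difference between
    -- "Log sequence number" and "Last checkpoint at" in:
    SHOW ENGINE INNODB STATUS\G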
[16:53:32] but I guess that can wait until after the switch back
[16:53:38] it is ok, it is just not comparable
[16:53:52] if the load is wrong, could the iowait also be reported wrongly?
[16:54:19] sadly we do not have io on https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown
[16:54:58] the IO seems the same on server-board
[16:55:13] yes, I am using the server-board one
[16:55:34] but iops are low
[16:55:38] I do not see a problem there
[16:55:49] 1K-2K OPS
[16:56:10] disk will be less idle on non-ssd disks
[16:56:13] DBA, Operations, ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194442 (RobH) >>! In T163339#3194384, @Papaul wrote: > db2043 is on xy (c6) > db2061 is on xz (d6) Both of those db hosts are slaves in the s3 and s7 shards, not...
[16:56:59] but it should be more or less the same as db1072, no?
[16:57:54] not really
[16:58:06] eqiad has 3 SSD servers doing 500 load each
[16:58:13] "load"
[16:58:35] those 500 are not comparable
[16:58:41] we can reduce the main load
[16:58:53] so that servers are more specific
[16:59:15] also there used to be more long-running queries before
[16:59:24] now it is serving mostly fast queries
[16:59:53] the thing is, are we slowing things down a lot by having these servers this loaded?
[17:00:08] or are things alright and these servers are just being "used"
[17:00:10] more than the others
[17:00:37] I think a bit of both
[17:00:53] we can alleviate things a bit by setting things with lower consistency
[17:00:58] until the log catches up
[17:01:48] checkpoint age is quite high
[17:01:57] reducing consistency will reduce io
[17:01:57] are you talking about trx?
[17:02:02] and then everything will be smoother
[17:02:13] and we can enable it afterwards
[17:02:15] let me try
[17:02:23] sure, let's try on one host
[17:02:37] db2049?
[17:02:43] or is there any worse?
[17:02:50] db2062 or db2069 I would say
[17:04:37] I also see long running queries with many filesorts
[17:05:10] I checked some of those on db1072 and they had the same plan
[17:06:08] nope
[17:06:29] which one did you check?
[17:06:47] I checked some for the revision table
[17:07:09] https://phabricator.wikimedia.org/P5293
[17:07:30] how's that possible if the table is the same? :|
[17:07:33] it may be just a question of running ANALYZE
[17:07:38] mmmmm
[17:07:41] good point
[17:07:43] because of the new indexes
[17:08:47] one takes minutes
[17:08:54] DBA, Operations, ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194453 (RobH) Ok, mw2215 has been moved, but a3 is still unhappy: X 14.5 Y 8.3, Z 9.6 So now X is quite high, while Z is back to a more normal rate.
[17:08:57] the other 0.00 seconds
[17:09:24] so we depool one server at a time, and then run analyze
[17:09:28] which will take 3 days
[17:09:32] I am trying to find the query I checked
[17:09:58] or we can add a force index temporarily
[17:10:12] how long did it take last time to run the analyze?
[17:10:33] 3 days
[17:10:41] + catchup
[17:10:46] *catching up
[17:11:01] Then it is probably faster to use the force, I guess?
[17:12:01] or updating the index statistics in the engine-independent statistics
[17:12:08] (it could work)
[17:12:25] if we copy them from the api server for enwiki and put them on the other server
[17:12:28] I have never done that so I trust you :)
[17:13:50] if that works, that is a hell of a hack!
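The "hack" being floated here is MariaDB's engine-independent table statistics: instead of a multi-day ANALYZE, the persistent statistics rows for the revision table could be exported from a host whose query plan is good and loaded on the replica that filesorts. A hedged sketch of the idea, assuming the enwiki.revision table mentioned in the chat (the exact procedure used that day is not shown in the log):

    -- On a donor host with a good plan (e.g. db1072): dump the statistics rows.
    SELECT * FROM mysql.table_stats  WHERE db_name = 'enwiki' AND table_name = 'revision';
    SELECT * FROM mysql.column_stats WHERE db_name = 'enwiki' AND table_name = 'revision';
    SELECT * FROM mysql.index_stats  WHERE db_name = 'enwiki' AND table_name = 'revision';

    -- On the misbehaving replica (e.g. db2062): load those rows into the same
    -- mysql.* tables, make the optimizer prefer them, and reopen the table so
    -- they are picked up (the "flush tables" mentioned just below).
    SET GLOBAL use_stat_tables = 'preferably';
    FLUSH TABLES enwiki.revision;

The appeal is speed: copying a handful of statistics rows takes seconds, whereas ANALYZE on the full table was quoted above at roughly three days plus replication catch-up.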
[17:15:06] marostegui, please find one server where that clearly happens
[17:15:08] DBA, Operations, ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194475 (RobH) >>! In T163339#3194453, @RobH wrote: > Ok, mw2215 has been moved, but a3 is still unhappy: > > X 14.5 Y 8.3, Z 9.6 > > So now X is quite high, whil...
[17:15:11] the one I mentioned
[17:15:13] or another
[17:15:21] and I will try to copy from one where it works
[17:15:41] so db2062 and db2069 do have that issue
[17:15:48] I was checking for one with a small revision table
[17:16:09] where maybe we can try the analyze and see if it fixes it
[17:16:17] but for now, db2062 and db2069 are the ones
[17:16:21] but let me confirm
[17:17:12] not sure whether to export from db1072 or from the last server we ran analyze on
[17:17:37] well, if it doesn't happen on db1072, that should be enough, no?
[17:17:39] to see improvements
[17:17:50] maybe it was 72 itself?
[17:18:36] would that be logged?
[17:18:39] on SAL?
[17:18:52] the last thing I see
[17:18:55] is me pooling it back
[17:18:57] after the alter
[17:19:13] DBA, Operations, ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194503 (Papaul) I have no opening on yz
[17:19:56] let's try with db1072
[17:20:06] if it fixes the query you pasted, it means it worked, I guess
[17:20:55] it requires a mysql patch, but I am not sure if it has been applied yet
[17:21:16] maybe it was on 10.0.23
[17:24:57] marostegui: https://phabricator.wikimedia.org/P5293#28323
[17:25:16] DBA, Operations, ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194534 (Papaul) moving msw-a3-codfw from yz to xy
[17:25:17] nice
[17:26:19] DBA, Operations, codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194538 (Marostegui) Update: we have found that the query plan isn't the same for all the queries: https://phabricator.wikimedia.org/P5293 We believe this is cause...
[17:26:32] then, according to documentation, I have to run flush tables :-(
[17:26:33] ^ updated that so p*ravoid knows what is going on
[17:26:37] oh :(
[17:26:49] and of course, things can get worse
[17:27:00] even without taking that into account
[17:27:08] how do you see it?
[17:27:48] I think we should try
[17:28:07] Maybe on a different wiki?
[17:28:09] a smaller one?
[17:28:27] no, the issue only happens here
[17:28:46] if things get much worse, we depool and do it the old-fashioned way
[17:29:02] yeah
[17:29:06] you want me to have the patch ready?
[17:29:57] no, have the kill queries ready
[17:30:02] for the metadata locking
[17:30:27] ok, one sec
[17:30:29] which host?
[17:30:30] ok, it only took 3 seconds
[17:30:31] db2062?
[17:30:32] :-)
[17:30:33] oh
[17:30:34] nice
[17:30:34] XD
[17:30:48] did it work?
[17:30:58] db2062 is still showing filesorts :(
[17:31:28] it didn't, still filesorts
[17:31:40] yes :(
[17:32:43] although now other queries are bad
[17:33:08] let's depool and run analyze
[17:37:38] I am playing with SET GLOBAL use_stat_tables
[17:37:45] to see if it gets better or worse
[17:38:03] I am getting the patch ready just in case we decide to run it
[17:39:47] https://gerrit.wikimedia.org/r/#/c/348970/ that is it
[17:40:04] I think db2069 is a bit worse now?
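Since the statistics copy did not make the filesorts go away, the fallback discussed is depooling the replica and running a plain ANALYZE, with kill queries kept ready in case FLUSH TABLES piles sessions up behind its metadata lock. A sketch of both pieces (the query for finding blocked threads is illustrative; the actual kill commands used are not in the log):

    -- Threads stuck behind the FLUSH TABLES metadata lock, oldest first.
    SELECT id, time, state, LEFT(info, 80) AS query
    FROM information_schema.processlist
    WHERE state LIKE '%metadata lock%'
    ORDER BY time DESC;
    -- KILL <id>;   -- per offending thread, if the flush starts blocking traffic

    -- The slow but reliable fallback, once the replica is depooled:
    ANALYZE TABLE enwiki.revision;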
[17:41:05] the graphs still look pretty much the same as db2062's
[17:41:48] DBA, Operations, ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194681 (RobH) Ok, things shifted drastically: X/Y/Z are at 9/9/14. I've suggested to @papaul we pick one mw system off xz, and one off yz, and move them both onto...
[17:43:01] I've just run flush tables again, just in case
[17:43:17] I still see filesorts :(
[17:47:22] DBA, Operations, Patch-For-Review, codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194699 (Marostegui) The first hack hasn't worked as expected. We are thinking about just depooling the slave and running the normal analyze tab...
[17:53:55] I will be back in around 30 minutes, I need to eat something, I am starving
[17:54:43] the patch is there if you decide to go ahead and run the analyze (I am all for leaving it running today as the hack didn't work, unfortunately)
[18:16:23] jynus: I am back, shall I deploy and run analyze?
[18:18:39] I don't know
[18:19:14] we don't have many alternatives
[18:19:38] leave it as is
[18:19:59] I thought, what if we pool db1072 while we take one of the codfw slaves out? is that too crazy in terms of latency?
[18:20:18] what is worse, the current state or the one with one less server?
[18:20:28] yes, that is my fear
[18:21:05] well, we can try to depool it and see what happens to the other two overnight
[18:21:17] depool without running anything, in case we need it back
[18:22:11] let's do one thing
[18:22:17] let's leave it like this for now
[18:22:18] depooling without running anything is the worst of the 3 alternatives
[18:22:26] tomorrow morning I will depool it
[18:22:30] and see how the others behave
[18:22:47] I don't think it is the worst of the 3 alternatives, because that way we can see how the other two behave alone
[18:22:51] and if needed, we can pool it back quickly
[18:22:59] and it won't be delayed or anything
[18:23:07] and we could do the same if we run analyze, except for the lag
[18:23:31] yes, that is the thing: if we find out the other two cannot handle it, we cannot pool it back because it will be delayed
[18:25:40] DBA, Operations, netops, ops-codfw: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3194902 (RobH) row a done
[18:30:15] let's leave it as it is now and discuss tomorrow then
[18:30:20] I am going to go off and rest
[18:33:33] DBA, Operations, ops-codfw: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3194977 (Marostegui) >>! In T163339#3194442, @RobH wrote: >>>! In T163339#3194384, @Papaul wrote: >> db2043 is on xy (c6) >> db2061 is on xz (d6) > > Both of those...
[19:05:00] DBA, AbuseFilter, Performance-Team, MW-1.27-release (WMF-deploy-2015-10-13_(1.27.0-wmf.3)), and 5 others: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557#3195162 (jcrespo) p:Normal>High I think this got worse on failover- the query planner gets confus...
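"Still seeing filesorts" here means the optimizer keeps choosing a plan that sorts the result set instead of walking an index that already provides the ORDER BY. A hedged way to confirm that on both hosts, using a made-up query in the shape of the ones discussed (the real queries are in paste P5293 and are not reproduced here):

    -- Compare the plan on db1072 (good) and db2062/db2069 (bad); the columns and
    -- predicates below are an assumption about the query shape, not the real query.
    EXPLAIN SELECT rev_id, rev_timestamp
    FROM enwiki.revision
    WHERE rev_page = 12345
    ORDER BY rev_timestamp DESC
    LIMIT 50;
    -- A bad plan shows "Using filesort" in the Extra column; a good one reads an
    -- index ordered by (rev_page, rev_timestamp) and needs no extra sort.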
[19:05:21] DBA, AbuseFilter, Performance-Team, MW-1.27-release (WMF-deploy-2015-10-13_(1.27.0-wmf.3)), and 5 others: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557#3195164 (jcrespo)
[19:05:22] DBA, Operations, Patch-For-Review, codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3195165 (jcrespo)
[19:06:15] DBA, Operations, Patch-For-Review, codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3194385 (jcrespo) This was an old friend^ we should either index hint it, send it to RC or something soon - it is now failing too often. See su...
[19:29:07] DBA, AbuseFilter, Performance-Team: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557#3195254 (Krinkle)
[20:17:00] DBA, Analytics: Json_extract available on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T156681#3195441 (Tbayer) Any updates on this? With the recent rollout of T153207, which introduced a JSON field to all EventLogging tables, the need for [[https://mariadb.com/kb/en/mariadb/json-func...
[20:48:00] DBA, Operations, netops, ops-codfw: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3195612 (RobH) row b done
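The index hint mentioned in jcrespo's T163351 comment above means forcing the optimizer onto a known-good index so the plan no longer depends on the currently misleading statistics. A sketch of what that could look like for a revision-style query; the index name page_timestamp and the query shape are assumptions for illustration, not the actual AbuseFilter or API query:

    SELECT rev_id, rev_timestamp
    FROM enwiki.revision FORCE INDEX (page_timestamp)
    WHERE rev_page = 12345
    ORDER BY rev_timestamp DESC
    LIMIT 50;

The usual trade-off with hints applies: the query stops regressing when statistics drift, but it also stops benefiting if a better index is added later.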