[02:48:57] 10DBA, 10Phabricator, 10Patch-For-Review: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3603057 (10jcrespo) a:03jcrespo @mmodell We have to upgrade the hardware for phabricator databases. What do you think of also doing a master switchover this Thursday and an upgrade to stretch/mariadb...
[04:26:31] 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603280 (10jcrespo)
[04:27:34] 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3599964 (10jcrespo)
[04:27:38] 10DBA, 10Operations, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3603287 (10jcrespo)
[04:36:50] 10DBA, 10Phabricator, 10Patch-For-Review: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603291 (10jcrespo) db1048 is now ready to be decommissioned; it is set as spare, but it still needs to be fully deleted from the configuration and infrastructure (installer...
[04:37:02] 10DBA, 10Phabricator, 10Patch-For-Review: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603293 (10jcrespo) p:05Normal>03Low
[04:37:23] 10DBA, 10Operations, 10Phabricator, 10ops-eqiad: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603295 (10jcrespo)
[05:03:56] the replag on the new Cloud mariadb boxes looks kind of funky -- https://tools.wmflabs.org/replag/ -- it seems to bounce around a lot if you reload repeatedly.
[05:04:19] I'm wondering if the number there is not really seconds but something much smaller.
[09:28:56] yeah, now that I think about it, I created the new views as microseconds/1000
[09:29:11] while it should be microseconds/10^6
[09:42:55] bd808: working now, I have tested it by creating replication lag
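For context on the view fix discussed above: the heartbeat timestamp has microsecond resolution, so converting it to seconds means dividing by 10^6; dividing by 1000 instead produces numbers a thousand times too large, which matches the "bouncing" values seen on the replag page. A minimal Python sketch of the arithmetic (illustrative only; the real fix was in the SQL view definition, and the names below are made up):

```python
# Illustrative only: the actual fix was in the SQL view on the Cloud replicas,
# not in Python. Assumes a heartbeat timestamp in microseconds since the epoch.
import time

MICROSECONDS_PER_SECOND = 10**6

def lag_seconds(last_heartbeat_us, now_us=None):
    """Replication lag in seconds from a microsecond-resolution heartbeat."""
    if now_us is None:
        now_us = int(time.time() * MICROSECONDS_PER_SECOND)
    return (now_us - last_heartbeat_us) / MICROSECONDS_PER_SECOND  # not / 1000

# Example: a heartbeat written 2.5 s ago reports 2.5, not 2500.
print(lag_seconds(int(time.time() * 10**6) - 2_500_000))
```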
[11:27:55] db1070 is very unhappy
[11:28:14] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=now-1h&to=now
[11:30:43] let me see
[11:32:24] happening since 10:30, apparently
[11:32:39] do you think it is worth pausing the script as a test?
[11:33:06] (based on the timestamp)
[11:34:00] let me try something first
[11:34:14] jynus: IDK; if you think I should, I can do it
[11:34:32] jynus: I can make it slower
[11:36:07] I think I will disable semisync on db1082, db1087, db1092, db1096, db1099, and db1100
[11:37:51] I am also not sure it should have a load of 50
[11:37:59] being the slow slave
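As a rough sketch of the semisync change mentioned above (not the exact procedure used in production): on a MariaDB/MySQL replica with the semi-sync plugin loaded, disabling it amounts to turning off rpl_semi_sync_slave_enabled and restarting the replication IO thread so the new setting takes effect. The host list is copied from the conversation; the account and password are placeholders:

```python
# Sketch only: assumes the semisync plugin is loaded on each replica and that a
# suitable admin account exists. Short hostnames are taken from the chat above;
# real connections would need full hostnames.
import pymysql

REPLICAS = ["db1082", "db1087", "db1092", "db1096", "db1099", "db1100"]

for host in REPLICAS:
    conn = pymysql.connect(host=host, user="admin", password="********")  # placeholder credentials
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = OFF")
        # The change only applies to a new IO thread connection, so restart it.
        cur.execute("STOP SLAVE IO_THREAD")
        cur.execute("START SLAVE IO_THREAD")
    conn.close()
```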
[11:44:57] Amir1: check if there is any change on s5 lagging
[11:46:01] I will also remove load from db1070
[11:52:47] jynus: It seems all slaves are lagging behind, db1070 is the worst
[11:54:11] yeah, even db1082, which is our largest server
[11:54:31] let me merge ongoing patches
[11:54:38] and we will probably pause the script
[11:55:05] (assuming it is that based on the times, but it could be something else, or the script on top of something else)
[11:56:09] okay, also we can make it sleep more
[11:56:17] your choice
[11:56:37] my point is that if we pause and it recovers, we can do that
[11:56:48] if we pause and it continues, it is not the script, so no point
[11:57:29] meanwhile, can you check the impact on users, e.g. if recentchanges gets behind or something
[11:57:47] and/or if there is some bot with intensive writes
[11:58:47] jynus: sure
[11:59:16] It is not getting better :-/ https://logstash.wikimedia.org/goto/19b61d24a1798262475773fcf88e7581
[12:00:02] 6 seconds for db1071
[12:01:43] it could be getting better now?
[12:02:16] (there is some lag between my actions and monitoring, probably)
[12:02:51] jynus: there is one user with unbelievable speed
[12:02:59] there are two with really high rates
[12:03:11] what is unbelievable?
[12:03:20] > 600 per minute?
[12:03:59] 300
[12:04:43] let's try shutting down the script, even if it is not the source
[12:04:51] it could help reduce writes
[12:05:06] okay
[12:05:13] I warned the user
[12:05:36] no need yet
[12:06:10] the idea being that we try to solve it first on our own, even if "etiquette" is being violated
[12:06:35] then we bother editors
[12:08:07] jynus: the user throttled it now
[12:08:44] that is bad
[12:08:56] I stopped the script, now we do not know which of the 2 it is
[12:09:25] jynus: funnily enough, it's not recovering
[12:09:31] it is
[12:09:43] logstash is more up to date
[12:10:00] grafana has up to 2 minutes of lag due to aggregation
[12:10:13] oh thanks for telling me
[12:10:19] noted
[12:10:33] it only shows data every minute
[12:10:43] of the last whole minute
[12:11:01] so 2 or more minutes to observe things, at least for mysql metrics
[12:11:19] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-12h&to=now
[12:11:20] plus it uses MySQL's SHOW SLAVE STATUS, which is not 100% reliable
[12:11:32] I am looking at mediawiki reports
[12:11:37] It might not be related, but the jobqueue also started to grow badly
[12:11:48] which use pt-table-checksum and should be collected more frequently
[12:12:02] Amir1: I think it is totally related
[12:12:15] but not with the script, with the high edit rate
[12:12:41] the script happens to create more overhead on top of existing high throughput
[12:12:47] that is my belief
[12:17:25] the other hosts seem ok now; db1070 still seems unusually slow, I will check if it has hw issues or something
[12:18:52] I think some of the slow queries may be hitting it too hard
[12:22:44] yeah, the long query combined with the high update rate made it increase the purge lag a lot, significantly reducing its performance
[12:22:51] hardware is ok
[12:23:58] maybe we should pool it with 0 weight?
[12:24:15] https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1070&var-port=9104&from=1505287342465&to=1505305264566
[12:24:27] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1070&var-port=9104&from=1505287342465&to=1505305264566
[12:40:43] jynus: the lags are back
[12:54:44] 10DBA, 10MediaWiki-Database, 10Availability, 10Performance-Team (Radar): wfWaitForSlaves in JobRunner can massively slow down run rate if just a single slave is lagged - https://phabricator.wikimedia.org/T95799#3604189 (10Peter)
[13:05:34] and it aligns with the start of the script, let's slow it down
[14:16:03] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3604472 (10Papaul) @jcrespo this is assigned to me; is there anything I have to do on my side? Thanks.
[14:17:40] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3604473 (10jcrespo) a:05Papaul>03None I think the assignment is an accident because it was created as a subticket of another ticket; nothing to do here yet for you. Sorry for the distraction.
[17:20:11] 10DBA, 10Operations, 10Phabricator, 10ops-eqiad: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3605096 (10mmodell) @jcrespo Any time will work for me, there is scheduled maintenance at midnight tonight (UTC) but if it's just a few seconds of downtime I think...
[17:53:50] 10DBA, 10Operations, 10Phabricator, 10ops-eqiad: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3605223 (10jcrespo) Let's wait a bit more. I may have to talk to you about setting up TLS for php and changing passwords, let's talk and aim for next week (but we sh...
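On the TLS point in the last comment: the Phabricator side of this is PHP configuration and is not shown in the log, but as a loose illustration of what requiring TLS on the database connection looks like from a client, here is a hedged Python sketch; the host, account, database name and CA path are all placeholders:

```python
# Hypothetical values throughout; the actual Phabricator setup is PHP-side and
# is not part of this log.
import pymysql

conn = pymysql.connect(
    host="m3-master.example.internal",   # placeholder, not a real hostname
    user="phab_app",                     # placeholder account
    password="********",
    database="phabricator_example",
    ssl={"ca": "/path/to/ca.pem"},       # enforce TLS by validating against a pinned CA
)
with conn.cursor() as cur:
    cur.execute("SHOW STATUS LIKE 'Ssl_cipher'")  # a non-empty value means the session is encrypted
    print(cur.fetchone())
conn.close()
```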