[02:48:57] 10DBA, 10Phabricator, 10Patch-For-Review: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3603057 (10jcrespo) a:03jcrespo @mmodell We have to upgrade the hardware for phabricator databases. What do you think of also doing a master switchover this Thursday and an upgrade to stretch/mariadb...
[04:26:31] 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603280 (10jcrespo)
[04:27:34] 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3599964 (10jcrespo)
[04:27:38] 10DBA, 10Operations, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3603287 (10jcrespo)
[04:36:50] 10DBA, 10Phabricator, 10Patch-For-Review: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603291 (10jcrespo) db1048 is now ready to be decommissioned; it is set as spare, but it still needs to be fully deleted from the configuration and infrastructure (installer...
[04:37:02] 10DBA, 10Phabricator, 10Patch-For-Review: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603293 (10jcrespo) p:05Normal>03Low
[04:37:23] 10DBA, 10Operations, 10Phabricator, 10ops-eqiad: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3603295 (10jcrespo)
[05:03:56] the replag on the new Cloud mariadb boxes looks kind of funky -- https://tools.wmflabs.org/replag/ -- it seems to bounce around a lot if you reload repeatedly.
[05:04:19] I'm wondering if the number there is not really seconds but something much smaller.
[09:28:56] yeah, now that I think about it, I created the new views as microseconds/1000
[09:29:11] while it should be microseconds/10^6
[09:42:55] bd808: working now, I have tested it by creating replication lag
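For context on the view fix discussed above: the heartbeat timestamp has microsecond resolution, so converting it to seconds means dividing by 10^6; dividing by 1000 instead produces numbers a thousand times too large, which matches the "bouncing" values seen on the replag page. A minimal Python sketch of the arithmetic (illustrative only; the real fix was in the SQL view definition, and the names below are made up):

```python
# Illustrative only: the actual fix was in the SQL view on the Cloud replicas,
# not in Python. Assumes a heartbeat timestamp in microseconds since the epoch.
import time

MICROSECONDS_PER_SECOND = 10**6

def lag_seconds(last_heartbeat_us, now_us=None):
    """Replication lag in seconds from a microsecond-resolution heartbeat."""
    if now_us is None:
        now_us = int(time.time() * MICROSECONDS_PER_SECOND)
    return (now_us - last_heartbeat_us) / MICROSECONDS_PER_SECOND  # not / 1000

# Example: a heartbeat written 2.5 s ago reports 2.5, not 2500.
print(lag_seconds(int(time.time() * 10**6) - 2_500_000))
```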
[11:27:55] db1070 is very unhappy
[11:28:14] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=now-1h&to=now
[11:30:43] let me see
[11:32:24] happening since 10:30, apparently
[11:32:39] do you think it is worth pausing the script as a test?
[11:33:06] (based on the timestamp)
[11:34:00] let me try something first
[11:34:14] jynus: IDK; if you think I should, I can do it
[11:34:32] jynus: I can make it slower
[11:36:07] I think I will disable semisync on db1082, db1087, db1092, db1096, db1099, and db1100
[11:37:51] I am also not sure it should have a load of 50
[11:37:59] being the slow slave
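As a rough sketch of the semisync change mentioned above (not the exact procedure used in production): on a MariaDB/MySQL replica with the semi-sync plugin loaded, disabling it amounts to turning off rpl_semi_sync_slave_enabled and restarting the replication IO thread so the new setting takes effect. The host list is copied from the conversation; the account and password are placeholders:

```python
# Sketch only: assumes the semisync plugin is loaded on each replica and that a
# suitable admin account exists. Short hostnames are taken from the chat above;
# real connections would need full hostnames.
import pymysql

REPLICAS = ["db1082", "db1087", "db1092", "db1096", "db1099", "db1100"]

for host in REPLICAS:
    conn = pymysql.connect(host=host, user="admin", password="********")  # placeholder credentials
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = OFF")
        # The change only applies to a new IO thread connection, so restart it.
        cur.execute("STOP SLAVE IO_THREAD")
        cur.execute("START SLAVE IO_THREAD")
    conn.close()
```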
[11:44:57] Amir1: check if there is any change on s5 lagging
[11:46:01] I will also remove load from db1070
[11:52:47] jynus: It seems all slaves are lagging behind, db1070 is the worst
[11:54:11] yeah, even db1082, which is our largest server
[11:54:31] let me merge ongoing patches
[11:54:38] and we will probably pause the script
[11:55:05] (assuming it is that based on the times, but it could be something else, or the script on top of something else)
[11:56:09] okay, also we can make it sleep more
[11:56:17] your choice
[11:56:37] my point is that if we pause and it recovers, we can do that
[11:56:48] if we pause and it continues, it is not the script, so no point
[11:57:29] meanwhile, can you check the impact on users, e.g. if recentchanges gets behind or something
[11:57:47] and/or if there is some bot with intensive writes
[11:58:47] jynus: sure
[11:59:16] It is not getting better :-/ https://logstash.wikimedia.org/goto/19b61d24a1798262475773fcf88e7581
[12:00:02] 6 seconds for db1071
[12:01:43] it could be getting better now?
[12:02:16] (there is some lag between my actions and monitoring, probably)
[12:02:51] jynus: there is one user with unbelievable speed
[12:02:59] there are two with really high rates
[12:03:11] what is unbelievable?
[12:03:20] > 600 per minute?
[12:03:59] 300
[12:04:43] let's try shutting down the script, even if it is not the source
[12:04:51] it could help reduce writes
[12:05:06] okay
[12:05:13] I warned the user
[12:05:36] no need yet
[12:06:10] the idea being that we try to solve it first on our own, even if "etiquette" is being violated
[12:06:35] then we bother editors
[12:08:07] jynus: the user throttled it now
[12:08:44] that is bad
[12:08:56] I stopped the script, now we do not know which of the 2 it is
[12:09:25] jynus: funnily enough, it's not recovering
[12:09:31] it is
[12:09:43] logstash is more up to date
[12:10:00] grafana has up to 2 minutes of lag due to aggregation
[12:10:13] oh thanks for telling me
[12:10:19] noted
[12:10:33] it only shows data every minute
[12:10:43] of the last whole minute
[12:11:01] so 2 or more minutes to observe things, at least for mysql metrics
[12:11:19] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-12h&to=now
[12:11:20] plus it uses MySQL's SHOW SLAVE STATUS, which is not 100% reliable
[12:11:32] I am looking at mediawiki reports
[12:11:37] It might not be related, but the jobqueue also started to grow badly
[12:11:48] which use pt-table-checksum and should be collected more frequently
[12:12:02] Amir1: I think it is totally related
[12:12:15] but not with the script, with the high edit rate
[12:12:41] the script happens to create more overhead on top of existing high throughput
[12:12:47] that is my belief
[12:17:25] the other hosts seem ok now; db1070 still seems unusually slow, I will check if it has hw issues or something
[12:18:52] I think some of the slow queries may be hitting it too hard
[12:22:44] yeah, the long query combined with the high update rate made it increase the purge lag a lot, significantly reducing its performance
[12:22:51] hardware is ok
[12:23:58] maybe we should pool it with 0 weight?
[12:24:15] https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1070&var-port=9104&from=1505287342465&to=1505305264566
[12:24:27] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1070&var-port=9104&from=1505287342465&to=1505305264566
[12:40:43] jynus: the lags are back
[12:54:44] 10DBA, 10MediaWiki-Database, 10Availability, 10Performance-Team (Radar): wfWaitForSlaves in JobRunner can massively slow down run rate if just a single slave is lagged - https://phabricator.wikimedia.org/T95799#3604189 (10Peter)
[13:05:34] and it aligns with the start of the script, let's slow it down
[14:16:03] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3604472 (10Papaul) @jcrespo this is assigned to me; is there anything I have to do on my side? Thanks.
[14:17:40] 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3604473 (10jcrespo) a:05Papaul>03None I think the assignment is an accident because it was created as a subticket of another ticket; nothing to do here yet for you. Sorry for the distraction.
[17:20:11] 10DBA, 10Operations, 10Phabricator, 10ops-eqiad: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3605096 (10mmodell) @jcrespo Any time will work for me, there is scheduled maintenance at midnight tonight (UTC) but if it's just a few seconds of downtime I think...
[17:53:50] 10DBA, 10Operations, 10Phabricator, 10ops-eqiad: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3605223 (10jcrespo) Let's wait a bit more. I may have to talk to you about setting up TLS for php and changing passwords, let's talk and aim for next week (but we sh...
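On the TLS point in the last comment: the Phabricator side of this is PHP configuration and is not shown in the log, but as a loose illustration of what requiring TLS on the database connection looks like from a client, here is a hedged Python sketch; the host, account, database name and CA path are all placeholders:

```python
# Hypothetical values throughout; the actual Phabricator setup is PHP-side and
# is not part of this log.
import pymysql

conn = pymysql.connect(
    host="m3-master.example.internal",   # placeholder, not a real hostname
    user="phab_app",                     # placeholder account
    password="********",
    database="phabricator_example",
    ssl={"ca": "/path/to/ca.pem"},       # enforce TLS by validating against a pinned CA
)
with conn.cursor() as cur:
    cur.execute("SHOW STATUS LIKE 'Ssl_cipher'")  # a non-empty value means the session is encrypted
    print(cur.fetchone())
conn.close()
```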