[02:16:24] system load on labsdb1003 is crazy high (27+) and we've got >24h of replag on s1, s2, and s4. I don't know how, or if I even can, see what queries are running and causing the load.
[02:23:31] can you not log in to it as an SQL user?
[02:32:35] I don't have any privileged account there AFAIK
[04:12:00] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#2825398 (10Bawolff) I suspect that fixing this bug will significantly help with T171027 (watchlists being too slow)
[10:26:05] 10DBA, 10Data-Services: Significant replication lag for the s1, s2, and s4 wikis on labsdb100[13] - https://phabricator.wikimedia.org/T175487#3594707 (10jcrespo) From the labsdb1001 log: ``` InnoDB: We intentionally crash the server, because it appears to be hung. 2017-09-09 18:21:02 7fe8cdbfe700 InnoDB: Asse...
[10:33:31] 10DBA, 10Data-Services: Significant replication lag for the s1, s2, and s4 wikis on labsdb100[13] - https://phabricator.wikimedia.org/T175487#3595894 (10jcrespo) The actual replication problem comes from labsdb1001 running out of space over the weekend - I will search for what is to blame. I doubt it is the repl...
[13:47:30] Amir1: around?
[13:47:36] jynus: yup
[13:48:34] sorry, not sure if I pinged the wrong person
[13:48:52] amir ladsgroup?
[13:48:57] yup, that's me
[13:49:06] We talk every day :)
[13:49:22] yeah, with a different IRC nick every day :-)
[13:49:40] and there are a couple of amirs in the WMF community :-)
[13:49:49] Most of the time my nick is Amir1; I changed it to goatification once :D
[13:49:53] ok ok
[13:50:25] so I wanted to ask: how is the wikidata job going?
[13:50:49] jobqueue issue or the puppet thing?
[13:50:53] maybe you have a better estimate of it finishing after running
[13:51:07] no, it is because of database pressure on s5
[13:51:07] (puppet cronjob)
[13:51:21] after running for some days
[13:51:26] oh, that. It's at Q17M
[13:51:38] so it's around 50% done
[13:51:39] what is that roughly in time?
[13:51:51] I do not need an exact date
[13:51:56] just an approximation
[13:52:23] my guess is around two weeks
[13:52:26] or less
[13:52:43] I can make it a little bit faster (around 10%) if that's not putting pressure on s5
[13:52:49] always give your worst possible scenario
[13:53:01] for sysadmin-related stuff :-)
[13:53:13] 2-3 weeks?
[13:54:20] three weeks tops
[13:54:29] I think our backups are getting a bit delayed, around 1 day per week
[13:54:35] due to s5
[13:54:44] we can wait around 4 weeks
[13:55:09] but if it goes over that we may have to pause some writes for some time
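(Editorial aside on the question at the top of the log about seeing which queries are causing the load: a minimal sketch of what a privileged MariaDB account could run on a host such as labsdb1003 or dbstore1001. It assumes the PROCESS and REPLICATION CLIENT privileges and uses only standard MySQL/MariaDB statements; it is not a documented WMF runbook.)

```
-- Everything currently running, including full query text
-- (requires the PROCESS privilege to see other users' sessions).
SHOW FULL PROCESSLIST;

-- Replication status for the local replica, including Seconds_Behind_Master
-- (requires REPLICATION CLIENT; \G is mysql command-line client syntax).
SHOW SLAVE STATUS\G
```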
[13:55:26] okay, are there any plans to move dewiki out? It makes me a little worried about that wiki (high lag, etc.)
[13:55:33] yes
[13:55:42] most likely that will be done in Q2
[13:55:52] technically this is a good thing
[13:55:53] Awesome, thanks
[13:56:04] because it means that wikidata is going faster
[13:56:14] without affecting production
[13:56:21] but it is affecting the old backup hosts
[13:56:29] we are also going to renew those
[13:58:08] this is the graph: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=dbstore1001&var-port=9104&from=now-7d&to=now
[13:59:07] it actually recovers, but by the time the lag goes down, the new backup starts
[14:00:18] okay, I'm thinking about making the job faster (to reduce the time to two weeks)
[14:00:24] the lag will go up, I guess
[14:00:38] yeah, I would not touch it
[14:00:51] I am not asking you to do anything
[14:01:04] I just wanted to know if it was going to take 2 more months
[14:01:11] nooo
[14:01:19] in which case we would have to do something
[14:01:47] okay, let me know if it's not going well
[14:01:55] 2 weeks is ok, especially as this host is supposed to be one day behind
[14:03:25] if elukey is also around
[14:04:07] have a look at my conversation with Amir, this is probably related to the high iops you are "suffering" on dbstore1002 (T156933)
[14:04:08] T156933: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933
[14:05:28] sorry, luca, I think there you mention db1047, which does not replicate s5
[14:09:28] dbstore1002 is also showing high disk usage all the time; reading, thanks :)
[14:48:54] jynus: is the high s1, s2, & s5 replag on labsdb anything that I/you can do something about? I don't know how to troubleshoot beyond seeing what seems like high system load on the boxes
[14:49:42] it is fixed
[14:49:55] I just cannot magically roll it forward
[14:50:02] if it was 40 hours behind
[14:50:20] ok. so it should be slowly catching up now?
[14:50:25] https://tools.wmflabs.org/replag/
[14:50:42] s2 and s4 were 40 hours behind, now they are 20
[14:50:52] improvement!
[14:50:55] s1 has contention, probably due to users
[14:51:00] right
[14:51:08] we can ban certain users so it catches up faster
[14:51:09] working on that, slowly.
[14:51:11] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3597000 (10Papaul) @jcrespo on db2010 I have 5 bad disks, is there any particular order you will want them replaced?
[14:51:13] (temporarily)
[14:54:05] On the plus side, labsdb10{09,10,11} continue to look good. I need to keep working to get traffic to move over to them
[14:56:08] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3597034 (10jcrespo) 5? wow. I would say 1 at a time, and we check they rebuild correctly. Do not necessarily wait, we can do a couple per day when you are around (normally it takes a few hours to rebuild eac...
[14:56:26] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3597035 (10Papaul) p:05Triage>03High
[14:58:15] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587259 (10Papaul) Ok, I have the disk in slot 5 replaced
[15:07:18] bd808: do you know what would help?
[15:07:42] (and relatively easy)
[15:08:02] hack https://tools.wmflabs.org/replag/ so it shows all hosts
[15:08:21] so people see "ok, labsdb1001 is broken, but I can use these other hosts"
[15:09:20] e.g. only one of the 2 old hosts broke, but I am not sure people who need low replication lag realize they can temporarily use a different one
[15:09:26] or they just use quarry
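(Editorial aside on the replag-tool idea just above: a hedged sketch of the kind of per-host query such a tool could run against each replica. It assumes the heartbeat_p.heartbeat view exposed on the Wiki Replicas; the column names shard, last_updated and lag are from memory and should be checked against the actual view before use.)

```
-- Per-shard replication lag as seen from one replica host.
-- heartbeat_p.heartbeat is assumed to exist as on the Wiki Replicas;
-- lag is reported in seconds.
SELECT shard, last_updated, lag
FROM heartbeat_p.heartbeat
ORDER BY shard;
```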
[15:20:28] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3597180 (10eranroz) >>! In T151717#3595278, @Bawolff wrote: > I suspect that fixing this bug will significantly help w...
[15:34:39] jynus: that's a good idea. I'll see if I can work on that a bit today/tomorrow.
[15:39:16] if we add the new hosts at the same time, you do not have to work twice :-)
[15:40:13] Yeah, I've been meaning to add in the 9/10/11 hosts there for a while. Now is probably a good time to figure out how to do all of that.
[16:00:41] 10DBA, 10Data-Services: Significant replication lag for the s1, s2, and s4 wikis on labsdb100[13] - https://phabricator.wikimedia.org/T175487#3597389 (10jcrespo) So the replicas are catching up without problem https://tools.wmflabs.org/replag/ (right now the lag on s2 and s4 is almost down to 5 hours from 40h...
[16:58:31] jynus: I posted a comment on the hangouts chat but then it got closed :) - If the db1046/7 replacements are delivered during the next quarter, will you guys have time (with me helping) to move the eventlogging databases to the new hw? (just checking to see if that can be listed in analytics' goals)
[17:00:40] 10DBA, 10Data-Services: Significant replication lag for the s1, s2, and s4 wikis on labsdb100[13] - https://phabricator.wikimedia.org/T175487#3597682 (10jcrespo) I had to kill a few queries that were blocking the replication - this should only happen one time; but applications should be prepared to fail and ret...
[17:00:55] sure, that is normal maintenance
[17:01:28] I would expect you (as in, analytics) to handle the non-technical bits (announcements, coordinating offline, stopping eventlogging, etc.)
[17:01:40] of course
[17:01:48] the rest is routine
[17:01:53] I'd also be interested in the technical bits if you have patience :D
[17:01:53] or it should be
[17:01:58] please consider stretch
[17:02:03] we have good support
[17:02:17] but it implies 10.1, so we are not doing mediawiki yet
[17:02:25] but we have had it on labs for a long time now
[17:02:54] * elukey nods
[17:03:15] we should upgrade dbstore1002 too
[17:03:22] around the same time
[17:03:29] glad to help if needed
[17:03:30] even if we are not going to replace that yet
[17:03:42] well, the technical bits are hard
[17:03:52] moving users is the hard part for us
[17:04:03] announcements, offline, etc.
[17:04:23] those are large servers so it could take a day to transmit all the data
[17:04:49] it would be nice to have them purged already, however
[17:05:03] to avoid double work
[17:05:16] or leave them in the same state
[17:05:31] as they are now
[17:06:24] feel free to ping me for any task that you need help with, I'd like 1) to get more experience with mysql and 2) to help out with the analytics slaves as much as possible
[17:09:59] thanks a lot! going afk :)
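(Editorial aside on the T175487 comment above about killing queries that were blocking replication: a hedged sketch of how such queries might be found and terminated with standard MySQL/MariaDB statements. The one-hour threshold and the processlist id are illustrative placeholders, not WMF policy.)

```
-- Long-running statements that may be holding back replication
-- (the 3600-second threshold is an illustrative placeholder).
SELECT id, user, time, state, LEFT(info, 120) AS query_snippet
FROM information_schema.PROCESSLIST
WHERE command = 'Query'
  AND time > 3600
ORDER BY time DESC;

-- Terminate one offending statement by its processlist id (12345 is a
-- placeholder). KILL QUERY aborts the statement but keeps the connection;
-- plain KILL <id> drops the whole connection.
KILL QUERY 12345;
```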
[18:45:02] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3598222 (10Halfak) I think the idea is that we'll be able to include wbc_entity_usage to increase granularity in watch...
[20:33:15] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3598520 (10Bawolff) >>! In T151717#3598222, @Halfak wrote: > I think the idea is that we'll be able to include wbc_ent...
[21:43:54] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3598613 (10daniel) >>! In T151717#3598520, @Bawolff wrote: > My hope is that by using fine grained tracking on wbc_ent...
[21:58:55] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3598666 (10Bawolff) I suspect that having more rows in recentchanges is much much worse than having more rows in wbc_e...
[22:01:28] 10DBA, 10Data-Services, 10Goal, 10cloud-services-team (FY2017-18): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T142807#3598701 (10bd808)
[22:11:26] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3598717 (10eranroz) In hewiki we started to use getBestStatements(Q,P) instead of loading the whole entity on 2/9 - so...
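(Editorial aside to make the T151717 statement-group idea concrete: a hedged sketch of queries against a client wiki's wbc_entity_usage table. Column names follow the schema as of 2017 (eu_entity_id, eu_aspect, eu_page_id), the 'C.P31'-style aspect values reflect the per-statement-group tracking proposed in the task, and the entity and property ids are placeholders.)

```
-- Distribution of usage aspects on a client wiki; with statement-group
-- tracking, the bare 'C' (all statements) aspect is replaced by finer
-- values such as 'C.P31'.
SELECT eu_aspect, COUNT(*) AS usages
FROM wbc_entity_usage
GROUP BY eu_aspect
ORDER BY usages DESC
LIMIT 20;

-- Pages that would need re-parsing/notification when P31 statements on
-- Q42 change (Q42 and P31 are placeholder ids); coarse 'C' rows still
-- match, fine-grained rows only match their own statement group.
SELECT eu_page_id
FROM wbc_entity_usage
WHERE eu_entity_id = 'Q42'
  AND eu_aspect IN ('C', 'C.P31');
```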