[06:11:36] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui)
[06:26:00] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui)
[06:26:48] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui) s3 progress: [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1002 [] db1124 [] db1123 [] db1095 [] db1078 [] db1077 [] db1075
[06:48:01] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui)
[06:49:39] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui)
[06:49:49] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui) 05Open→03Resolved Dropped everywhere!
[06:49:54] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui)
[07:17:44] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) @elukey once you've transferred all the files to the definite location, this t...
[07:18:22] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui)
[07:32:11] 10DBA, 10Operations, 10Performance-Team, 10Patch-For-Review: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Marostegui) @aaron @Joe @jcrespo I have made the first small change, to go from 22 to 24 days: https://gerrit.wikimedia.org/r/#/c/operati...
[08:20:47] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change, 10User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (10Marostegui)
[08:27:46] I would increase the TTL once- after all, you are increasing it, not decreasing it, and we can decrease it if storage gets low
[08:28:22] you mean fully to 30 days?
[08:29:11] not feeling strongly about that, so not really pushing for it
[08:29:39] yeah, I talked to _joe_ and we thought about doing it slowly, after all we are not in a super rush
[08:30:05] jynus: btw, may I ask you to take care of one thing if you've got time?
[08:30:14] you won't feel any difference until the 22nd day, and it will be in real time, so are you going to wait 6 months to deploy it?
[08:30:51] tell me
[08:30:54] https://phabricator.wikimedia.org/T214264#4895339
[08:31:40] we are going to buy the replacement of dbstore2002 soon
[08:32:00] yeah, I mean about db2040
[08:32:50] 6 months?
[08:32:55] <_joe_> jynus: first of all, I thought the TTL was used by the purging script, not a property of the db record. So how exactly would we only notice after 22 days?
[08:33:15] the idea was to increase it to 24 and if nothing shows up, go to 30 days
[08:33:17] <_joe_> do I remember incorrectly?
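(On the tag_summary drop logged earlier this morning: a minimal sketch of how one might verify the table is unused on each s3 replica from the progress checklist before removing it. The mysql.py wrapper is the one quoted later in this log, "testwiki" is a placeholder database name, and this is not necessarily the procedure actually followed for T212255.)

# Hedged sketch, not the actual T212255 procedure: confirm the table holds no
# rows on each replica from the checklist above before it is dropped.
# "testwiki" is a placeholder database; mysql.py is assumed from later snippets.
for host in db1075 db1077 db1078 db1095 db1123 db1124 dbstore1002; do
  echo "== $host =="
  mysql.py -h "$host" -e "SELECT COUNT(*) AS remaining_rows FROM testwiki.tag_summary;"
done
# Only once every replica (and the master) reports 0 rows would the table be
# dropped, e.g. with: mysql.py -h <host> -e "DROP TABLE testwiki.tag_summary;"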
[08:33:34] I also thought we'd notice as soon as the script runs, yeah
[08:34:22] <_joe_> my idea is not to raise the TTL unless we get an observable improvement in the cache-hit ratio
[08:34:42] _joe_: see the commits, it is horribly implemented
[08:35:15] (or see the database)
[08:37:43] don't complain to me about that, though!
[08:38:46] _joe_: https://phabricator.wikimedia.org/P8013
[08:42:16] jynus: so according to that, we would see effects as soon as we merge
[08:42:51] well, on the data, but not on the purge
[08:43:08] yeah
[08:43:23] so you will have to wait 1 month to see if the first deploy was ok
[08:43:39] then another month for the next deploy, etc. 2 days at a time?
[08:43:58] no, not two days at a time, as I said, if we all agree, we would go from 24 days to 30
[08:44:20] so one deploy, let's say today, wait a month and then another deploy and that's it
[08:44:55] ok
[08:45:14] not very different from what I was proposing
[08:45:48] No, just a "small" deploy first and then, if it looks good, go back to the original value
[08:46:10] ok
[08:48:01] so, whenever you have time, take a look at the patches
[08:48:06] and I will proceed
[08:49:03] I was going to say I will not +1 them
[08:49:14] but I looked at them and they were ok
[08:49:20] why not +1?
[08:49:26] but I didn't want to vote, so that other people would
[08:49:55] if I do, probably other people won't look at them
[08:50:59] so do you want me to switch over s7-codfw, or do I try to fix it?
[08:51:22] I think failing over s7-codfw is better
[08:51:28] I doubt we will be able to fix the host
[08:51:32] no no
[08:51:37] fix the lagging
[08:51:44] by other means
[08:51:50] it is not lagging
[08:51:51] of dark arts
[08:51:58] WB is forced
[08:52:46] do you want me to pool SSDs as masters on all sections?
[08:52:58] so this doesn't happen again?
[08:53:15] I had to do maintenance on the masters anyway
[08:53:25] Not sure about that, I think we still have to have the conversation about the new hardware on codfw with m.ark
[08:53:53] well, but that would result in almost the same thing
[08:54:28] yeah, but if you pool an SSD host now, we won't have an SSD slave for the next DC failover, if that happens before we refresh hardware
[08:55:15] if we don't have the hardware for the next switch, issues will happen anyway (aka the issues we had last year)
[08:55:34] but the issues we had last year would be even worse without SSD replicas
[08:55:35] and those hosts are likely to fail any time now
[08:55:42] the masters weren't the problem, the slaves were, no?
[08:56:01] yeah, they are slowly failing, that's why I want to push for the conversation to happen
[08:56:06] the masters will be a problem if they are 2h behind?
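(A rough illustration of the "effects on the data, but not on the purge" point above: with the SQL parser cache backend, each row carries an expiry timestamp, so a TTL change shows up in newly written rows as soon as it is deployed, while the purge job only matters the next time it runs. The host and table names below are placeholders, not taken from P8013, and the exptime column is assumed from the generic SqlBagOStuff layout.)

# Illustrative only: inspect when existing parsercache rows are due to expire,
# to gauge what a TTL change means for the data already stored.
# "pc1004" and "pc000" are placeholder host/table names.
mysql.py -h pc1004 -e "
  SELECT DATE(exptime) AS expires_on, COUNT(*) AS keys_expiring
  FROM parsercache.pc000
  GROUP BY DATE(exptime)
  ORDER BY expires_on;"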
[08:56:25] I mean when codfw was active
[08:56:47] I am not really asking
[08:56:59] the masters are a problem if they are 2h behind
[08:57:15] yes, but that is only happening now because of the migration
[08:57:40] what I am saying is that I am fine with an SSD master, but that will be a problem once codfw is active and if we haven't bought HW for codfw by then
[08:57:48] not really, it happens every time- even if someone edits on commons quickly
[08:58:07] what I am saying is we are screwed without hardware anyway :-D
[08:58:28] yeah, but we'd be more screwed if codfw is active
[08:58:31] but at least we will not spam operations-
[08:59:05] I gave my opinion, but I will let you decide 0:-)
[08:59:07] but codfw has not been lagging lately (apart from the migration)
[08:59:13] it has
[09:00:06] sadly no more than 90 days are stored: https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&var-dc=codfw%20prometheus%2Fops&from=1540285192333&to=1548061192333
[09:00:31] and the previous period because I set those as unsafe replication
[09:00:36] keep in mind that some of those were schema changes
[09:01:13] as I said, you decide
[09:02:30] I still think we should go for a non-SSD host on s7 codfw (same as we have in eqiad) and now that I see it…we don't have any SSD on s7 codfw (apart from the sanitarium master and the rc slaves)
[09:03:04] ok
[09:11:04] 10DBA: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 (10jcrespo) a:03jcrespo
[09:50:17] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change, 10User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (10Marostegui)
[10:00:00] Thanks marostegui for https://phabricator.wikimedia.org/T212255 <3
[10:00:12] :)
[10:00:19] thank you for working on it!
[10:13:19] what is the state of dbstore1003:s7 ? (for switchover purposes)
[10:13:29] not set up
[10:13:34] you can ignore it
[10:13:38] thanks, that makes it easier
[10:13:57] oh, it is also eqiad
[10:14:12] yep :)
[10:28:36] but 2047 is also degraded
[10:28:47] Predictive Failure: 1I:1:1
[10:29:06] yeah, it has been like that for a long time
[10:29:10] it hasn't failed yet
[10:29:20] do you really want to switch to that?
[10:29:42] from what I can see it failed jan 8th
[10:29:49] not failed, predictive disk failure
[10:59:07] I've restarted db2047 but will not merge my patch until elukey finishes his deploy
[10:59:15] ah cool
[11:02:09] I am almost done with disabling puppet, sorry, it failed very close to finishing
[11:02:35] (I didn't add -p 95 or similar so it stopped after the first error)
[11:41:58] jynus,marostegui - I am re-enabling puppet on all the dbs, I've run puppet on 4/5 of those and I didn't see any change
[11:42:02] so you are free to go
[11:42:06] great
[11:42:08] thank you!
[11:42:50] thank you for the patience, took a bit more than expected :)
[11:49:21] thanks, elukey
[11:49:26] marostegui: I am back
[11:49:39] (and you are gone :-))
[11:49:41] see you later
[13:28:03] I was gone yeah
[13:28:03] hehe
[13:58:42] 10DBA, 10MediaWiki-Database, 10Core Platform Team Backlog (Watching / External), 10Performance-Team (Radar), 10Wikimedia-Incident: Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real mast... - https://phabricator.wikimedia.org/T172497
[14:02:27] jynus: when do you want to meet up?
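(On the "Predictive Failure: 1I:1:1" state mentioned above for db2047: on HP Smart Array controllers that state can be read from the host itself, roughly as sketched below. The tool name differs between controller generations (hpssacli vs. ssacli) and the slot number is an assumption, so treat this as illustrative rather than the command actually run here.)

# Hedged sketch: list physical drive states on an HP Smart Array controller.
# Drives reported as "Predictive Failure" (like 1I:1:1 above) still work, but
# should be queued for replacement before they actually fail.
sudo hpssacli controller slot=0 physicaldrive all show status
# Newer firmware/tooling ships ssacli instead:
#   sudo ssacli controller slot=0 physicaldrive all show status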
[14:04:16] give me 10 minutes at least
[14:10:35] sure thing
[14:28:01] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10elukey) Done! ` elukey@stat1007:/srv/home/elukey$ sudo -u hdfs hdfs dfs -ls /wmf/data/arc...
[14:28:34] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10elukey)
[16:38:05] https://www.percona.com/blog/2019/01/18/replication-manager-works-with-mariadb/
[16:38:35] repl.pl improved! :p
[16:52:21] 1720-Slot 0 Drive Array - S.M.A.R.T. Hard Drive(s) Detect Imminent Failure: Port 1I: Box 1: Bays 2,10
[16:52:35] all s7 codfw hosts have issues
[16:53:34] which host is that?
[16:53:41] db2061
[16:53:51] https://phabricator.wikimedia.org/T208323
[16:53:53] yeah it is there
[16:54:40] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[16:55:24] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[19:27:17] 10DBA, 10Patch-For-Review: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 (10jcrespo) ` root@cumin2001:~$ ./software/dbtools/section s7 | while read instance; do echo "$instance:"; mysql.py -h $instance -e "show slave status\G" | grep 'Using_Gtid:'; done labsdb1011: labsdb1010: labsdb1009:...
[22:14:47] 10DBA, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Krinkle)
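(The loop quoted in T214264 above checks Using_Gtid for every instance in s7. A small variation of the same pattern, sketched below, pulls the current replication lag instead; the section helper and mysql.py wrapper are taken from that paste, the awk part is an addition.)

# Sketch based on the loop pasted in T214264: report Seconds_Behind_Master for
# every instance in the section instead of Using_Gtid.
./software/dbtools/section s7 | while read instance; do
  lag=$(mysql.py -h "$instance" -e "show slave status\G" \
        | awk '/Seconds_Behind_Master:/ {print $2}')
  echo "$instance: ${lag:-no replication status}"
done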