[05:15:19] DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3455334 (Marostegui) >>! In T169050#3454451, @jcrespo wrote: > dbstore1001 is completely broken. It is okay to close this, but declare "At least the two shards that broke are now fixed." w...
[05:16:31] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3455335 (Marostegui) db1047 is done: ``` root@neodymium:/home/marostegui# for i in `cat s1_tables`; do echo $i; mysql --skip-ssl -hdb1047 enwik...
[05:16:45] Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3455336 (Marostegui)
[07:45:13] DBA, Cloud-Services, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3455456 (Marostegui) s2 has been imported to labsdb1009 and labsdb1010. I will start with labsdb1011 in a bit. The reason I don't do all the h...
[08:19:21] DBA: duplicate key problems - https://phabricator.wikimedia.org/T151029#3455524 (jcrespo) more errors: ``` tag_summary... ERROR 1062 (23000) at line 1: Duplicate entry '964297507' for key 'ts_rc_id' user... ERROR 1062 (23000) at line 1: Duplicate entry 'XXXXXXXXXXX' for key 'user_name' watchlist... ERROR 106...
[08:41:09] did you already try the usual trick on db1016?
[08:41:20] I did in the morning and it worked
[08:41:25] but looks like it failed again
[08:41:27] so I have tried again
[08:41:44] it failed a couple of hours after it worked
[08:41:45] yeah, we may want to failover to some of the new hosts- 1001 and 1016
[08:42:05] we can do without any multiinstance first
[08:42:37] yeah, maybe 1001 at least, to have a proper decent slave in case we need to failover db1016
[08:42:48] yes
[08:42:57] we have to do some maintenance anyway
[08:43:15] so we switch the slave
[08:43:19] You want me to grab a new host and clone it from db1001?
[08:43:22] then we failover
[08:43:53] I almost prefer it to be me
[08:43:58] sure :)
[08:44:02] if you are going away soon
[08:44:10] so I know everything before failover
[08:44:14] sure
[08:44:18] that makes sense
[08:44:26] in fact
[08:44:29] I just don't want you to feel I am slacking off :)
[08:44:32] try not to start anything
[08:44:42] that cannot be more or less closed
[08:44:47] yeah
[08:45:02] The only thing I have started is the loading on labsdb1011, which will be finished by sat/sun
[08:45:08] so on Monday I can start its replication
[08:45:15] Other than that, nothing else will be started
[08:45:32] I was doing some googling docs for the backups proposals :)
[08:46:48] cool
[08:47:11] I will try to deploy prometheus today
[08:47:18] \o/
[08:48:44] maybe reloading s1
[08:49:18] but it all depends on the duplicate key issue going away
[08:49:34] yeah, I saw your last update
[08:50:52] disk usage was literally halved: https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=db2072&var-network=bond0&from=now-7d&to=now
[08:51:21] and it took 2 days
[08:52:39] oh wow
[08:54:08] hey guys
[08:54:20] there are a few database-related alerts outstanding, are you aware of these?
[08:54:32] paravoid: db1016?
[08:54:41] that's one of them, yes :)
[08:54:45] yeah, we are dealing with that
[08:55:01] db1016 writeback, db1015 disk space (silenced but unacknowledged?), pc1006 disk space
[08:55:11] It popped up in the morning, it got fixed, and it came back a bit earlier
[08:55:23] the others, we are aware yah
[08:55:25] db1015 is an about to decom server, probably the comment got lost
[08:55:40] but we need to do something else before doing it
[08:55:54] so we cannot yet put it down- it is not pooled
[08:56:30] pc1006 is a problem that platform does not ack, so we were going to take our own measures, but I needed support to ignore them
[08:56:59] Was the retention patch merged in the end?
[08:57:03] yes
[08:57:05] The one you submitted from 30 to 22 days?
[08:57:06] ah ok
[08:57:08] db1016 has a bad battery
[08:57:15] yes, we know
[08:57:16] yeah
[08:57:17] just checked out of curiosity :)
[08:57:22] we were discussing
[08:57:36] nod
[08:57:36] think <=db1050 have to go
[08:57:43] paravoid: https://phabricator.wikimedia.org/T166344
[08:57:52] so we were planning a switchover
[08:58:05] if you see the history, you can see that from time to time it pops up
[08:58:10] paravoid: do not worry too much, because there is automatic failover in place
[08:58:14] we force a relearn and then it gets back to normal
[08:58:17] jeez, May 25th?
[08:58:30] yup :(
[08:58:33] we can buy a new battery
[08:58:40] that host will be decommed
[08:58:43] (or use one from another decom system)
[08:58:52] it is m1 master
[08:58:55] paravoid: it is not happening since then
[08:58:55] so not easy to replace
[08:59:03] it got fixed and happened yesterday again
[08:59:04] that is why we were discussing a switchover
[08:59:14] so we can throw that host away
[08:59:15] aha
[09:37:54] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3455737 (Marostegui) So for the record, after: T166344#3455435 we got: ``` ˜/icinga-wm 9:04> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` The...
[10:48:05] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3455871 (Marostegui) ``` ˜/icinga-wm 12:44> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ```
[11:19:34] jynus, marostegui: ok to install apache sec updates on dbmonitor* ATM?
[11:23:20] one sec - meeting
[11:25:13] sure
[12:10:33] yes
[12:10:50] dbmonitor is tendril- not critical
[12:12:30] gone for lunch
[12:14:54] k, updating now
[13:53:15] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3456280 (Marostegui) As we discussed, it would be a good idea to do a switchover and get rid of this host, at least as a master of m1. I have two proposals about how we can do it. #1 Take...
[14:34:47] DBA, Cloud-Services, Toolforge: p50380g50816__pop_stats (popularpages) using 53G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133326#2228235 (Marostegui) What should we do with this ticket?
[14:45:53] DBA, Cloud-Services, Toolforge: p50380g50816__pop_stats (popularpages) using 53G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133326#3456453 (bd808) @kaldari {T118508} was declined (not for great reasons IMO, but whatever), but I thought that the rate limits were raised so that https:/...
[16:06:23] DBA, Operations, Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3456828 (Marostegui)
[16:06:25] DBA: Implement slave_run_triggers_for_rbr at sanitarium for labs filtering - https://phabricator.wikimedia.org/T121207#3456825 (Marostegui) Open>Resolved a:jcrespo I am going to resolve this as this has been implemented on the new sanitariums servers (db1095 and db1102). db1069 will not be migrat...
[16:58:46] DBA, Cloud-Services, Toolforge: p50380g50816__pop_stats (popularpages) using 53G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133326#3457049 (kaldari) I just deleted everything older than 2014, which was about half the tables.
[17:17:37] DBA, CirrusSearch, Discovery, Discovery-Search, and 3 others: archive table needs index starting with timestamp - https://phabricator.wikimedia.org/T164975#3252901 (debt) Hi @jcrespo - is this something you'll be able to do? Thanks!
[20:07:18] DBA, Collaboration-Team-Triage, Notifications, Schema-change: Review new Echo table for user group expiration - https://phabricator.wikimedia.org/T168107#3356374 (kaldari) @Mattflaschen-WMF: Any update on the delay mechanism for T2582?
[20:14:25] DBA, Collaboration-Team-Triage, Notifications, Schema-change: Review new Echo table for user group expiration - https://phabricator.wikimedia.org/T168107#3457774 (Mattflaschen-WMF) >>! In T168107#3457750, @kaldari wrote: > @Mattflaschen-WMF: Any update on the delay mechanism for T2582? Not yet (...
[20:43:10] DBA, Wikimedia-Site-requests: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3281974 (Dereckson)
[21:55:33] DBA, RESTBase-API, Reading List Service, ArchCom-RfC (ArchCom-Approved), and 4 others: RfC: Reading List service - https://phabricator.wikimedia.org/T164990#3458249 (Tgr) @Anomie: thanks for the review! Sorry for taking so long to get back, got distracted by another project. >! In T164990#335143...
[22:15:30] DBA, Operations: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3458384 (MF-Warburg) Thanks for this reply. I think that settles it.
[22:15:45] DBA, Operations: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3458388 (MF-Warburg) Open>Resolved a:MF-Warburg