[05:15:19] <wikibugs>	 DBA: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050#3455334 (Marostegui) >>! In T169050#3454451, @jcrespo wrote: > dbstore1001 is completely broken. It is okay to close this, but declare "At least the two shards that broke are now fixed." w...
[05:16:31] <wikibugs>	 Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3455335 (Marostegui) db1047 is done: ``` root@neodymium:/home/marostegui# for i in `cat s1_tables`; do echo $i; mysql --skip-ssl -hdb1047 enwik...
[05:16:45] <wikibugs>	 Blocked-on-schema-change, DBA, Patch-For-Review: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3455336 (Marostegui)
[07:45:13] <wikibugs>	 DBA, Cloud-Services, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3455456 (Marostegui) s2 has been imported to labsdb1009 and labsdb1010. I will start with labsdb1011 in a bit.  The reason I don't do all the h...
[08:19:21] <wikibugs>	 DBA: duplicate key problems - https://phabricator.wikimedia.org/T151029#3455524 (jcrespo) more errors: ``` tag_summary... ERROR 1062 (23000) at line 1: Duplicate entry '964297507' for key 'ts_rc_id' user... ERROR 1062 (23000) at line 1: Duplicate entry 'XXXXXXXXXXX' for key 'user_name' watchlist... ERROR 106...
[08:41:09] <jynus>	 did you already try the usual trick on db1016?
[08:41:20] <marostegui>	 I did in the morning and it worked
[08:41:25] <marostegui>	 but looks like it failed again
[08:41:27] <marostegui>	 so I have tried again
[08:41:44] <marostegui>	 it failed a couple of hours after it worked
[08:41:45] <jynus>	 yeah, we may want to failover to some of the new hosts- 1001 and 1016
[08:42:05] <jynus>	 we can do without any multiinstance first
[08:42:37] <marostegui>	 yeah, maybe 1001 at least, to have a proper decent slave in case we need to failover db1016
[08:42:48] <jynus>	 yes
[08:42:57] <jynus>	 we have to do some maintenance anyway
[08:43:15] <jynus>	 so we switch the slave
[08:43:19] <marostegui>	 You want me to grab a new host and clone it from db1001?
[08:43:22] <jynus>	 then we failover
[08:43:53] <jynus>	 I almost prefer it to be me
[08:43:58] <marostegui>	 sure :)
[08:44:02] <jynus>	 if you are going away soon
[08:44:10] <jynus>	 so I know everything before failover
[08:44:14] <marostegui>	 sure
[08:44:18] <marostegui>	 that makes sense
[08:44:26] <jynus>	 in fact
[08:44:29] <marostegui>	 I just don't want you to feel I am slacking off :)
[08:44:32] <jynus>	 try not to start anything
[08:44:42] <jynus>	 that cannot be more or less closed
[08:44:47] <marostegui>	 yeah
[08:45:02] <marostegui>	 The only thing I have started is the loading on labsdb1011, which will be finished by sat/sun
[08:45:08] <marostegui>	 so on Monday I can start its replication
[08:45:15] <marostegui>	 Other than that, nothing else will be started
[08:45:32] <marostegui>	 I was googling some docs for the backup proposals :)
[08:46:48] <jynus>	 cool
[08:47:11] <jynus>	 I will try to deploy prometheus today
[08:47:18] <marostegui>	 \o/
[08:48:44] <jynus>	 maybe reloading s1
[08:49:18] <jynus>	 but it all depends on the duplicate key issue going away
[08:49:34] <marostegui>	 yeah, i saw your last update
[08:50:52] <jynus>	 disk usage was literally halved: https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=db2072&var-network=bond0&from=now-7d&to=now
[08:51:21] <jynus>	 and it took 2 days
[08:52:39] <marostegui>	 oh wow
[08:54:08] <paravoid>	 hey guys
[08:54:20] <paravoid>	 there are a few database-related alerts outstanding, are you aware of these?
[08:54:32] <marostegui>	 paravoid: db1016?
[08:54:41] <paravoid>	 that's one of them, yes :)
[08:54:45] <marostegui>	 yeah, we are dealing with that
[08:55:01] <paravoid>	 db1016 writeback, db1015 disk space (silenced but unacknowledged?), pc1006 disk space
[08:55:11] <marostegui>	 It popped up in the morning, it got fixed, and it came back a bit earlier
[08:55:23] <marostegui>	 the others, we are aware yah
[08:55:25] <jynus>	 db1015 is about to be decommed, probably the comment got lost
[08:55:40] <jynus>	 but we need to do something else before doing it
[08:55:54] <jynus>	 so we cannot yet put it down- it is not pooled
[08:56:30] <jynus>	 pc1006 is a problem that platform does not ack, so we were going to take our own measures, but I needed support to ignore them
[08:56:59] <marostegui>	 Was the retention patch merged in the end?
[08:57:03] <jynus>	 yes
[08:57:05] <marostegui>	 The one you submitted from 30 to 22 days?
[08:57:06] <marostegui>	 ah ok
[08:57:08] <paravoid>	 db1016 has a bad battery
[08:57:15] <jynus>	 yes, we know
[08:57:16] <marostegui>	 yeah
[08:57:17] <paravoid>	 just checked out of curiosity :)
[08:57:22] <jynus>	 we were discussing
[08:57:36] <paravoid>	 nod
[08:57:36] <jynus>	 think <=db1050 have to go
[08:57:43] <marostegui>	 paravoid: https://phabricator.wikimedia.org/T166344
[08:57:52] <jynus>	 so we were planning a switchover
[08:58:05] <marostegui>	 if you see the history, you can see that from time to time it pops up
[08:58:10] <jynus>	 paravoid: do not worry too much, because there is automatic failover in place
[08:58:14] <marostegui>	 we force a relearn and then it goes back to normal
[08:58:17] <paravoid>	 jeez, May 25th?
[08:58:30] <marostegui>	 yup :(
[08:58:33] <paravoid>	 we can buy a new battery
[08:58:40] <marostegui>	 that host will be decommed
[08:58:43] <paravoid>	 (or use one from another decom system)
[08:58:52] <marostegui>	 it is m1 master
[08:58:55] <jynus>	 paravoid: it is not happening since then
[08:58:55] <marostegui>	 so not easy to replace
[08:59:03] <jynus>	 it got fixed and happened yesterday again
[08:59:04] <marostegui>	 that is why we were discussing a switchover
[08:59:14] <marostegui>	 so we can throw that host away
[08:59:15] <paravoid>	 aha
[09:37:54] <wikibugs>	 DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3455737 (Marostegui) So for the record, after: T166344#3455435 we got: ``` ˜/icinga-wm 9:04> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ```  The...
[10:48:05] <wikibugs>	 DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3455871 (Marostegui) ``` ˜/icinga-wm 12:44> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ```
[11:19:34] <moritzm>	 jynus, marostegui: ok to install apache sec updates on dbmonitor* ATM?
[11:23:20] <marostegui>	 one sec - meeting
[11:25:13] <moritzm>	 sure
[12:10:33] <jynus>	 yes
[12:10:50] <jynus>	 dbmonitor is tendril- not critical
[12:12:30] <jynus>	 gone for lunch
[12:14:54] <moritzm>	 k, updating now
[13:53:15] <wikibugs>	 DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3456280 (Marostegui) As we discussed, it would be a good idea to do a switchover and get rid of this host, at least as a master of m1.  I have two proposals about how we can do it.  #1  Take...
[14:34:47] <wikibugs>	 DBA, Cloud-Services, Toolforge: p50380g50816__pop_stats (popularpages) using 53G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133326#2228235 (Marostegui) What should we do with this ticket?
[14:45:53] <wikibugs>	 DBA, Cloud-Services, Toolforge: p50380g50816__pop_stats (popularpages) using 53G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133326#3456453 (bd808) @kaldari {T118508} was declined (not for great reasons IMO, but whatever), but I thought that the rate limits were raised so that https:/...
[16:06:23] <wikibugs>	 DBA, Operations, Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3456828 (Marostegui)
[16:06:25] <wikibugs>	 DBA: Implement slave_run_triggers_for_rbr at sanitarium for labs filtering - https://phabricator.wikimedia.org/T121207#3456825 (Marostegui) Open>Resolved a:jcrespo I am going to resolve this as this has been implemented on the new sanitariums servers (db1095 and db1102). db1069 will not be migrat...
[16:58:46] <wikibugs>	 DBA, Cloud-Services, Toolforge: p50380g50816__pop_stats (popularpages) using 53G on labsdb1001 (enwiki) - https://phabricator.wikimedia.org/T133326#3457049 (kaldari) I just deleted everything older than 2014, which was about half the tables.
[17:17:37] <wikibugs>	 DBA, CirrusSearch, Discovery, Discovery-Search, and 3 others: archive table needs index starting with timestamp - https://phabricator.wikimedia.org/T164975#3252901 (debt) Hi @jcrespo - is this something you'll be able to do? Thanks!
[20:07:18] <wikibugs>	 DBA, Collaboration-Team-Triage, Notifications, Schema-change: Review new Echo table for user group expiration - https://phabricator.wikimedia.org/T168107#3356374 (kaldari) @Mattflaschen-WMF: Any update on the delay mechanism for T2582?
[20:14:25] <wikibugs>	 DBA, Collaboration-Team-Triage, Notifications, Schema-change: Review new Echo table for user group expiration - https://phabricator.wikimedia.org/T168107#3457774 (Mattflaschen-WMF) >>! In T168107#3457750, @kaldari wrote: > @Mattflaschen-WMF: Any update on the delay mechanism for T2582?  Not yet (...
[20:43:10] <wikibugs>	 DBA, Wikimedia-Site-requests: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3281974 (Dereckson)
[21:55:33] <wikibugs>	 DBA, RESTBase-API, Reading List Service, ArchCom-RfC (ArchCom-Approved), and 4 others: RfC: Reading List service - https://phabricator.wikimedia.org/T164990#3458249 (Tgr) @Anomie: thanks for the review! Sorry for taking so long to get back, got distracted by another project.  >! In T164990#335143...
[22:15:30] <wikibugs>	 DBA, Operations: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3458384 (MF-Warburg) Thanks for this reply. I think that settles it.
[22:15:45] <wikibugs>	 DBA, Operations: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3458388 (MF-Warburg) Open>Resolved a:MF-Warburg