[07:56:31] DBA, Operations, Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4055677 (Marostegui)
[07:57:29] DBA, Operations, Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (Marostegui)
[07:57:52] DBA, Operations, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (Marostegui)
[11:02:31] marostegui: sorry, problems with the browser
[11:02:37] no worries
[11:03:16] I need to do 2FA on chrome, it seems
[11:04:02] uh?
[11:04:07] ah, when logging into chrome?
[11:04:27] no, I think hangouts finally stopped working for firefox
[13:12:16] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056282 (jcrespo) Resolved→Open I believe I found the original issue- we recently did a dns failover, and when we do that...
[13:29:52] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056334 (Magnus) I, for one, have no idea what s51184 is. It doesn't show in the Toolforge "local" database list: ```echo 'show d...
[13:33:15] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056346 (jcrespo) It is the tools.catfood numeric id; I only have visibility of db accounts on my layer.
[13:33:51] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056352 (jcrespo) I use https://tools.wmflabs.org/contact/ to find the relation between accounts and owners.
[13:36:01] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056361 (jcrespo) Open→Resolved > That would imply it is still a Toolforge bug Ok, I will ask someone else.
[13:55:52] jynus: marostegui: One thing: another change actually got deployed yesterday that writes zero instead of large numbers into term_entity_id in wb_terms. The plan is to drop this column completely
[13:56:16] I'm not sure if writing zeros will help anything for now
[13:56:53] if it is an int, not really, unless zero has a meaning of deletion for the application
[13:57:01] but no impact on storage itself
[13:57:15] well, I guess it could get slightly better compression
[13:57:25] but the difference would be very small
[13:59:52] yeah, I guessed so too
[14:00:09] it matters for varchars only
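The int-vs-varchar exchange above can be made concrete. A minimal sketch, assuming a scratch schema named `test`; the table is modeled loosely on wb_terms and is not the production definition:

```
# Zeroing a fixed-width INT rewrites the value in place: the column
# occupies 4 bytes whether it holds 0 or 12345678, so no row space is
# freed (page compression may squeeze runs of zeros slightly, but the
# gain is tiny). Only VARCHAR-like columns shrink when emptied, since
# they are stored as length + bytes.
mysql -e "
  CREATE TABLE test.wb_terms_demo (
    term_entity_id INT UNSIGNED NOT NULL,   -- fixed 4 bytes either way
    term_text      VARBINARY(255) NOT NULL  -- variable: '' costs ~1 byte
  ) ENGINE=InnoDB;
  INSERT INTO test.wb_terms_demo VALUES (12345678, 'Berlin'), (0, 'Berlin');"
```

Both rows take the same space for term_entity_id, which is why only dropping the column (the stated plan) actually reclaims storage.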
[14:16:27] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (aborrero) I just updated wikireplica DNS records: ``` root@labcontrol1001:~# /usr/local/sbin/wikireplica_dns --aliases -v --zone web.db.svc.eqiad.wmflabs. 2018-03-16T14...
[14:25:01] DBA, Cloud-Services, Operations, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056501 (chasemp) Resolved→Open @andrew tried to merge the change to allow nova to be more gracious and it didn't work out. https://ph...
[14:49:51] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4056565 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1011.eqiad.wmnet'] ``` The log can be found in `/var/...
[14:59:23] Another thing: https://grafana.wikimedia.org/dashboard/db/wikidata-change-propagation?orgId=1 says the number of rc records injected from Wikidata into client wikis (enwiki, ruwiki, etc.) has dropped from 300/min (total across all of them) to 100/min. Optimizing the RC tables everywhere, probably in early April, will gain you a lot. There is already a phab card for that. You can take a look at some wikis to see how much it affected them
[15:00:46] I think we already did that, at least on the largest ones: ruwiki, commons
[15:01:07] we can check the others at some point :-)
[15:01:43] I highly recommend hywiki, 90% of their records were also from wikidata
[15:03:47] I think I gathered a list
[15:04:40] https://phabricator.wikimedia.org/T178290#3688203
[15:05:34] I indeed skipped hywiki
[15:05:49] oh, I didn't
[15:06:08] it was just ordered by section, not in alphabetical order
[15:09:47] jynus: for ukwiki it's now 26%. I can check other wikis too. I'd love it if you just shrank those; that would free up lots of space for you right now \o/
[15:11:28] note disk space is not such a huge issue (it affects maintenance time, but that is not as important) as API queries and watchlists being affected in performance
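The hywiki/ukwiki percentages above come from counting which recentchanges rows were injected by Wikibase. A hedged way to reproduce such a number on a replica (the host name is a placeholder; Wikibase tags its injected rows with rc_source = 'wb' in current MediaWiki, assumed to apply here):

```
# Rows per source in recentchanges; dividing the 'wb' count by the
# total gives the share of Wikidata-injected records (e.g. ~26% on
# ukwiki, ~90% on hywiki per the discussion above).
mysql -h db1xxx.eqiad.wmnet -e "
  SELECT rc_source, COUNT(*) AS rows_per_source
  FROM hywiki.recentchanges
  GROUP BY rc_source
  ORDER BY rows_per_source DESC;"
```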
[15:14:42] DBA, Cloud-Services, Operations, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056680 (Andrew) Open→Resolved a: Andrew >My suggestion is to close this without touching nova Works for me!
[15:16:12] DBA, Cloud-Services, Operations, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056683 (jcrespo) Don't celebrate too hard yet, as it will increase the chances of the issue happening again :-D
[15:18:08] DBA: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290#4056687 (Ladsgroup) With the many changes that happened to usage tracking, we are now injecting several times fewer changes into the rc table. I think this ticket should be closed...
[15:18:24] DBA: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290#4056689 (jcrespo)
[15:18:26] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#4056688 (jcrespo)
[15:18:33] jynus: Can we see if that impacted the response time of watchlist/rc queries?
[15:20:25] DBA: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290#4056697 (jcrespo) I don't see it as closed, this was going to be used to optimize recentchanges and contains very interesting information to see which to focus first on the c...
[15:21:07] Amir1: yes, that is something we should check, as the ticket proposed
[15:21:29] it will be directly related to the rate of errors on logstash
[15:22:10] although on smaller wikis, degradation will have a lesser impact
[15:23:38] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4056717 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1011.eqiad.wmnet'] ``` and were **ALL** successful.
[15:23:48] DBA: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4056718 (Marostegui) a: Marostegui I have checksummed m2 and it is fine. We can proceed and decomm this server once the weekend has passed and we are sure the master is fine.
[15:24:10] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#4056720 (jcrespo)
[15:24:14] DBA, MediaWiki-Watchlist, Wikidata, Patch-For-Review, Russian-Sites: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#4056721 (jcrespo)
[15:25:16] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#3977606 (jcrespo) probably commonswiki and ruwiki were done at T177772. The list of the ones scheduled to do next was at T178290#3688203
[15:28:24] marostegui: thanks for the work
[15:28:37] I did nothing, you did all the hard work :-)
[15:28:38] should we keep a copy, like with the others?
[15:28:44] yeah, I want to do that
[15:28:58] if it was fine, it can be fully logical
[15:29:15] yeah
[15:29:19] es2001 maybe?
[15:29:26] yeah, /older
[15:29:30] cool!
[15:29:39] or whatever name I made up
[15:29:46] DBA: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4056730 (Marostegui)
[15:29:57] - if only there were a script to create such backups!
[15:30:02] yeah, either older or archive
[15:30:07] hahaha
[15:30:20] I put in older the ones that are not automatically managed
[15:30:28] Ah - I will do that then :)
[15:30:32] nor based on time
[15:30:39] maybe older wasn't the right name
[15:30:50] they were older at the time :-)
[15:31:08] it is probably confusing with archive
[15:31:37] you can mv the older dir to another name, it is manually handled
[15:34:53] du -sh
[15:35:50] it didn't take 12 hours to do otrs for you?
[15:36:21] it was strange for me because the first time it said 1 hour and the next time 12h
[15:37:13] sorry if I am confusing masters
[15:40:21] No, it said 1h for otrs and that's about what it took
[15:40:42] not sure how it does the estimation
[15:40:52] but it took around 1h, which is what it predicted
[15:41:24] but I also noticed that replication had been broken on db1020 since yesterday, so maybe that is why it said 12h
[15:42:00] it was a silly replace on the mysql database for a default_role column, which exists in 10.1 but not in 10.0, so I guess the first iteration of pt-table-checksum you ran broke it, and then that's why the estimation kept growing
[15:42:06] that is my theory
[15:55:02] oh
[15:55:08] I didn't notice that, sorry
[15:55:19] no worries, it wasn't a big deal
[15:55:27] the same thing happened with m5, it is now catching up
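A sketch of the breakage theorized at 15:42 and one standard way around it; these are stock Percona Toolkit flags, but the DSN is a placeholder, not the real m2 setup:

```
# pt-table-checksum replicates its checksum statements. Checksumming
# mysql.user on a 10.1 master produces a statement that references the
# default_role column; a 10.0 replica has no such column, the statement
# fails there, and replication stops. Skipping the system schema when
# replicas run a different major version avoids it:
pt-table-checksum h=m2-master.eqiad.wmnet,u=checksum_user,p=... \
    --replicate=percona.checksums \
    --ignore-databases=mysql,percona
```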
[15:56:31] not sure I want to do a backup es2001 -> db1020
[15:58:19] the other way round?
[15:58:34] you can do it as you wish
[15:58:42] whatever works for you
[15:58:43] Yeah, it is going to be the same really
[15:58:55] at the end of the day it needs to be transferred to the other dc
[17:33:16] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057152 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1010.eqiad.wmnet'] ``` The log can be found in `/var/...
[18:07:24] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057260 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1010.eqiad.wmnet'] ``` and were **ALL** successful.
[18:20:53] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4057315 (jcrespo)
[18:20:58] DBA, Operations, Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#4057316 (jcrespo)
[18:21:03] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057311 (jcrespo) Open→Resolved a: jcrespo With today's reimage/restart of dbproxy1009, 10 and 11, this should now be 100% done.
[18:48:57] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057377 (Marostegui) Very nice work! :-)
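For reference, the fully logical copy of db1020's m2 data discussed at 15:28-15:58 could look roughly like this. Illustrative only: this is not the backup script joked about above, and the paths and the /older destination directory are assumptions drawn from the conversation:

```
# Dump m2 from db1020 into the manually-managed "older" directory...
mysqldump -h db1020.eqiad.wmnet --single-transaction --all-databases \
    | gzip -c > /srv/backups/older/db1020.m2.$(date +%F).sql.gz
# ...then ship it cross-DC, per "it needs to be transferred to the
# other dc":
rsync -av /srv/backups/older/db1020.m2.*.sql.gz \
    es2001.codfw.wmnet:/srv/backups/older/
```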