[07:56:31] DBA, Operations, Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4055677 (Marostegui)
[07:57:29] DBA, Operations, Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (Marostegui)
[07:57:52] DBA, Operations, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (Marostegui)
[11:02:31] marostegui: sorry, problems with the browser
[11:02:37] no worries
[11:03:16] I need to do 2FA on chrome, it seems
[11:04:02] uh?
[11:04:07] ah, when logging into chrome?
[11:04:27] no, I think hangouts finally stopped working for firefox
[13:12:16] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056282 (jcrespo) Resolved→Open I believe I found the original issue- we recently did a dns failover, and when we do that...
[13:29:52] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056334 (Magnus) I, for one, have no idea what s51184 is. It doesn't show in the Toolforge "local" database list: ```echo 'show d...
[13:33:15] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056346 (jcrespo) It is the tools.catfood numeric id; I only have visibility of db accounts on my layer.
[13:33:51] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056352 (jcrespo) I use https://tools.wmflabs.org/contact/ to find the relation between accounts and owners.
[13:36:01] DBA, Data-Services: "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers - https://phabricator.wikimedia.org/T180380#4056361 (jcrespo) Open→Resolved > That would imply it is still a Toolforge bug Ok, I will ask someone else.
[13:55:52] jynus: marostegui: One thing: another change actually got deployed yesterday that writes zero instead of large numbers into term_entity_id in wb_terms. The plan is to drop this column completely
[13:56:16] I'm not sure if writing zeros will help anything for now
[13:56:53] if it is an int, not really, unless zero has a meaning of deletion for the application
[13:57:01] but no impact on storage itself
[13:57:15] well, I guess it could get slightly better compression
[13:57:25] but the difference would be very small
[13:59:52] yeah, I guessed so too
[14:00:09] it matters for varchars only
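The int-vs-varchar exchange above can be made concrete. A minimal sketch, assuming a scratch schema named `test`; the table is modeled loosely on wb_terms and is not the production definition:

```
# Zeroing a fixed-width INT rewrites the value in place: the column
# occupies 4 bytes whether it holds 0 or 12345678, so no row space is
# freed (page compression may squeeze runs of zeros slightly, but the
# gain is tiny). Only VARCHAR-like columns shrink when emptied, since
# they are stored as length + bytes.
mysql -e "
  CREATE TABLE test.wb_terms_demo (
    term_entity_id INT UNSIGNED NOT NULL,   -- fixed 4 bytes either way
    term_text      VARBINARY(255) NOT NULL  -- variable: '' costs ~1 byte
  ) ENGINE=InnoDB;
  INSERT INTO test.wb_terms_demo VALUES (12345678, 'Berlin'), (0, 'Berlin');"
```

Both rows take the same space for term_entity_id, which is why only dropping the column (the stated plan) actually reclaims storage.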
[14:16:27] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (aborrero) I just updated wikireplica DNS records: ``` root@labcontrol1001:~# /usr/local/sbin/wikireplica_dns --aliases -v --zone web.db.svc.eqiad.wmflabs. 2018-03-16T14...
[14:25:01] DBA, Cloud-Services, Operations, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056501 (chasemp) Resolved→Open @andrew tried to merge the change to allow nova to be more gracious and it didn't work out. https://ph...
[14:49:51] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4056565 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1011.eqiad.wmnet'] ``` The log can be found in `/var/...
[14:59:23] Another thing: https://grafana.wikimedia.org/dashboard/db/wikidata-change-propagation?orgId=1 says the number of rc records injected from Wikidata into client wikis (enwiki, ruwiki, etc.) has dropped from 300/min (total across all of them) to 100/min. Optimizing the RC tables everywhere, probably in early April, will gain you a lot. There is already a phab card for that. You can take a look at some wikis to see how much it affected them
[15:00:46] I think we already did that, at least on the largest ones: ruwiki, commons
[15:01:07] we can check the others at some point :-)
[15:01:43] I highly recommend hywiki, 90% of their records were also from wikidata
[15:03:47] I think I gathered a list
[15:04:40] https://phabricator.wikimedia.org/T178290#3688203
[15:05:34] I indeed skipped hywiki
[15:05:49] oh, I didn't
[15:06:08] it was just ordered by section, not in alphabetical order
[15:09:47] jynus: for ukwiki it's now 26%. I can check other wikis too. I'd love it if you just shrank those; that would free up lots of space for you right now \o/
[15:11:28] note disk space is not such a huge issue (it affects maintenance time, but that is not as important) as API queries and watchlists being affected in performance
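The hywiki/ukwiki percentages above come from counting which recentchanges rows were injected by Wikibase. A hedged way to reproduce such a number on a replica (the host name is a placeholder; Wikibase tags its injected rows with rc_source = 'wb' in current MediaWiki, assumed to apply here):

```
# Rows per source in recentchanges; dividing the 'wb' count by the
# total gives the share of Wikidata-injected records (e.g. ~26% on
# ukwiki, ~90% on hywiki per the discussion above).
mysql -h db1xxx.eqiad.wmnet -e "
  SELECT rc_source, COUNT(*) AS rows_per_source
  FROM hywiki.recentchanges
  GROUP BY rc_source
  ORDER BY rows_per_source DESC;"
```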
[15:14:42] DBA, Cloud-Services, Operations, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056680 (Andrew) Open→Resolved a: Andrew >My suggestion is to close this without touching nova Works for me!
[15:16:12] DBA, Cloud-Services, Operations, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056683 (jcrespo) Don't celebrate too hard yet, as it will increase the chances of the issue happening again :-D
[15:18:08] DBA: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290#4056687 (Ladsgroup) With the many changes that happened to usage tracking, we are now injecting several times fewer changes into the rc table. I think this ticket should be closed...
[15:18:24] DBA: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290#4056689 (jcrespo)
[15:18:26] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#4056688 (jcrespo)
[15:18:33] jynus: Can we see if that impacted the response time of watchlist/rc queries?
[15:20:25] DBA: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290#4056697 (jcrespo) I don't see it as closed, this was going to be used to optimize recentchanges and contains very interesting information to see which to focus first on the c...
[15:21:07] Amir1: yes, that is something we should check, as the ticket proposed
[15:21:29] it will be directly related to the rate of errors on logstash
[15:22:10] although on smaller wikis, degradation will have a lesser impact
[15:23:38] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4056717 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1011.eqiad.wmnet'] ``` and were **ALL** successful.
[15:23:48] DBA: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4056718 (Marostegui) a: Marostegui I have checksummed m2 and it is fine. We can proceed and decomm this server once the weekend has passed and we are sure the master is fine.
[15:24:10] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#4056720 (jcrespo)
[15:24:14] DBA, MediaWiki-Watchlist, Wikidata, Patch-For-Review, Russian-Sites: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#4056721 (jcrespo)
[15:25:16] DBA, Wikidata: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521#3977606 (jcrespo) probably commonswiki and ruwiki were done at T177772. The list of the ones scheduled to do next was at T178290#3688203
[15:28:24] marostegui: thanks for the work
[15:28:37] I did nothing, you did all the hard work :-)
[15:28:38] should we keep a copy, like with the others?
[15:28:44] yeah, I want to do that
[15:28:58] if it was fine, it can be fully logical
[15:29:15] yeah
[15:29:19] es2001 maybe?
[15:29:26] yeah, /older
[15:29:30] cool!
[15:29:39] or whatever name I made up
[15:29:46] DBA: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4056730 (Marostegui)
[15:29:57] - if only there were a script to create such backups!
[15:30:02] yeah, either older or archive
[15:30:07] hahaha
[15:30:20] I put in older the ones that are not automatically managed
[15:30:28] Ah - I will do that then :)
[15:30:32] nor based on time
[15:30:39] maybe older wasn't the right name
[15:30:50] they were older at the time :-)
[15:31:08] it is probably confusing with archive
[15:31:37] you can mv the older dir to another name, it is manually handled
[15:34:53] du -sh
[15:35:50] it didn't take 12 hours to do otrs for you?
[15:36:21] it was strange for me because the first time it said 1 hour and the next time 12h
[15:37:13] sorry if I am confusing masters
[15:40:21] No, it said 1h for otrs and that's about what it took
[15:40:42] not sure how it does the estimation
[15:40:52] but it took around 1h, which is what it predicted
[15:41:24] but I also noticed that replication had been broken on db1020 since yesterday, so maybe that is why it said 12h
[15:42:00] it was a silly replace on the mysql database for a default_role column, which exists in 10.1 but not in 10.0, so I guess the first iteration of pt-table-checksum you ran broke it, and then that's why the estimation kept growing
[15:42:06] that is my theory
[15:55:02] oh
[15:55:08] I didn't notice that, sorry
[15:55:19] no worries, it wasn't a big deal
[15:55:27] the same thing happened with m5, it is now catching up
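A sketch of the breakage theorized at 15:42 and one standard way around it; these are stock Percona Toolkit flags, but the DSN is a placeholder, not the real m2 setup:

```
# pt-table-checksum replicates its checksum statements. Checksumming
# mysql.user on a 10.1 master produces a statement that references the
# default_role column; a 10.0 replica has no such column, the statement
# fails there, and replication stops. Skipping the system schema when
# replicas run a different major version avoids it:
pt-table-checksum h=m2-master.eqiad.wmnet,u=checksum_user,p=... \
    --replicate=percona.checksums \
    --ignore-databases=mysql,percona
```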
[15:56:31] not sure I want to do a backup es2001 -> db1020
[15:58:19] the other way round?
[15:58:34] you can do it as you wish
[15:58:42] whatever works for you
[15:58:43] Yeah, it is going to be the same really
[15:58:55] at the end of the day it needs to be transferred to the other dc
[17:33:16] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057152 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1010.eqiad.wmnet'] ``` The log can be found in `/var/...
[18:07:24] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057260 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1010.eqiad.wmnet'] ``` and were **ALL** successful.
[18:20:53] DBA, Operations, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4057315 (jcrespo)
[18:20:58] DBA, Operations, Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#4057316 (jcrespo)
[18:21:03] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057311 (jcrespo) Open→Resolved a: jcrespo With today's reimage/restart of dbproxy1009, 10 and 11, this should now be 100% done.
[18:48:57] DBA, Operations, Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057377 (Marostegui) Very nice work! :-)
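For reference, the fully logical copy of db1020's m2 data discussed at 15:28-15:58 could look roughly like this. Illustrative only: this is not the backup script joked about above, and the paths and the /older destination directory are assumptions drawn from the conversation:

```
# Dump m2 from db1020 into the manually-managed "older" directory...
mysqldump -h db1020.eqiad.wmnet --single-transaction --all-databases \
    | gzip -c > /srv/backups/older/db1020.m2.$(date +%F).sql.gz
# ...then ship it cross-DC, per "it needs to be transferred to the
# other dc":
rsync -av /srv/backups/older/db1020.m2.*.sql.gz \
    es2001.codfw.wmnet:/srv/backups/older/
```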