[06:34:14] 10DBA, 10Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (10Marostegui) As Jaime pointed out, we have the upgrade in a few days, and if we keep observing this, we can also consider failing over the proxy itself to another host too.
[06:45:46] 10DBA, 10Data-Services, 10Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Marostegui) >>! In T246970#5969163, @Jdx wrote: > @zhuyifei1999: But why? The query used to execute in 11 minutes max. Is it a congestion issue, as Mike Peel suspects? It cou...
[07:02:00] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui)
[08:13:51] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui)
[08:14:03] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) p:05Triage→03Medium
[08:20:02] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui)
[08:20:20] yesterday dbprov1001 almost got full
[08:20:53] needs readjusting (moving some data to dbprov[12]002)
[08:21:17] all of a sudden, or are there graphs showing that growth?
[08:21:31] but I may need to reduce the retention a bit more (but keeping the longer term ones on bacula)
[08:22:17] in fact, it may have hit 100% again
[08:22:28] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui) Some of these hosts had events disabled due to T247728. I have checked and enabled the ones missing.
[08:22:44] ah, no, that is utilization, that thing we don't care about
[08:23:01] marostegui: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=dbprov1001&var-instance=dbprov1002&fullscreen&panelId=521&from=1584260575130&to=1584346975130
[08:23:25] it ended up with 5 instances of a backup
[08:23:47] because it was dumping s4 and s8 at the same time
[08:24:13] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) I have checked and enabled them on: ` db1107 db2085:3311 db1103:3312 db1103:3314 db2125 db1078 db2109 db2084:3314 db1096:3315 db2084:3315 db1096:3316 db1098:3316 db211...
[08:24:23] oh, s4 and s8 at the same time sounds painful yeah :(
[08:24:32] maybe move one of them out?
[08:24:32] so it pushed past the expected 76%
[08:24:49] although s8 will decrease soonish as we are about to be able to drop wb_terms
[08:24:51] + 2 uncompressed snapshots
[08:25:01] yeah, it is a combination of reasons
[08:25:18] addshore or Amir1: any ETA for dropping wb_terms? Just to know whether it will take 1 month, 2 or 2 weeks. Not pushing anything
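(Sketch for T247728 above: how events left as SLAVESIDE_DISABLED after the 10.1 -> 10.4 upgrade can be found and re-enabled. The exact commands run on the listed hosts are not in this log; the schema/event names below are placeholders, the rest is standard MariaDB.)

    -- Find events that ended up as SLAVESIDE_DISABLED after the upgrade:
    SELECT EVENT_SCHEMA, EVENT_NAME, STATUS
      FROM information_schema.EVENTS
     WHERE STATUS = 'SLAVESIDE_DISABLED';

    -- Re-enable each event found, on every affected instance (placeholder names):
    ALTER EVENT some_schema.some_event ENABLE;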
[08:25:25] but there is a 300 GB difference between dbprovs right now
[08:25:36] so for now I will only move x1
[08:26:17] marostegui: I hope this month :)
[08:26:26] :___)
[08:26:56] I'm currently in Geneva though, trying to get back to the UK :D
[08:27:06] be safe :(
[08:40:34] I am going to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/579892 unless you see a mistake there
[08:40:54] and then create some invites on the calendar to manually purge old backups
[08:42:40] Oh, I just +1ed
[08:47:45] thank you
[08:48:50] I noticed the imbalance when taking the new snapshots
[08:48:53] on bacula
[08:51:00] sadly, it seems we are doing daily incrementals, will fix that
[09:00:41] my epic fail: https://gerrit.wikimedia.org/r/c/operations/puppet/+/579894
[09:02:02] Oh wow
[09:02:11] * Monthly: Fulls monthly, diffs every other fortnight, incr. daily
[09:02:13] I told alex that their name conventions were misleading
[09:02:15] :-D
[09:02:46] There is no way I would have guessed that
[09:02:47] I intended it to run only on Sundays
[09:02:56] marostegui: that is why I documented it on bacula
[09:03:07] because it was unclear until I dug further
[09:03:11] however, you would have noticed
[09:03:20] the daily backups on the bacula dashboard
[09:03:41] https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=dbprov2001.codfw.wmnet-Monthly-1st-Sun-Databases-mysql-srv-backups-snapshots-latest&from=1583850411977&to=1584348603489
[09:04:06] we were consuming close to 3TB a day :-D
[09:04:15] with both servers
[09:04:36] * marostegui bookmarks that
[09:04:38] Available Space=8.979 TB :-(
[09:04:41] I thought I had it on favs
[09:05:13] I may purge all snapshot jobs
[09:05:17] on bacula
[09:31:24] what is being done on es2, upgrades?
[09:32:52] yeah, and the restart
[09:32:54] mysql restart
[09:32:58] not rebooting the host though
[09:33:19] as I fear it might not come back, and with DCOps doing wfh... better not to risk it
[09:39:58] addshore marostegui yup, a month at most, I'm picking up the work left
[09:40:06] yay!
[09:40:11] that's good news
[10:12:38] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui)
[12:16:35] Amir1: are you doing this lock thingy? :)
[12:16:44] * addshore just got back home
[12:18:02] addshore: not right now, the errors are pretty small and it will get smaller once we stop writing to the old system (I'm rebuilding the holes from time to time)
[12:18:27] 20k last time I queried
[12:18:39] that's nothing IMO
[12:18:47] Amir1: okay, so, what's occurring today? :)
[12:19:06] addshore: I'm increasing the reads (already went to Q35M)
[12:19:12] schweet
[12:19:12] everything is now write both
[12:19:22] already warming up the cache for Q40M
[12:19:29] * addshore cheers
[12:20:01] once it's done, I move to Q40M, slowly to Q80M, then another check for the ones that were left out and then finishing it off
[12:20:17] then stopping the writes
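(Rough sketch of the per-item consistency check described above for the wb_terms -> new term store migration. The real comparisons are run with the Hadoop/MySQL checks mentioned further down in the log; the item id here is just an example, and the column names follow the old and new term store schemas as an assumption.)

    -- Old store: term rows for one item in wb_terms
    SELECT COUNT(*) AS old_store_rows
      FROM wb_terms
     WHERE term_full_entity_id = 'Q123456';

    -- New store: term rows for the same item in wbt_item_terms
    SELECT COUNT(*) AS new_store_rows
      FROM wbt_item_terms
     WHERE wbit_item_id = 123456;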
[12:20:35] https://usercontent.irccloud-cdn.com/file/X0nEUi2Y/image.png
[12:20:39] ^^ any idea what that was?
[12:21:27] Did you check it against https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops ?
[12:21:38] Item creations are always fun
[12:23:18] https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=1583996013132&to=1584065618418&fullscreen&panelId=10 <- I guess that's it
[12:30:20] oh yes
[12:30:20] :p
[14:58:09] marostegui: hey, I have a fun problem: https://www.wikidata.org/wiki/Q87501510 the label for this (some weird utf-16 I guess) is not saved in the wbt_text table but it is saved in wb_terms :D
[14:58:26] I can check the table information, maybe something is off there
[14:58:45] do they have the same collation and encoding?
[15:00:31] I should check
[15:02:18] both charsets are binary, so it's probably not database level, probably app layer
[15:13:11] Amir1: have you checked the rows' content?
[15:13:50] marostegui: I found the problem, it wasn't the charset, it was a deadlock. Ran it again and it worked fine, sorry for the false alarm
[15:14:09] so it was saved in one table but not the other?
[15:16:00] I asked Amir1 to run comparison checks before declaring the work done, as queries and other stuff can fail
[15:16:29] marostegui: yup
[15:16:42] jynus: we have been doing this for a while now, both with hadoop and mysql
[15:16:55] https://phabricator.wikimedia.org/T219123#5942140
[15:17:06] we fixed most of the holes
[15:17:14] I was examining them again
[15:17:19] Amir1: Ah, the issue was when your script was migrating stuff from wb_terms into the other one, no?
[15:17:52] yup, the script failed for this one
[15:18:25] failures like those are normal, we got a lot of those log entries in the past months
[15:18:35] Amir1: Ah ok, I was a bit scared at first
[15:19:00] yup
[19:40:16] 10DBA, 10Operations, 10Wikimedia-Incident, 10Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10jcrespo)
[19:47:15] 10DBA, 10Operations, 10Wikimedia-Incident, 10Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10jcrespo) Reminder: pc1010 stopped replication, but pc2 on codfw needs to replicate from it.
[19:48:21] 10DBA, 10Operations, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Krinkle)
[19:52:53] 10DBA, 10Operations, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10jcrespo) Both pending documentation and more research, but it is mitigated by being depooled.
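(Sketch prompted by the pc1010 / codfw pc2 reminder above. Host names come from the task; the statements are standard MariaDB, and whether they were actually run is not recorded in this log.)

    -- On the codfw pc2 host: confirm it replicates from pc1010 and is caught up.
    SHOW SLAVE STATUS\G
    -- Fields of interest: Master_Host, Slave_IO_Running, Slave_SQL_Running,
    -- Seconds_Behind_Master.

    -- On pc1010: list replicas still connected downstream.
    SHOW SLAVE HOSTS;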