[06:34:14] 10DBA, 10Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (10Marostegui) As Jaime pointed out, we have the upgrade in a few days, and if we keep observing this, we can also consider failing over the proxy itself to another host too.
[06:45:46] 10DBA, 10Data-Services, 10Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Marostegui) >>! In T246970#5969163, @Jdx wrote: > @zhuyifei1999: But why? The query used to execute in 11 minutes max. Is it a congestion issue, as Mike Peel suspects? It cou...
[07:02:00] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui)
[08:13:51] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui)
[08:14:03] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) p:05Triage→03Medium
[08:20:02] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui)
[08:20:20] yesterday dbprov1001 almost got full
[08:20:53] needs readjusting (moving some data to dbprov[12]002)
[08:21:17] all of a sudden, or are there graphs showing that growth?
[08:21:31] but I may need to reduce the retention a bit more (but keeping the longer term ones on bacula)
[08:22:17] in fact, it may have hit 100% again
[08:22:28] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10Marostegui) Some of these hosts had events disabled due to T247728. I have checked and enabled the ones missing.
[08:22:44] ah, no, that is utilization, that thing we don't care about
[08:23:01] marostegui: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=dbprov1001&var-instance=dbprov1002&fullscreen&panelId=521&from=1584260575130&to=1584346975130
[08:23:25] it ended up with 5 instances of a backup
[08:23:47] because it was dumping s4 and s8 at the same time
[08:24:13] 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) I have checked and enabled them on: ` db1107 db2085:3311 db1103:3312 db1103:3314 db2125 db1078 db2109 db2084:3314 db1096:3315 db2084:3315 db1096:3316 db1098:3316 db211...
[08:24:23] oh, s4 and s8 at the same time sounds painful yeah :(
[08:24:32] maybe move one of them out?
[08:24:32] so it pushed past the expected 76%
[08:24:49] although s8 will decrease soonish as we are about to be able to drop wb_terms
[08:24:51] + 2 uncompressed snapshots
[08:25:01] yeah, it is a combination of reasons
[08:25:18] addshore or Amir1: any ETA for dropping wb_terms? Just to know whether it will take 1 month, 2 or 2 weeks. Not pushing anything
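(Sketch for T247728 above: how events left as SLAVESIDE_DISABLED after the 10.1 -> 10.4 upgrade can be found and re-enabled. The exact commands run on the listed hosts are not in this log; the schema/event names below are placeholders, the rest is standard MariaDB.)

    -- Find events that ended up as SLAVESIDE_DISABLED after the upgrade:
    SELECT EVENT_SCHEMA, EVENT_NAME, STATUS
      FROM information_schema.EVENTS
     WHERE STATUS = 'SLAVESIDE_DISABLED';

    -- Re-enable each event found, on every affected instance (placeholder names):
    ALTER EVENT some_schema.some_event ENABLE;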
[08:25:25] but there is a 300 GB difference between dbprovs right now
[08:25:36] so for now I will only move x1
[08:26:17] marostegui: I hope this month :)
[08:26:26] :___)
[08:26:56] I'm currently in Geneva though, trying to get back to the UK :D
[08:27:06] be safe :(
[08:40:34] I am going to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/579892 unless you see a mistake there
[08:40:54] and then create some invites on the calendar to manually purge old backups
[08:42:40] Oh, I just +1ed
[08:47:45] thank you
[08:48:50] I noticed the imbalance when taking the new snapshots
[08:48:53] on bacula
[08:51:00] sadly, it seems we are doing daily incrementals, will fix that
[09:00:41] my epic fail: https://gerrit.wikimedia.org/r/c/operations/puppet/+/579894
[09:02:02] Oh wow
[09:02:11] * Monthly: Fulls monthly, diffs every other fortnight, incr. daily
[09:02:13] I told alex that their name conventions were misleading
[09:02:15] :-D
[09:02:46] There is no way I would have guessed that
[09:02:47] I intended it to run only on Sundays
[09:02:56] marostegui: that is why I documented it on bacula
[09:03:07] because it was unclear until I dug further
[09:03:11] however, you would have noticed
[09:03:20] the daily backups on the bacula dashboard
[09:03:41] https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=dbprov2001.codfw.wmnet-Monthly-1st-Sun-Databases-mysql-srv-backups-snapshots-latest&from=1583850411977&to=1584348603489
[09:04:06] we were consuming close to 3TB a day :-D
[09:04:15] with both servers
[09:04:36] * marostegui bookmarks that
[09:04:38] Available Space=8.979 TB :-(
[09:04:41] I thought I had it on favs
[09:05:13] I may purge all snapshot jobs
[09:05:17] on bacula
[09:31:24] what is being done on es2, upgrades?
[09:32:52] yeah, and the restart
[09:32:54] mysql restart
[09:32:58] not rebooting the host though
[09:33:19] as I fear it might not come back, and with DCOps doing wfh... better not to risk it
[09:39:58] addshore marostegui yup, a month at most, I'm picking up the work left
[09:40:06] yay!
[09:40:11] that's good news
[10:12:38] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui)
[12:16:35] Amir1: are you doing this lock thingy? :)
[12:16:44] * addshore just got back home
[12:18:02] addshore: not right now, the errors are pretty small and it will get smaller once we stop writing to the old system (I'm rebuilding the holes from time to time)
[12:18:27] 20k last time I queried
[12:18:39] that's nothing IMO
[12:18:47] Amir1: okay, so, what's occurring today? :)
[12:19:06] addshore: I'm increasing the reads (already went to Q35M)
[12:19:12] schweet
[12:19:12] everything is now write both
[12:19:22] already warming up the cache for Q40M
[12:19:29] * addshore cheers
[12:20:01] once it's done, I move to Q40M, slowly to Q80M, then another check for the ones that were left out and then finishing it off
[12:20:17] then stopping the writes
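(Rough sketch of the per-item consistency check described above for the wb_terms -> new term store migration. The real comparisons are run with the Hadoop/MySQL checks mentioned further down in the log; the item id here is just an example, and the column names follow the old and new term store schemas as an assumption.)

    -- Old store: term rows for one item in wb_terms
    SELECT COUNT(*) AS old_store_rows
      FROM wb_terms
     WHERE term_full_entity_id = 'Q123456';

    -- New store: term rows for the same item in wbt_item_terms
    SELECT COUNT(*) AS new_store_rows
      FROM wbt_item_terms
     WHERE wbit_item_id = 123456;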
[12:20:35] https://usercontent.irccloud-cdn.com/file/X0nEUi2Y/image.png
[12:20:39] ^^ any idea what that was?
[12:21:27] Did you check it against https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops ?
[12:21:38] Item creations are always fun
[12:23:18] https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=1583996013132&to=1584065618418&fullscreen&panelId=10 <- I guess that's it
[12:30:20] oh yes
[12:30:20] :p
[14:58:09] marostegui: hey, I have a fun problem: https://www.wikidata.org/wiki/Q87501510 the label for this (some weird utf-16 I guess) is not saved in the wbt_text table but it is saved in wb_terms :D
[14:58:26] I can check the table information, maybe something is off there
[14:58:45] do they have the same collation and encoding?
[15:00:31] I should check
[15:02:18] both charsets are binary, so it's probably not database level, probably app layer
[15:13:11] Amir1: have you checked the rows' content?
[15:13:50] marostegui: I found the problem, it wasn't the charset, it was a deadlock. Ran it again and it worked fine, sorry for the false alarm
[15:14:09] so it was saved in one table but not the other?
[15:16:00] I asked Amir1 to run comparison checks before declaring the work done, as queries and other stuff can fail
[15:16:29] marostegui: yup
[15:16:42] jynus: we have been doing this for a while now, both with hadoop and mysql
[15:16:55] https://phabricator.wikimedia.org/T219123#5942140
[15:17:06] we fixed most of the holes
[15:17:14] I was examining them again
[15:17:19] Amir1: Ah, the issue was when your script was migrating stuff from wb_terms into the other one, no?
[15:17:52] yup, the script failed for this one
[15:18:25] failures like those are normal, we got a lot of those log entries in the past months
[15:18:35] Amir1: Ah ok, I was a bit scared at first
[15:19:00] yup
[19:40:16] 10DBA, 10Operations, 10Wikimedia-Incident, 10Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10jcrespo)
[19:47:15] 10DBA, 10Operations, 10Wikimedia-Incident, 10Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10jcrespo) Reminder: pc1010 stopped replication, but pc2 on codfw needs to replicate from it.
[19:48:21] 10DBA, 10Operations, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Krinkle)
[19:52:53] 10DBA, 10Operations, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10jcrespo) Both pending documentation and more research, but it is mitigated by being depooled.
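(Sketch prompted by the pc1010 / codfw pc2 reminder above. Host names come from the task; the statements are standard MariaDB, and whether they were actually run is not recorded in this log.)

    -- On the codfw pc2 host: confirm it replicates from pc1010 and is caught up.
    SHOW SLAVE STATUS\G
    -- Fields of interest: Master_Host, Slave_IO_Running, Slave_SQL_Running,
    -- Seconds_Behind_Master.

    -- On pc1010: list replicas still connected downstream.
    SHOW SLAVE HOSTS;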