[00:36:09] 10DBA, 10Community-Tech, 10MediaWiki-extensions-GlobalPreferences, 10Patch-For-Review, 10Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4037166 (10MaxSem) @jcrespo: if we shaved off 80M+ rows totaling 6G+ in size from a different place, would it allevia...
[06:31:48] 10DBA, 10Wikimedia-Site-requests: Some slaves are impossible to check for replication lag in MediaWiki - https://phabricator.wikimedia.org/T189263#4036935 (10Marostegui) dbstore1002 is an analytics slave, it replicates all the shards, so its performance isn't great. It has different credentials on purpose. I d...
[06:44:27] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037480 (10Marostegui)
[06:44:31] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4037478 (10Marostegui) 05Open>03Resolved I am going to close this as resolved as nothing has come up. We will follow up the decommission of db1009 on T189...
[07:10:09] 10DBA, 10Wikimedia-Site-requests: Some slaves are impossible to check for replication lag in MediaWiki - https://phabricator.wikimedia.org/T189263#4037485 (10MaxSem) No, the issue is the opposite - lag checks can't see it, so more writes get piled up than it can handle.
[07:14:55] 10DBA, 10Wikimedia-Site-requests: Some slaves are impossible to check for replication lag in MediaWiki - https://phabricator.wikimedia.org/T189263#4037486 (10Marostegui) >>! In T189263#4037485, @MaxSem wrote: > No, the issue is the opposite - lag checks can't see it, so more writes get piled up than it can handle. Ah...
[07:25:27] 10DBA, 10Wikimedia-Site-requests: Some slaves are impossible to check for replication lag in MediaWiki - https://phabricator.wikimedia.org/T189263#4037504 (10Marostegui) Which user is used to check for this? wikiadmin? wikiuser?
[07:43:32] 10DBA, 10Wikimedia-Site-requests: Some slaves are impossible to check for replication lag in MediaWiki - https://phabricator.wikimedia.org/T189263#4037532 (10jcrespo) 05Open>03Invalid dbstore1002 is not a MediaWiki replica- it is not part of the production infrastructure. While it is nice to wait for it, i...
[07:48:40] 10DBA, 10Wikimedia-Site-requests: Some slaves are impossible to check for replication lag in MediaWiki - https://phabricator.wikimedia.org/T189263#4037538 (10jcrespo) To give more context, until it broke, dbstore1001 was exactly 1 day behind for recovery purposes- and that was ok.
[08:20:09] I will start deploying to mediawiki now
[08:20:24] deploying what?
[08:20:32] ah
[08:20:34] db1114
[08:20:35] ok :)
[08:26:45] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037570 (10Marostegui) >>! In T183469#4031760, @Marostegui wrote: > In order to replace db1020 (m2 master) and following: https://gerrit.wikimedia.org/...
[08:32:58] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037577 (10jcrespo) I had another idea (not much dissimilar) taking into account the new hosts that are coming- dumps will greatly benefit from larger...
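(Editor's note on the T189263 discussion above: a minimal sketch of the kind of replication lag check involved, assuming a standard MariaDB replica. The heartbeat variant assumes a pt-heartbeat style `heartbeat.heartbeat` table; that schema is an assumption here, not something stated in the log.)
```
-- Classic lag check, read on the replica itself. It needs the
-- REPLICATION CLIENT privilege, so a host with different credentials
-- (like dbstore1002 above) can be invisible to the checker.
SHOW SLAVE STATUS\G
-- -> inspect the Seconds_Behind_Master field

-- Heartbeat-style alternative: compare a timestamp the master writes
-- periodically against the replica's clock (pt-heartbeat convention;
-- the table and column names are assumptions).
SELECT TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS lag_seconds
  FROM heartbeat.heartbeat
 ORDER BY ts DESC
 LIMIT 1;
```
When such a check silently fails, the wait-for-replication throttling never kicks in, which matches MaxSem's "more writes get piled up than it can handle" description.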
[08:33:31] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037578 (10jcrespo)
[08:34:08] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037581 (10Marostegui) >>! In T183469#4037577, @jcrespo wrote: > I had another idea (not much dissimilar) taking into account the new hosts that are co...
[08:34:33] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037582 (10jcrespo) yes, sorry, multiinstance :-P
[08:37:54] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037587 (10Marostegui) I am going to move the backups of db1009 from db1113 to db1114
[08:38:45] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037588 (10jcrespo) maybe es2001? It has an older directory.
[08:39:24] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037590 (10Marostegui) >>! In T183469#4037588, @jcrespo wrote: > maybe es2001? It has an older directory. I wanted to avoid cross-dc transfers....but...
[08:40:19] 85040 Back Full 165,158 928.1 G es2001.codfw.wmnet-Monthly-1st-Mon-production-mysql-srv-backups-latest is running
[08:40:42] nice!
[08:46:57] db1114 is confusing- writes are lower, but the job queue problem has also reduced and we are now at the off-peak low, so it is difficult to measure
[08:48:36] but I think write throughput is still higher than on other hosts
[08:49:01] so now it has the downgraded kernel?
[08:49:34] yes
[08:49:41] I think it is a mariadb 10.1 thing
[08:51:34] should we give db1063 some main traffic and see if it behaves similarly?
[08:51:38] it is vslow now
[08:56:41] let me repool db1114 fully first
[08:56:48] sure
[09:01:52] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4037597 (10jcrespo)
[09:18:42] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4037613 (10jcrespo) The lower concurrency is better, but the problem is still ongoing- it is too "bursty"- moments where many connecti...
[09:57:00] errors have not come back, at least not immediately, but 1) it is too early 2) the load is not as high as when they happened
[09:57:22] so if it doesn't come back…it is the kernel?
[09:57:30] I mean, that's the difference with yesterday
[09:57:31] ?
[09:57:44] I think it could be just the restart
[09:57:57] well, and as I say, we have to wait
[09:58:05] yeah, and try the other kernel again I guess
[09:58:12] the newest one
[09:58:23] yes
[09:58:31] 10.1 is writing more
[09:58:41] that is almost certain
[09:58:55] which means we may have to compress stuff
[09:59:08] so maybe that's why the rc slaves are not affected?
[09:59:39] what is "that" in that last sentence?
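(Editor's aside before the conversation continues: "10.1 is writing more" is the kind of claim that can be sanity-checked from standard server status counters rather than eyeballed. A sketch, assuming you sample the same counters on each host over the same interval and diff the values; the counter names are standard MySQL/MariaDB status variables.)
```
-- Cumulative write-volume counters; run once, wait N seconds, run
-- again on each host being compared, then diff the deltas.
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('Innodb_data_written',    -- bytes written to data files
     'Innodb_os_log_written',  -- bytes written to the redo log
     'Com_insert', 'Com_update', 'Com_delete');  -- statement counts
```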
[09:59:53] the slaves are compressed
[10:00:00] the rc slaves
[10:00:06] I honestly don't know
[10:00:14] there are 3 issues going on at the same time
[10:00:28] the high writes/connections
[10:00:40] on enwiki in general/the master
[10:00:41] What I am saying is that if your theory is correct and it might be 10.1 and we need to compress…maybe the issue is not happening on the rc slaves, partially, because they are compressed
[10:00:53] 10.1 writes more than 10.0
[10:00:58] So maybe we can test db1063, which is also 10.1
[10:01:01] and the connection errors on db1116
[10:01:05] *1114
[10:01:22] and now, things are changing on 2 variables
[10:01:34] maybe more
[10:02:14] writes are lower: https://tendril.wikimedia.org/host/view/db1052.eqiad.wmnet/3306
[10:02:38] but is it the concurrency lowering? is it the backlog going away? or is it that it is Friday morning?
[10:02:48] it might be because of the hour
[10:02:52] or a mix of all of them
[10:02:59] there are no more errors on db1114, but that is because of the lower write load
[10:03:05] the reboot
[10:03:07] the kernel?
[10:03:23] I guess we need to wait till Monday to have some conclusions
[10:04:00] still no errors so far
[10:06:34] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4037678 (10jcrespo) On another note, is it normal to still get errors from 127.0.0.1, which I think points to the older queue...
[10:09:47] I have cancelled the ongoing bacula job, it was going to back up some garbage files that were still there by mistake
[10:10:16] a new job will run tomorrow
[10:10:27] cool!
[10:13:27] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037688 (10Marostegui) >>! In T183469#4037577, @jcrespo wrote: > I had another idea (not much dissimilar) taking into account the new hosts that are co...
[10:16:19] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037692 (10jcrespo) ok to me- with "candidate hosts" and "statement hosts", the puzzle gets more and more difficult.
[10:23:02] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4037711 (10Marostegui) >>! In T183469#4037692, @jcrespo wrote: > ok to me- with "candidate hosts" and "statement hosts", the puzzle gets more and more...
[10:29:39] 10DBA, 10Patch-For-Review: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#4037715 (10Marostegui)
[10:46:20] errors came back :-)
[10:46:30] :(
[10:47:02] it is good
[10:47:02] Don't know if I prefer it to be the kernel or 10.1, I guess kernel XD
[10:47:07] it is a metric
[10:47:22] they came back 15 minutes ago
[10:48:23] I think we may have to change the connection pool configuration- it cannot handle bursts like the other hosts
[10:48:43] so that is 10.1
[10:48:49] that specific thing, I mean
[10:48:56] that happened with 5.5 -> 10.0 masters
[10:49:06] 10DBA, 10Patch-For-Review: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#3874417 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1113.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2...
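(Editor's note on "the slaves are compressed" above: on this era of MariaDB that normally means the InnoDB compressed row format. A minimal sketch of converting one table; the table name and block size are illustrative, not taken from the log.)
```
-- Prerequisites on MariaDB 10.0/10.1-era servers:
--   innodb_file_per_table = ON
--   innodb_file_format    = Barracuda
ALTER TABLE revision
    ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=8;   -- compressed page size in KB (illustrative)
```
Compression trades CPU for smaller pages, so a compressed replica does less raw I/O per write, which is why it could partially mask a "10.1 writes more" regression.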
[10:49:19] and it is not happening on special slaves because those should not receive that kind of load
[10:49:43] So what happened from 5.5 to 10.0?
[10:49:51] Kinda the same behaviour?
[10:49:53] my theory is that it both is and is not 10.1- I think the connection patterns from the job runners are not ideal
[10:50:09] marostegui: the stall limit configuration had to be changed
[10:50:49] there are 800-1000 new connections per minute
[10:51:24] to that particular host?
[10:51:26] yes
[10:51:29] or in general?
[10:51:42] the number is high, but it is the same as on other 10.0 hosts
[10:51:59] but it causes way more aborted connects
[10:52:34] it could be anything, but if it caused problems in the past-- I will look at the pool configuration first
[10:53:12] now, I think the number of new connections is high, and that is the above ticket I commented on
[10:53:48] I will try things, and will ask for your help if I get stuck/out of ideas
[10:54:20] I think it should be me because the issue was very similar to one I solved 2-3 years ago
[10:55:01] yes
[10:55:02] of course
[10:55:09] plus you said you like those things :)
[10:55:30] I will keep working on the schema change, the replacement for the vslow/m1/m2 hosts and the checksumming
[10:57:03] I think this is important because it is a blocker for 10.1
[10:57:11] not because of the host itself
[10:57:12] it is a big thing
[10:57:27] I think it is a 10.1 regression/config change
[10:57:53] I will try to sprint to see if I can get to decomm the pending hosts for m1 and m2, and finish the decommissioning goal
[10:57:56] we still have 2 weeks
[10:58:09] The checksumming is going well, I hope to get it done by early next week
[10:58:52] it is ok, I will try to finish the backups in the background
[10:59:16] and then help with the decommissioning
[11:03:59] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-03-13 (1.31.0-wmf.25)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4037759 (10jcrespo) As a followup, and out of scope of this ticket- are...
[11:06:48] 10DBA, 10Patch-For-Review: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#4037762 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1113.eqiad.wmnet'] ``` and were **ALL** successful.
[11:29:11] I would appreciate a quick review of: https://gerrit.wikimedia.org/r/#/c/417812/
[11:29:19] (not urgent)
[11:31:42] thanks!
[11:34:30] I was busy yesterday afternoon and didn't manage to follow up more closely on the possible kernel performance regression for DBs, is there a new status in the meantime or anything that needs my attention?
[11:35:11] we don't believe there is a kernel regression, but rather a mariadb 10.1 configuration issue, but there is no conclusive data yet
[11:35:18] ok, thanks
[11:35:32] because the load has lowered, the errors have, too
[11:35:46] 10DBA, 10Patch-For-Review: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#4037806 (10Marostegui) a:03Marostegui
[11:35:57] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-03-13 (1.31.0-wmf.25)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4037808 (10EddieGP) >>! In T176754#4037759, @jcrespo wrote: > As a foll...
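(Editor's note on the "stall limit" and "pool configuration" mentioned above: these refer to MariaDB's thread pool. A sketch of the knobs and counters involved; the value shown is a placeholder, not a recommendation.)
```
-- Inspect the current thread pool settings:
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';

-- thread_pool_stall_limit (milliseconds, default 500) controls how
-- long a thread group may stall before the pool spawns an extra
-- worker; bursts of short-lived connections, like the 800-1000 new
-- connections/minute above, can queue up and abort if it is too high.
SET GLOBAL thread_pool_stall_limit = 100;  -- placeholder value

-- Watch connection churn and failures while tuning:
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('Connections', 'Aborted_connects', 'Threads_created');
```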
[11:37:03] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-03-13 (1.31.0-wmf.25)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4037814 (10jcrespo) That makes sense, thank you.
[13:17:16] 10DBA, 10Patch-For-Review: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161#4038064 (10Marostegui)
[13:40:59] 10DBA, 10OTRS: OTRS database is "too large" - https://phabricator.wikimedia.org/T138915#4038079 (10jcrespo) Yearly state of otrs report (compressed):
```
 12G ./dump.m1.2018-03-07--13-22-38
329G ./dump.m2.2018-03-07--13-47-42
 29G ./dump.m3.2018-03-07--21-28-09
 11G ./dump.m5.2018-03-07--17-52-13
9...
```
[14:53:49] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034945 (10Pchelolo) > On another note, is it normal to still get errors from 127.0.0.1, which I think points to the older qu...
[15:03:00] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038274 (10jcrespo) Thanks, that last comment was indeed *very very* useful. > one-by-one and that means each job establishes a new conn...
[15:11:01] 10DBA, 10Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#4038281 (10jcrespo)
[15:12:43] 10DBA, 10Operations, 10Goal: Generate consistent logical database backups in CODFW - https://phabricator.wikimedia.org/T184699#4038285 (10jcrespo)
[15:12:47] 10DBA, 10Patch-For-Review: Failover existing eqiad database backup system to the new codfw database logical backup system - https://phabricator.wikimedia.org/T184697#4038286 (10jcrespo)
[15:12:49] 10DBA, 10Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#3892707 (10jcrespo) 05Open>03Resolved Technically this is done and automated- however it needs some efficiency and speed improvements (parallel...
[15:13:20] 10DBA, 10Patch-For-Review: Failover existing eqiad database backup system to the new codfw database logical backup system - https://phabricator.wikimedia.org/T184697#3892734 (10jcrespo) We should check the generation is working this week.
[15:14:23] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038288 (10Pchelolo) > Proxy/connection pool is something that we are going to use for crossdc connections, so it was already in the back...
[15:16:19] 10DBA, 10Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562#4038289 (10jcrespo)
[15:19:00] 10DBA, 10Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562#4038294 (10jcrespo)
[15:19:07] 10DBA, 10Operations, 10Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#4038292 (10jcrespo) 05Open>03Resolved The previous ticket was closed, other things, like the coordination and monitoring previously mentioned, will be hand...
[15:20:54] 10DBA, 10Operations, 10monitoring, 10Patch-For-Review: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#4038297 (10jcrespo) This is partially implemented on T184696, if backups fail, they are not rotated. The...
[15:21:41] 10DBA, 10Operations, 10monitoring, 10Patch-For-Review: Create script to monitor db dumps for backups are successful - https://phabricator.wikimedia.org/T151999#4038299 (10jcrespo)
[15:23:58] 10DBA, 10Epic: [META ticket] Automation for our DBs tracking task - https://phabricator.wikimedia.org/T156461#4038309 (10jcrespo)
[15:24:00] 10DBA, 10Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562#4038310 (10jcrespo)
[15:24:02] 10DBA: Automate dataset backup recovery - https://phabricator.wikimedia.org/T157668#4038307 (10jcrespo) 05Open>03declined Conditions have changed- I could say there is already an automatic recovery script, though it still needs some work, but this was filed when backups were done the old-fashioned way. D...
[15:26:13] 10DBA, 10Operations: Puppetize grants for mysql hosts that are the source of recovery (dbstore, passive misc) - https://phabricator.wikimedia.org/T111929#4038314 (10jcrespo)
[15:27:15] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4038321 (10Marostegui) db1113:3315 and db1113:3316 are now compressing tables. I will pool this host on Monday and if it all goes fine for 24h, I will...
[15:27:50] 10DBA: Use tls for dump backup generation - https://phabricator.wikimedia.org/T151583#4038322 (10jcrespo) This needs more work, probably recompiling mydumper to support modern standards (TLS 1.2+).
[15:30:21] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#4038328 (10jcrespo)
[15:30:24] 10DBA, 10Operations, 10Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#4038327 (10jcrespo)
[15:32:29] 10DBA, 10Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562#4038343 (10jcrespo)
[15:32:35] 10DBA, 10Operations, 10Goal: Generate consistent logical database backups in CODFW - https://phabricator.wikimedia.org/T184699#4038342 (10jcrespo)
[15:35:35] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038391 (10jcrespo) > I think it would be much better to do it on your side of the "fence" I can own this no problem, but if I do, I will a...
[15:40:42] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#4038396 (10Marostegui) Is this still an issue?
[15:42:27] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 6 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4038398 (10jcrespo)
[15:51:34] 10DBA, 10Wikidata, 10Performance: Spiky write pattern on core db masters - https://phabricator.wikimedia.org/T144382#2597752 (10Marostegui) I believe this is no longer happening?
[15:59:42] 10DBA, 10Wikidata, 10Performance: Spiky write pattern on core db masters - https://phabricator.wikimedia.org/T144382#4038455 (10jcrespo) 05Open>03Resolved a:03hoo Not that I can see: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-s...
[16:45:14] 10DBA, 10Operations: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#4038593 (10jcrespo) a:03Andrew Close or invalid, probably?
[16:47:46] 10DBA, 10Data-Services: Add base36 functions to ToolForge database - https://phabricator.wikimedia.org/T185673#4038597 (10jcrespo) UDFs require some C coding and compilation: https://dev.mysql.com/doc/refman/5.7/en/adding-udf.html Do you maybe mean stored procedures? That would be much easier.
[17:29:27] 10DBA, 10Operations: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#4038701 (10Andrew) 05Open>03Invalid Yep, this is moot as silver is about to be switched off.
[17:31:05] 10DBA, 10Operations: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#4038719 (10jcrespo) Andrew- I will not do that, but you may want to search open tickets with the keyword "silver" or "wikitech"- you will probably be able to get rid of a lot o...
[18:53:19] 10DBA, 10Community-Tech, 10MediaWiki-extensions-GlobalPreferences, 10Patch-For-Review, 10Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4038914 (10kaldari) @jcrespo: Any further thoughts on this or are you waiting on feedback from @mark?
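(Editor's note on the base36 request in T185673 above: a sketch of the stored-routine route jcrespo suggests. MariaDB's built-in CONV() already handles bases 2 through 36, so stored functions can be thin wrappers; the function names are illustrative, not an existing API.)
```
DELIMITER //

-- Encode an unsigned integer as base36 (digits 0-9, A-Z).
CREATE FUNCTION to_base36(n BIGINT UNSIGNED)
RETURNS VARCHAR(13) DETERMINISTIC
RETURN CONV(n, 10, 36)//

-- Decode a base36 string back to an unsigned integer.
CREATE FUNCTION from_base36(s VARCHAR(13))
RETURNS BIGINT UNSIGNED DETERMINISTIC
RETURN CONV(s, 36, 10)//

DELIMITER ;

-- Example: SELECT to_base36(1234567);  -- returns 'QGLJ'
```
Unlike a UDF, this needs no C code or compilation, which is presumably why it is called out above as the much easier option.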