[06:00:11] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:07:36] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:11:09] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:19:50] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui) a:Marostegui
[06:23:39] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:24:45] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:36:18] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:38:30] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:46:00] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[06:52:23] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[07:46:38] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[07:49:47] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[07:54:59] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[08:03:01] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[08:05:49] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[08:06:06] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui) This is all done, before closing it I am doing a quick data check on some tables across all wikis to make sure nothing has drifted.
Progress [] s1 [] s2 [] s3 [] s4 [] s5 [x] s6 [] s7 [] s8 [] x1 [] es4 [] es5
[08:06:44] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui)
[08:19:15] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui) GTID enabled everywhere on codfw masters: ` sudo cumin "P{P:mariadb::mysql_role%role = master and *.codfw.wmnet}" 'mysql -e "show slave status\G" | grep Using' 15 hosts will be targeted: db[2079,2090,2096,210...
[08:20:32] DBA: Disconnect codfw -> eqiad replication - https://phabricator.wikimedia.org/T266663 (Marostegui) And replication disabled everywhere on eqiad masters (pc1010 can be ignored): ` sudo cumin "P{P:mariadb::mysql_role%role = master and *.eqiad.wmnet}" 'mysql -e "show slave status\G"' 20 hosts will be targeted:...
[08:34:25] marostegui: FYI `P:mariadb::mysql_role%role = master` can be written as `A:db-role-master`
[08:35:01] ah, indeed, good point
[08:35:22] * marostegui updates his notes
[08:35:46] more minorly, `*.codfw.wmnet` is also available as `A:codfw`
[08:37:55] notes updated to: "A:db-role-master and A:codfw"
[08:37:56] \o/
[08:56:11] DBA, Community-Tech, Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (Marostegui) @ifried can we do: Wikidata on 17th Nov and then the 24th Nov on all wikis? I prefer to have the feature enabled on 1 big wiki first, before going fu...
[09:42:03] Blocked-on-schema-change, DBA: Schema change to drop three indexes from wb_changes - https://phabricator.wikimedia.org/T264109 (Marostegui) Open→Resolved codfw master is done, so this is all finished ` # ./section s8 | while read host port ; do echo "$host:$port" ; mysql.py -h$host:$port -A w...
[09:53:26] DBA, Operations, Orchestrator: Repackage orchestrator - https://phabricator.wikimedia.org/T266763 (Kormat)
[11:20:36] there was another spike in s3 latencies with db1075
[11:20:59] at 11:01
[11:22:47] this is particularly interesting: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=23&orgId=1&var-server=db1075&var-port=9104&from=1603797758112&to=1603970558112
[11:29:59] jynus: see #mw_security, the timing looks suspiciously similar.
[11:30:26] I am aware, but I don't think is related
[11:30:40] this looks more like the issue after switchover
[11:31:24] it could be indirectly related
[11:31:46] but maybe not the root cause as "the node doesn't work well under extra load"
[11:33:03] this could mean there is an inefficient workload, as it is very spiky: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db1075&var-port=9104&from=1603798361417&to=1603971161418
[11:34:12] I have a candidate query
[11:34:29] I will paste it on a security ticket
[11:45:15] see https://phabricator.wikimedia.org/T266775
[11:59:03] thoughts on making T263220 unbreak now?
[11:59:04] T263220: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220
[11:59:38] +1
[12:01:51] I have added it as this is ongoing
[12:02:00] according to tendril
[12:02:27] it is dpl clearly coming back, same query, same affected wiki
[12:07:03] Am I dreaming or we had a template for new database creations?
[12:20:38] I couldn't find it anywhere, so I have created this at least to have it somewhere: https://wikitech.wikimedia.org/wiki/MariaDB#Database_creation_template
[12:29:38] DBA, Operations, Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (Marostegui) Thanks @RLazarus. pc1 is a bit different than the rest, as it has 2 hosts rather than the normal pc1008->pc2008 (pc2) or pc1009 -> pc2009. pc1 has the follo...
[13:01:57] DBA, Orchestrator, Patch-For-Review: Populating orchestrator metadata on a per-server basis - https://phabricator.wikimedia.org/T266485 (Marostegui) Open→Resolved a:Marostegui pc2 has been discovered and the alias has been set correctly: ` root@dborch1001:~# orchestrator -c discover -i 10...
[13:02:05] DBA, Orchestrator, Patch-For-Review: Populating orchestrator alias metadata on a per-server basis - https://phabricator.wikimedia.org/T266485 (Marostegui)
[13:10:59] DBA, Operations, Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (jcrespo) If I can provide more background, unless normal circumstances, pc* hosts are active-active, and no change should happen on them (no read only changes, etc.). Th...
[13:51:01] DBA: querycache qc_type and qc_title have different nullabality on s1 only - https://phabricator.wikimedia.org/T265349 (Marostegui) Self note, this is the alter I executed and need to happen on eqiad: ` alter table querycache change qc_type qc_type varbinary(32) NOT NULL, change qc_title qc_title varbinary(2...
[15:35:06] Urbanecm: impact should be clear at https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db1075&var-port=9104 and I think it is
[15:35:20] a spiky pattern is a hurtful one (normally)
[15:36:37] jynus: great! So far it seems it helped, right?
[15:37:00] Urbanecm: one second I can confirm with more metrics
[15:37:45] sure
[15:45:21] thank you a lot, Urbanecm! https://phabricator.wikimedia.org/T266775#6588735
[15:45:22] DBA, DynamicPageList (Wikimedia), MediaWiki-General, Security: Stalls on db1075 (s3) replica db - https://phabricator.wikimedia.org/T266775 (jcrespo) I am mostly certain that this was the issue causing db1075 stalls, as processlist has decreased a lot (a slow query can be millions of times more i...
[15:45:29] happy to help!
[15:45:35] DBA, DynamicPageList (Wikimedia), MediaWiki-General: Stalls on db1075 (s3) replica db - https://phabricator.wikimedia.org/T266775 (CDanis)
[15:45:37] DBA, DynamicPageList (Wikimedia), MediaWiki-General: Stalls on db1075 (s3) replica db - https://phabricator.wikimedia.org/T266775 (CDanis)
[15:45:41] I wasn't even aware of this existing issue
[15:47:30] DBA, DynamicPageList (Wikimedia), MediaWiki-General: Stalls on db1075 (s3) replica db - https://phabricator.wikimedia.org/T266775 (jcrespo) Open→Resolved a:jcrespo I am going to consider this resolved, unless we were completely wrong and this wasn't the cause of the database stalls/connec...
[15:47:53] DBA, DynamicPageList (Wikimedia), MediaWiki-General, User-Urbanecm: Stalls on db1075 (s3) replica db - https://phabricator.wikimedia.org/T266775 (jcrespo) a:jcrespo→Urbanecm
[15:56:16] DBA, DynamicPageList (Wikimedia), MediaWiki-General, Datacenter-Switchover, User-Urbanecm: Stalls on db1075 (s3) replica db - https://phabricator.wikimedia.org/T266775 (jcrespo) Not related, but first incident was spotted during datacenter-switchover.