[05:22:09] 10DBA, 10Wikimedia-Rdbms, 10Core Platform Team Legacy (Watching / External), 10Goal, and 4 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10Krinkle)
[06:37:15] 10DBA, 10Patch-For-Review: Drop 'designate_pool_manager' database from m5 and remove associated grants - https://phabricator.wikimedia.org/T233978 (10Marostegui) Grants removed for the `designate_pool_manager` DB: ` root@db1133.eqiad.wmnet[(none)]> show grants for 'designate'@'208.80.154.12'; +---------------...
[06:46:07] 10DBA, 10Patch-For-Review: Drop 'designate_pool_manager' database from m5 and remove associated grants - https://phabricator.wikimedia.org/T233978 (10Marostegui) 05Open→03Resolved I have dropped the database. It didn't have much data: ` mysql.py -hdb1117:3325 designate_pool_manager -e "select count(*) from...
[08:02:30] T224422#5562296 ?
[08:02:31] T224422: Implement logic to filter bogus GTIDs - https://phabricator.wikimedia.org/T224422
[08:07:02] I just responded
[08:08:40] so just to be clear, I don't think that will get them "totally cleared"
[08:08:58] but synced with the master well enough to prevent the errors
[08:09:04] sure
[08:09:10] I think you got my idea
[08:10:22] so not only M-M-X, but other stuff (M-M-X, M2-M2-Y), as long as it is the same on both hosts
[08:10:22] M2: Confirm MediaWiki Account Link - https://phabricator.wikimedia.org/M2
[08:32:02] what if I use db1084, currently depooled and down, and I return it to you in a good state?
[08:32:56] sounds good to me!
[08:33:21] ok, taking over db1084, will ping you when finished
[08:33:27] cheers
[08:58:46] 10DBA, 10Core Platform Team, 10Wikimedia-Rdbms, 10Core Platform Team Legacy (Watching / External), and 5 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10Gilles)
[09:07:49] sorry to disturb you manuel, but could you check my notes, because I cannot believe mariadb gtid is broken beyond what we initially thought it was
[09:08:10] what's up
[09:08:19] see our shared etherpad
[09:08:23] checking
[09:08:39] starting on line 64
[09:08:47] but you don't have to read everything
[09:08:58] it is just there to check I didn't do anything wrong
[09:09:09] db1138 is the master, right?
[09:09:14] yes
[09:09:29] you can jump to line 134
[09:10:17] my theory that the replica had extra events not found on the master is not true
[09:10:31] they are the same, but anyway I reset everything
[09:10:38] just in case
[09:10:47] then did the thing based on the manual
[09:12:27] oh, I see
[09:12:35] did you clean up all the gtid tables as well?
[09:12:36] I was checking against io
[09:12:50] it works if I check against slave_pos
[09:13:01] yeah, we use slave_pos
[09:13:08] I believe the io one is used for the current_pos
[09:13:09] it is a mess
[09:13:26] so it works
[09:13:35] it is confusing but it should work as intended
[09:13:45] but why do you have gtids from codfw again?
[09:13:47] then why does it fail on mediawiki?
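(Editor's note: a minimal sketch of the slave_pos vs current_pos check discussed above, assuming a MariaDB replica; the GTID value is a placeholder, not one from this incident.)

  -- On the replica, compare the GTID positions MariaDB tracks:
  -- gtid_slave_pos is what has been applied via replication, gtid_binlog_pos
  -- is what has been written to the local binlog, and gtid_current_pos is
  -- derived from the two.
  SELECT @@gtid_slave_pos, @@gtid_binlog_pos, @@gtid_current_pos;

  -- MASTER_GTID_WAIT() waits on gtid_slave_pos only: it returns 0 once the
  -- replica has applied the given GTID, or -1 if the timeout (in seconds)
  -- expires. '0-1-1' is a placeholder GTID (domain-server_id-seq_no).
  SELECT MASTER_GTID_WAIT('0-1-1', 1);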
[09:14:21] those will be heartbeat from the master, or from when the master was there
[09:14:32] just to be clear, I didn't want to fix the gtid contamination
[09:14:36] you already worked on that
[09:14:39] yeah
[09:14:59] I was just curious because the last time we were on codfw was more than a year ago
[09:14:59] I just wanted to start fresh, thinking the wait wouldn't work
[09:15:30] but it works based on the manual and a local run
[09:15:42] but it fails based on mw logs
[09:16:18] could it be that it is just a log issue?
[09:16:29] what do you mean, a log issue?
[09:16:50] "Timed out waiting for replication to reach 171966557-171966557-578966402,171978775-171978775-4822899280,171978876-171978876-58874515 "
[09:17:02] but the query was
[09:17:03] SELECT MASTER_GTID_WAIT('171978876-171978876-100850438', 1)
[09:18:31] wait, that is a weird log + query, isn't it?
[09:18:49] it should be waiting on 171978876-171978876-58874515
[09:18:55] not 100850438
[09:19:04] so it may be a code issue
[09:19:14] am I crazy?
[09:19:58] for once it may not be a mariadb issue?
[09:20:53] so 171978876 makes sense because it is db1138
[09:21:25] yeah, but the set id is different
[09:21:40] however, in your example at https://phabricator.wikimedia.org/T224422#5558330 it is the same
[09:22:44] ok, I will comment on the ticket with what I discovered, and shut down db1084
[09:22:49] yeah, it makes no sense
[09:22:54] but I am not entirely sure it is a code issue
[09:23:11] I believe the initial idea is a good workaround, but there may be an implementation issue
[09:23:55] how's the query being built up?
[09:27:40] so I am going to report that, and ask for it to be looked at
[09:27:51] it may be a misleading thing
[09:28:04] yeah, let's see what they say with your report and my report
[09:28:08] but I think it needs to be corrected first before further investigation
[09:29:49] actually I am wrong, the error log is correct
[09:29:53] so it is something else
[09:30:12] I was comparing 2 different errors by mistake
[09:30:54] 171978876-171978876-108546562 vs 171978876-171978876-109087539 ?
[09:39:26] 10DBA, 10Core Platform Team, 10Wikimedia-Rdbms, 10Core Platform Team Legacy (Watching / External), and 4 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10daniel) Patch was merged, removing the patch for review tag....
[09:39:49] 10DBA, 10Core Platform Team, 10Wikimedia-Rdbms, 10Core Platform Team Legacy (Watching / External), and 4 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10daniel) Not sure what CPT can do here. Tagging for triage.
[10:12:22] 10Blocked-on-schema-change, 10DBA, 10Core Platform Team: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 (10Marostegui)
[10:12:24] 10Blocked-on-schema-change, 10DBA: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 (10Marostegui)
[11:41:29] 10DBA: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 (10Marostegui)
[12:37:04] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Reclone labsdb1011 - https://phabricator.wikimedia.org/T235016 (10Marostegui) labsdb1011 has been recloned. I am letting it catch up a bit (it is 7h delayed) before repooling it.
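(Editor's note: a short illustration of the timeout behaviour discussed above, reusing the GTID values quoted from the log; the semantics described are standard MariaDB, not anything WMF-specific.)

  -- MASTER_GTID_WAIT() accepts a comma-separated GTID list and only returns 0
  -- once gtid_slave_pos has reached the given seq_no in every listed domain,
  -- so a single stale domain (e.g. one left over from an old codfw master) is
  -- enough to make the whole call time out and return -1.
  SELECT MASTER_GTID_WAIT('171966557-171966557-578966402,171978775-171978775-4822899280,171978876-171978876-58874515', 1);

  -- Waiting only on the current master's domain succeeds as soon as that
  -- domain has caught up.
  SELECT MASTER_GTID_WAIT('171978876-171978876-58874515', 1);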
[12:47:11] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Reclone labsdb1011 - https://phabricator.wikimedia.org/T235016 (10Marostegui) For the record, I have also documented briefly how to reclone one of the wikireplicas: https://wikitech.wikimedia.org/w/index.php?title=MariaDB&diff=prev&oldid=1840626#Recloni...
[13:40:52] lots of errors on db1087
[13:41:30] it seems to be constantly lagging
[13:41:46] could it be because of all the migration?
[13:41:57] not sure, but it has a lot of processes
[13:42:25] https://grafana.wikimedia.org/d/000000273/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1087&var-port=9104&from=now-24h&to=now
[13:42:26] nice
[13:42:40] the purge list is growing
[13:42:47] maybe I should depool it
[13:42:52] I am going to start repooling hosts
[13:42:55] that were out for the PDU work
[13:43:06] I am finishing with labsdb1011
[13:43:11] but there are other hosts with more qps
[13:43:14] let me check if the hosts are now up-to-date with replication
[13:43:16] this one is the only one lagging
[13:43:22] 1-2 seconds all the time
[13:43:45] there is one query that's been there for 12h now
[13:43:49] I am going to kill that one
[13:43:57] admin?
[13:44:01] yep
[13:44:03] gather its data, ip, etc
[13:44:07] please
[13:44:22] yup
[13:44:50] killed
[13:44:59] I will investigate and open a ticket/contact the owner
[13:45:08] can you share its data, even if in private?
[13:45:15] I will let you do that
[13:45:18] https://phabricator.wikimedia.org/P9298
[13:45:30] but to see if it was the cause of the wikidata issues
[13:45:44] I see
[13:45:48] probably related
[13:45:51] yeah
[13:45:57] will paste the finding
[13:46:00] matches the graphs
[13:46:07] ah, you will do that then?
[13:46:16] well, just the update to the ticket
[13:46:24] although there is not much to investigate
[13:46:30] I am confused, not sure what you are referring to now
[13:46:31] it is a slow query from cron
[13:46:51] yes, I know that
[13:46:58] which matches the start of the issues at 0:00
[13:47:11] the query just started again, I am going to kill the process
[13:47:12] I will update the ticket and let you know, in case you want to do something else
[13:47:16] on mwmaint
[13:50:00] done
[13:50:24] this is what I commented: https://phabricator.wikimedia.org/T234948#5562953
[13:50:45] feel free to make any other changes or do more research
[13:50:46] I will provide more details now
[13:51:07] I will focus on making sure things are back to normal
[13:51:12] ok thanks
[13:51:35] they are
[13:51:37] no more lag
[13:51:59] this links to the previous thing I was doing, the chronology protector
[13:52:09] and the reason why there were so many errors
[13:52:38] I will comment on that ticket too
[13:53:53] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1087&var-port=9104&from=now-15m&to=now&panelId=3&fullscreen
[13:53:56] that is back to normal
[13:55:28] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Reclone labsdb1011 - https://phabricator.wikimedia.org/T235016 (10Marostegui) 05Open→03Resolved Host repooled
[14:17:56] did you repool all hosts after the pdu stuff, or are some of them still depooled or with low load?
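(Editor's note: a sketch of the kind of check-then-kill described above, not the exact commands that were run; the 3600-second threshold and the connection id are illustrative.)

  -- List long-running queries with enough context (user, host, db) to
  -- identify the owner before killing anything.
  SELECT id, user, host, db, time, LEFT(info, 100) AS query
    FROM information_schema.processlist
   WHERE command = 'Query' AND time > 3600
   ORDER BY time DESC;

  -- Terminate the offending connection using the id returned above
  -- (illustrative value).
  KILL 123456789;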
[14:18:15] still repooling
[14:18:20] warming them up
[14:18:25] ok, I wanted to do some research on the gtid issue
[14:18:30] but I will wait for a more normal state
[14:18:33] no hurry
[14:18:34] sure
[14:18:58] I was thinking then of upgrading the dbmonitor hosts, starting with the passive one
[14:19:20] these are the apaches for tendril, if you are momentarily lost
[14:19:38] Didn't moritzm do them recently?
[14:19:50] I think he does regular updates
[14:19:57] I was thinking of doing a buster upgrade
[14:20:02] as apache should be stable
[14:20:15] and they will call us out on them still being jessie
[14:20:20] but maybe I am wrong
[14:20:20] sure
[14:20:32] yeah, so far I only deployed sec updates, but happy to help with that wrt generic OS questions/reviews
[14:20:42] jessie is in fact going away by the end of March
[14:20:46] not precisely very pressing, but I am blocked on a few other more important things
[14:20:56] moritzm: I was trying to make you happy :-D
[14:21:02] jynus: sure, if you have time to kill, go for it
[14:21:13] "having time" is relative
[14:21:15] :-P
[14:21:20] you got the idea I think
[14:21:24] :-D
[14:21:36] I think you wanted a gif
[14:21:46] I vaguely remember that we had some specific php5 corner case in there? or did that go away with the tendril -> zarcillo migration?
[14:21:47] I will only give you one if I succeed
[14:22:13] it had some issue when it lived on icinga
[14:22:26] but it was mostly fixed by alex when it was upgraded to jessie
[14:22:28] ah, yes, probably that
[14:22:32] I expect fewer issues this time
[14:22:47] and I also expect our apache module to be in a good state
[14:22:59] among all of them, we'll see
[14:24:05] oh, puppet maintenance is ongoing, I may have to wait
[14:25:30] given that these are on Ganeti, there's also the option to create debmonitor1002 in parallel and then switch over
[14:25:39] dbmonitor1002 :-)
[14:25:45] nah
[14:25:51] dbmonitor1001 is passive
[14:25:57] sorry
[14:25:59] I meant 2001
[14:26:05] it should be good to go
[14:26:18] ah yes, that's also a good option
[14:26:24] and when we do dbmonitor1001, we would be ready
[14:26:42] also, way easier compared to a database upgrade
[14:26:50] for us it is like child's play!
[14:27:11] compared to the data/db/monitoring migration
[14:27:22] hehe :-)
[14:43:12] 10DBA, 10Operations, 10User-notice: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (10Xaosflux)
[14:45:28] marostegui: reminder that dbctl will run much faster on cumin2001 :)
[14:45:48] yeah :)
[14:45:51] I tested it the other day actually
[14:45:57] I will use it for the switchovers
[14:46:09] but on a daily basis I'm on cumin1001 normally :)
[14:50:31] fair enough :)
[16:29:58] 10DBA, 10Performance-Team, 10conftool: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686 (10Krinkle) a:03aaron I'm not familiar with what's being asked exactly, deferring to @aaron for now. Happy to help later if needed.
[16:50:54] 10DBA, 10Operations, 10User-notice: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (10Johan) This only affects English Wikipedia, right?
[16:52:01] 10DBA, 10Operations, 10User-notice: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (10Marostegui) Yep!
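(Editor's note: for the "check replication before repooling" step mentioned above, a minimal lag check against pt-heartbeat; it assumes pt-heartbeat's default heartbeat.heartbeat table, with ts stored as a UTC timestamp, and master/replica clocks in sync.)

  -- On the replica: age of the newest heartbeat row written by the master,
  -- i.e. an approximation of replication lag in seconds.
  SELECT server_id,
         TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS lag_seconds
    FROM heartbeat.heartbeat
   ORDER BY ts DESC
   LIMIT 1;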
[18:55:12] 10DBA, 10Data-Services, 10Operations, 10cloud-services-team (Kanban): Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (10bd808) a:03bd808
[19:14:46] 10DBA, 10Data-Services, 10Operations, 10cloud-services-team (Kanban): Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (10bd808) 05Open→03Resolved `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --database nqowiki --debug $...