[02:53:50] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10WMDE-leszek) To bring some clarity on who said what, I believe the statement that SDC cannot work with Wikibase-related tables being on different server to which the discussion above refers to was made by myself in T68108#5268031....
[05:19:24] 10DBA: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 (10Marostegui)
[05:26:53] db1114 (percona host) cannot be accessed from cumin, so I haven't moved it under db1083, I will tackle it later
[05:38:24] yeah, permissions were a bit not final
[05:38:44] no problem, I will move it after the failover :)
[05:45:16] let me stop its replication completely, so it is invisible
[05:45:29] there is no need to, I did all the moves already
[05:45:40] it will just hang from db1067 and I will move it after the failover
[05:46:34] yeah, I know, but I just want to be 100% sure it will not be a factor for the switchover
[05:46:43] ok :)
[05:46:57] it should not be, but I prefer not to take chances
[05:47:01] +1
[05:47:31] how did it behave on --only ?
[05:47:45] just warned and continued?
[05:48:18] No, it never attempted to move it (I was actually wondering why)
[05:48:27] ah, I know
[05:48:36] if it cannot connect, it cannot test it is a replica
[05:48:47] I thought it was using show slave hosts to gather the replicas
[05:48:49] so it is assumed not, so not touched
[05:48:51] and it does appear there
[05:49:14] yes, but then it double checks the master shows on show slave status
[05:49:35] aaaaah right
[05:49:36] because we don't have the report host option enabled
[05:49:36] I see
[05:49:53] so if it cannot connect it doesn't even error
[05:49:54] I see
[05:49:55] I don't like to do that, but it is a safer option for now
[05:50:10] it does error, but I ignore it on switch
[05:50:26] it probably would warn at least
[05:50:32] anyway, I am stopping it
[05:50:36] +1
[05:50:44] I will lead and you do monitoring?
[05:51:03] sure
[05:51:54] db1067-bin.003024:424330511
[05:52:00] (stop location)
[05:52:04] for db1114
[05:52:19] cool
[05:52:56] see how even icinga will register it because it tries to execute show all slaves status
[05:53:02] *will not
[06:12:15] 10DBA, 10Operations, 10Patch-For-Review: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800 (10Marostegui)
[06:13:23] I am going to start replication again on db1114
[06:16:07] maybe easier to do it until?
[06:16:17] (read only)
[06:16:36] I need to stop db1067 replication anyways to finish a schema change, so I will move db1114 up
[06:17:02] yes, I mean to replicate until the read only coords, so you get the switch for free
[06:17:24] not sure if you see what I mean
[06:17:28] no :(
[06:17:38] can I still blame jetlag? :)
[06:17:40] so we should have the read only coords for both
[06:17:51] it should be printed by the switchover script
[06:18:05] so start slave until db1067 coords, then switch master
[06:18:29] to db1083 coords
[06:18:51] the regular script won't work
[06:19:05] oh, or you mean just stop replication on db1067
[06:19:09] that will work too
[06:19:10] yeah
[06:19:11] sorry
[06:19:15] because I have to stop it anyways :)
[06:19:20] yeah, sorry
[06:19:22] so I just built the change master command
[06:19:25] I didn't think about it
[06:19:29] sure
[06:19:51] actually, in that case, master stopped, the script would identify it too
[06:20:01] but I have to check permissions
[06:20:08] to see why it doesn't connect
[06:20:17] some ssl stuff or something
[06:20:22] interesting, I did the change, but tendril doesn't update db1114's master
[06:20:34] yeah, I also didn't touch tendril
[06:20:46] I have to recreate it, and apply your patches
[06:20:56] you can delete it if it annoys you
[06:20:59] from tendril
[06:21:08] nah, I will modify it manually to reflect the correct master
[06:22:18] checking job-queue-health
[06:22:27] is it giving issues?
[06:22:45] not at the moment, but checking it will not
[06:23:54] there was some lag during read only
[06:26:38] 10DBA, 10Operations, 10Patch-For-Review: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800 (10Marostegui) This was done. Read only start: 06:00:28 Read only stop: 06:01:39 Total read only time: 01:11 minutes
[06:28:36] haha of course changing m_master_id directly on the servers table was too easy to make db1114 show its real master
[06:31:10] is 171970661 the right new master ip?
[06:31:33] yep
[06:31:51] cool, that is what the threads are waiting on
[06:37:25] mmm, dbtree down?
[06:37:33] ah no
[06:37:35] it was me
[06:39:09] I reported https://phabricator.wikimedia.org/T238296
[06:39:32] oh
[06:39:36] good one
[06:41:01] 10DBA, 10Operations, 10Patch-For-Review: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T234800 (10Marostegui) 05Open→03Resolved
[06:41:04] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui)
[06:41:07] 10DBA, 10Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui)
[06:41:33] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10DannyS712) 05Stalled→03Open >>! In T210713#4983770, @Marostegui wrote: > Stalling this until we have failed over s1 master, as it is impossi...
[06:42:02] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui) Hehe, thanks! I am actually running the schema change already :)
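[Note: the "start slave until the read only coords, then switch master" idea discussed above is a plain binlog-coordinate repoint. A minimal sketch follows, with placeholder file/position values; the real coordinates were only printed by the switchover script, and in the end the simpler route (stop replication on db1067 and build a single change master command) was taken, so nothing below is the exact statement that was run.]

    -- Run on the replica being moved (db1114 in this log). Coordinates are
    -- placeholders, not the real switchover positions.
    STOP SLAVE;
    -- Replay up to the old master's read-only position:
    START SLAVE UNTIL MASTER_LOG_FILE = 'db1067-bin.999999', MASTER_LOG_POS = 12345;
    -- Once that position is reached, repoint at the new master using the
    -- coordinates that correspond to the same read-only point on db1083:
    STOP SLAVE;
    CHANGE MASTER TO
      MASTER_HOST = 'db1083.eqiad.wmnet',
      MASTER_LOG_FILE = 'db1083-bin.999999',
      MASTER_LOG_POS = 12345;
    START SLAVE;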
[06:42:11] see how read next spiked a bit during the switch: https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&from=1573702911576&to=1573713711576&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104
[06:42:40] 10DBA: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (10Marostegui)
[06:43:06] also, at 6:30, puppet ran and switched the master based on dynamic data on zarcillo
[06:43:08] 10DBA: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (10Marostegui) db1067 is no longer a master, but let's wait a few days before actually start its decommissioning process.
[06:43:36] weird spike indeed
[06:43:39] we may want to run a thorough data check
[06:44:25] i guess it is one of those queries that arrive to the master and hit the new master
[06:44:27] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10DannyS712) >>! In T210713#5662714, @Marostegui wrote: > Hehe, thanks! I am actually running the schema change already :)
[06:47:12] there are a few aborted clients, did you run the event changing scripts?
[06:47:18] yep
[06:47:34] let me see if they match the old rate
[06:48:35] yeah, those were common before: https://grafana.wikimedia.org/d/000000273/mysql?panelId=10&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=1573627702836&to=1573714102836
[06:48:43] I was checking that too hehe
[06:51:26] similar rate as before for "Timed out waiting for replication to reach {raw_pos}"
[06:51:38] that's "good"
[06:52:02] however, I am seeing complaints about db1067
[06:52:11] which seems to be pooled while lagging
[06:52:20] "Server db1067 has 2081.548388958 seconds of lag (>= 6)"
[06:52:31] ah right
[06:52:36] It has 0 weight
[06:52:37] but it is pooled
[06:52:39] let me depool it
[06:52:54] just to aboid log spam
[06:52:57] *Avoid
[06:53:12] done
[06:53:15] thanks for the heads up
[06:54:03] there are a few complaints, mostly on codfw
[06:54:39] and some on db1093
[06:55:00] db1093 isn't s1 I think
[06:55:07] that's s6 indeed
[07:31:47] 10DBA, 10Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui)
[07:53:32] 10DBA: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 (10Marostegui)
[08:01:04] 10DBA, 10Core Platform Team, 10MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), 10Performance Issue, 10mariadb-optimizer-bug: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 (10Marostegui) >>! In T223151#5602487, @Marostegui wrote: > I have analyze...
[08:02:19] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui) db1067 has finally been altered: ` root@db1067.eqiad.wmnet[enwiki]> ALTER TABLE /*_*/change_tag MODIFY ct_tag_id int unsigned NOT NU...
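[Note: the earlier exchange about replica detection (show slave hosts is not trusted on its own because report_host is not enabled, so the tooling cross-checks show slave status) and the post-switch lag complaints come down to the same manual checks. A minimal sketch, not the actual switchover script or the Icinga check:]

    -- On the master: hosts that registered themselves as replicas. Without
    -- report_host this list is not authoritative, hence the cross-check below.
    SHOW SLAVE HOSTS;

    -- On each candidate replica: confirm which master it actually follows
    -- (Master_Host should be db1083.eqiad.wmnet after this switchover) and
    -- that Slave_IO_Running / Slave_SQL_Running are both Yes.
    SHOW SLAVE STATUS\G

    -- Multi-source MariaDB hosts need the variant Icinga runs, as mentioned above:
    SHOW ALL SLAVES STATUS\G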
[08:02:40] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui) 05Open→03Resolved
[08:23:53] 10DBA, 10Patch-For-Review: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2133.codfw.wmnet', 'db2134.codfw.wmnet', 'db2135.codfw.wmnet'] ` The log can be found in `/var/log/...
[08:46:43] 10DBA, 10Patch-For-Review: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2133.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2133.codfw.wmnet'] `
[08:49:24] 10DBA, 10Patch-For-Review: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2133.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911140849_marostegui_198...
[08:51:30] 10DBA: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 (10Marostegui)
[09:13:23] 10DBA: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2133.codfw.wmnet'] ` and were **ALL** successful.
[09:15:49] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Ladsgroup) Let me clarify one thing here, by not possible I mean, it will be possible after one day of work and two days of test.
[09:46:06] 10DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (10Marostegui) I think it should be said that right now we are not in a super urgent position to get things cleaned up/split/moved or something similar. But we need to make sure we are aware that at some point commonswiki will no lon...
[15:55:28] 10DBA, 10MediaWiki-Logging, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), and 3 others: Page creation log cannot be viewed from oldest records, Fatal: "execution time limit of 60 seconds was exceeded" - https://phabricator.wikimedia.org/T237026 (10Anomie)...
[16:43:24] 10DBA, 10MediaWiki-Logging, 10Core Platform Team Workboards (Clinic Duty Team), 10MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), and 3 others: Page creation log cannot be viewed from oldest records, Fatal: "execution time limit of 60 seconds was exceeded" - https://phabricator.wikimedia.org/T237026 (10Marostegu...
[17:13:22] jynus or marostegui, we're going to work on T237509, can you confirm that you're not doing anything that will conflict?
[17:13:23] T237509: maintain views has to run or has to be updated to fix errors on globalblocks and protected_titles for wikireplicas - https://phabricator.wikimedia.org/T237509
[17:14:37] probably better if done tomorrow, there is an alter running now in the replicas
[17:14:42] might take a few more hours
[17:15:54] andrewbogott: I will fix protected titles tomorrow as well, as I need to run that script tomorrow anyways
[17:16:21] marostegui: ok — does that mean I should just ignore this and everything will get caught tomorrow?
[17:17:06] andrewbogott: not all, for globalocks
[17:17:15] that still needs fixing by WMCS
[17:17:33] ruwiki_p.protected_titles will be fixed by me tomorrow
[17:17:50] marostegui: my understanding was that all I needed to do for that task was depool, run maintain_views, repool on each db server
[17:18:02] with --clean --all-databases
[17:18:05] so it's all or nothing, isn't it?
[17:18:09] * andrewbogott is probably missing something
[17:19:21] don't know, what I do is --replace for the other tables, don't know for globallocks
[17:20:11] don't know if --clean will get stuck till the alter finishes on: waiting for metadata locking
[17:20:19] maybe it is worth waiting for tomorrow
[17:20:24] once the alters are fully done
[17:20:28] We should definitely wait
[17:20:33] cool
[17:20:40] I will fix protected titles tomorrow
[17:20:40] I just think that after you make your changes we can get by with one round of depooling/updating
[17:20:49] ok! lmk when you're ready and we'll coordinate
[17:20:52] ok!
[17:20:54] will do
[17:20:55] thanks
[18:11:48] I am fiddling with dbctl schemata right now, please don't push anything
[18:19:36] done
[18:20:00] for reference -- https://phabricator.wikimedia.org/T233236 and https://phabricator.wikimedia.org/P9638
[20:18:28] a heads up DBAs, section 'wikitech' has been replaced by section 's10'.
[20:50:47] 10Blocked-on-schema-change, 10DBA, 10CPT Initiatives (OAuth 2.0): Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (10Anomie)
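[Note on the maintain_views exchange above: the script manages the _p views exposed on the wiki replicas, and a run with --replace or --clean ends up (re)defining views roughly like the sketch below. The database, view name and column list are illustrative assumptions, not the production definition, which lives in the maintain-views configuration. Redefining a view needs to open the underlying table, which is why running it while the long alters mentioned earlier were still in flight risked getting stuck "waiting for metadata locking", and why everyone agreed to wait until the alters finished.]

    -- Hedged illustration only: names and columns are assumptions for the example,
    -- not what maintain_views actually generates in production.
    CREATE OR REPLACE
      SQL SECURITY DEFINER
    VIEW ruwiki_p.protected_titles AS
      SELECT pt_namespace, pt_title, pt_timestamp, pt_expiry, pt_create_perm
      FROM ruwiki.protected_titles;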