[05:09:52] DBA, Operations, Patch-For-Review: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (Marostegui) Reserved window on the Deployments page: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1837698&ol...
[05:43:53] DBA, Operations: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (Marostegui)
[06:01:42] DBA, Operations, Patch-For-Review: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (Marostegui) p:Triage→Normal
[06:03:43] DBA, Operations, Patch-For-Review: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (Marostegui)
[06:05:17] DBA, Operations: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:10:05] Access denied for user '' to database 'labtestwiki'
[07:10:31] this is on cloudweb2001-dev
[07:12:28] wikiadmin?
[07:12:55] I am not sure that should have production access however
[07:13:23] So that host is trying to access wikitech
[07:13:33] I guess that is a result of the wikiadmin pass change
[07:13:52] but I don't think that host should have that pass
[07:14:02] a production pass on a testing host on cloud?
[07:14:15] I don't know if that is a testing host, is it?
[07:14:17] so anyone with access to that project has the production pass?
[07:14:29] the -dev says so
[07:14:54] that is https://labtestwikitech.wikimedia.org/
[07:15:06] which should be only a test host for wikitech
[07:15:18] codfw1dev deployment
[07:15:18] This is our testing/staging/devel openstack deployment.
[07:15:18] The setup is a mirror of the eqiad1 deployment.
[07:15:18] Current server list:
[07:15:40] I assume it is testing but for the current WMCS team, no?
[07:15:46] Not people outside of it
[07:15:55] Maybe bd808 can clarify that ^
[07:16:16] so anyone that gets given access to the project
[07:16:32] as per the project owners' discretion
[07:16:48] not the formal access deployers/production access get
[07:17:19] I don't know how that'd work
[07:17:24] The granted access list
[07:17:30] if it is staging/dev, it should not have production access
[07:19:49] I am going to comment on https://phabricator.wikimedia.org/T227476
[07:20:36] cool
[07:22:57] https://phabricator.wikimedia.org/T227476#5494715
[07:23:31] thanks
[07:24:53] interesting, there are some errors on wikitech too, but for web requests
[07:25:05] server: wikitech.wikimedia.org
[07:25:17] Error connecting to 10.64.32.12 as user wikiuser: Can't connect to MySQL server on '10.64.32.12'
[07:26:00] but is that an access denied or a cannot connect?
[07:26:24] error 110
[07:26:50] OS error code 110: Connection timed out
[07:27:23] wikitech connecting to an s4 slave?
[07:27:34] ah!
[07:27:45] it is the access we have banned to commons
[07:27:47] for images
[07:27:59] until it is on a s* section
[07:28:29] or actually, until it is on a mw production host
[07:28:43] known issue
[07:30:23] :)
[07:51:11] DBA: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 (Marostegui)
[07:53:57] DBA, Core Platform Team, MediaWiki-Page-derived-data, TechCom-RFC, and 2 others: Normalize MediaWiki link tables - https://phabricator.wikimedia.org/T222224 (jcrespo) > the size it consumes on disk No, the size that creates an unnecessary large amount of iops, memory and cpu cycles, causing perf...
[07:58:37] DBA, Patch-For-Review: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1021.eqiad.wmnet'] ` The log can be found in `/var/log/wm...
[08:16:11] DBA: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1021.eqiad.wmnet'] ` and were **ALL** successful.
[08:41:35] So I see we report 6 masters pending to switchover, but I only see 5, which one am I missing? (61, 62, 66, 67, 70)
[08:45:00] My bad probably then
[08:45:02] I will check later
[08:46:07] 66 isn't replaced yet as of today
[08:46:46] no, I mean the above are pending
[08:46:48] I don't know
[08:46:51] I will check later
[08:47:05] I have probably counted wrongly, will check once I am done with what I am doing now
[09:05:30] fixed it
[09:41:49] marostegui: hey, when you have time, please remake the file and merge the puppet patch to get the party started
[09:41:50] https://www.youtube.com/watch?v=mW1dbiD_zDk
[09:44:31] Amir1: sure, doing it
[09:44:58] thanks
[09:45:55] Amir1: done
[09:46:07] Commented on the task too
[09:46:33] let me know if that's all you need
[09:49:17] thanks!
[10:03:31] marostegui: This patch needs merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/535526 :D
[10:03:38] We forgot to start it
[10:59:35] Amir1: I am going to get some lunch, can I do it in 1h?
[10:59:46] Amir1: can you rebase it?
[11:02:06] Sure, I'm going to lunch too
[11:44:08] Amir1: I tried to rebase on gerrit but it doesn't work, I think you have to do it locally and then send the patch again
[11:57:44] DBA, Patch-For-Review: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 (Marostegui)
[11:58:17] DBA, Patch-For-Review: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 (Marostegui) dbproxy1021 has been placed in m5: ` root@cumin1001:~# mysql --skip-ssl -h dbproxy1017 -e "select @@hostname" +------------+ | @@hostname | +------------+ | db1133...
[12:07:09] Back from lunch, let me check
[12:09:02] sure
[12:20:18] marostegui: fixed
[12:20:45] checking
[12:21:09] cool, as soon as CI verifies the change I will merge it
[12:22:56] Amir1: merged!
[12:23:56] Thanks!
[12:48:07] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Rxy)
[12:51:10] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Rxy)
[12:53:08] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Marostegui) Do you foresee the need of having an index (or replacing the current ones) with those new columns? Also subscribing @Anomie here as he's been working on the migra...
[12:53:18] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Marostegui) p:Triage→Normal
[12:56:48] DBA, Operations: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (Marostegui) Reserved window on the deployments calendar: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1837750&oldid=1837737
[12:57:03] DBA, Operations: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (Marostegui) Reserved window on the deployments calendar: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1837750&oldid=...
[13:00:54] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Reedy) OOI, how deterministic is two columns using `AFTER cuc_private` in the same schema change? Does it insert one after `cuc_private`, then insert the second, pushing...
[13:02:40] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Rxy)
[13:04:07] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Rxy) >>! In T233004#5495636, @Reedy wrote: > OOI, how deterministic is two columns using `AFTER cuc_private` in the same schema change? Does it insert one after `cuc_private`...
[13:05:06] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Marostegui) There is actually no `cuc_private` column from what I can see on enwiki. Am I missing something? ` root@cumin1001:~# mysql.py -hdb1089 enwiki -e "show create tab...
[13:09:13] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Rxy) >>! In T233004#5495664, @Marostegui wrote: > There is actually no `cuc_private` column from what I can see on enwiki. > Am I missing something? > ` > root@cumin1001:~#...
[13:09:57] akosiaris: how crazy would it be to reimage backup* hosts into buster?
[13:10:20] jynus: none at all. Go for it now that we can
[13:10:27] that was my thought
[13:10:32] * akosiaris crosses fingers :P
[13:10:53] it has some mild inconveniences but it prevents larger ones later
[13:13:50] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Marostegui) >>! In T233004#5495665, @Rxy wrote: >> > > O_o I guess the last db schema change patch was not applied to the WMF production environment IMHO... > > rECHU [/archives/p...
[13:14:41] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Marostegui) @Rxy If that column is merged but not in production, I guess it is not used as the change is from 2012, and that code can be cleaned up?
[13:16:50] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Reedy) I guess it's not a feature we use with `$wgCUPublicKey` being `""` by default. So for the wikis without the column, we're not trying to read it, hence it not being no...
[13:19:04] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Marostegui) >>! In T233004#5495718, @Reedy wrote: > > Most of the newer wikis should have the column though Correct, the new ones do have it ` root@cumin1001:~# mysql.py -h...
[13:21:37] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Reedy) >>! In T233004#5495722, @Marostegui wrote: > The usual mess with half applied schema changes :( Well, not quite. It's just any wikis that were created with the `cu_ch...
[13:26:33] DBA, CheckUser, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Marostegui) I can easily do an `ADD COLUMN IF NOT EXISTS` or `DROP COLUMN IF EXISTS` and create/drop that `cuc_private` column as part of these schema changes if we find out...
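A minimal sketch of the idempotent idea behind the `ADD COLUMN IF NOT EXISTS` option mentioned in the last comment, for wikis where the column may or may not already be present. SQLite is used here only so the example runs standalone; it has no such clause, so a hypothetical helper `add_column_if_missing` checks `PRAGMA table_info` first. The table layout and the `BLOB` type are assumptions for illustration, not the production schema.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Add `column` to `table` unless it is already there.

    Returns True if the column was added, False if it already existed.
    Hypothetical helper; identifiers are trusted, not user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


conn = sqlite3.connect(":memory:")
# Assumed, simplified table: an "old" wiki created without cuc_private.
conn.execute("CREATE TABLE cu_changes (cuc_id INTEGER PRIMARY KEY, cuc_ip TEXT)")

first = add_column_if_missing(conn, "cu_changes", "cuc_private", "BLOB")
second = add_column_if_missing(conn, "cu_changes", "cuc_private", "BLOB")
print(first, second)
```

Running the same change twice is harmless, which is the property that matters when some wikis already have the column and others do not.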
[13:38:28] It's running
[13:38:41] https://www.irccloud.com/pastebin/FD0C6eUQ/
[13:43:16] DBA, CheckUser, Core Platform Team, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Anomie)
[13:50:24] thanks Amir1
[13:52:06] Amir1: there is this spike: https://logstash.wikimedia.org/goto/353ae696b401fe30c1eda5ec575021aa but not sure if related
[13:53:59] Yeah, it's because of it, since the number of inserts is rather high.
[13:54:36] can that be throttled?
[13:55:18] I can put more sleep or reduce batch size
[13:55:58] either of those makes this thing run slower, and it's already going to take a month
[13:56:02] but "found writes pending" is not a contention problem
[13:56:21] that would be if it had problems locking or writing or deadlocks
[13:56:39] that, AFAIK is "the commit logic is incorrect"
[13:56:49] oh that's scary
[13:56:51] because transactions are being opened in the middle of a transaction
[13:57:33] but that is happening mostly on centralauth, hence my question if it is related to this migration
[13:57:52] I would say that correlates better with hashar's deployment
[13:58:00] :-\
[13:58:13] I promoted all wikis to 1.34.0-wmf.22 yeah
[13:58:22] The wikidata errors are "Lock wait timeout exceeded; try restarting transaction (10.64.48.172)"
[13:58:23] but of course, that is a guess based only on time stamps
[13:58:37] Amir1: correct, that is contention and it is not that bad
[13:58:41] though wikidata should have been promoted last week on wednesday
[13:59:00] the other is a logical error and could lead to data loss
[13:59:02] oh, if it's problematic, I can make batches smaller
[13:59:24] that increase starts around 13:35 or so
[13:59:35] which kinda matches hashar's entry on SAL
[13:59:42] yeah
[13:59:43] * marostegui going to a meeting, but will keep an eye on this
[14:00:19] hashar: did you roll back? I don't see it happening anymore
[14:01:27] jynus: rolled back what? The train?
no :)
[14:01:42] it had like a spike of a few minutes and then it stopped
[14:01:52] maybe some specific jobs?
[14:01:56] yeah
[14:01:58] ah no, it was coming from mw
[14:02:10] also jobqueue
[14:02:19] centralauth related, several wikis
[14:02:54] hashar: feel help us keep an eye on those "found writes pending"
[14:03:04] s/feel/please/
[14:04:03] DBA, CheckUser, Core Platform Team, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Anomie) >>! In T233004#5495619, @Marostegui wrote: > Do you foresee the need of having an index (or replacing the current ones) with those new columns...
[14:11:23] DBA, CheckUser, Core Platform Team, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Rxy)
[14:12:41] DBA, CheckUser, Core Platform Team, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Rxy)
[15:07:21] DBA, Operations, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo)
[15:07:46] DBA, Operations, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['backup2001.codfw.wmnet'] ` The log can be fo...
[15:10:53] DBA, CPT Initiatives (Core REST API in PHP), Core Platform Team Workboards (Green): Compose Count Queries - https://phabricator.wikimedia.org/T231598 (Marostegui) >>! In T231598#5465711, @Anomie wrote: > >> [ ] anonedits: total number of anonymous edits > > `lang=sql > SELECT COUNT(DISTINCT revacto...
[15:12:09] stuck at "Loading debian-installer/amd64/initrd.gz..", sounds familiar?
[15:14:47] DBA, Operations, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['backup2001.codfw.wmnet'] ` The log can be found in `/var/log/wmf-a...
[15:16:01] DBA, Operations, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) Got stuck at kernel boot, could it be the same issue as T216240 ?
[15:17:41] It booted now, I think "Debian 10 (buster) amd64 (Wikimedia edition)"
[15:28:49] DBA, Operations, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (Marostegui) >>! In T229209#5496239, @jcrespo wrote: > Got stuck at kernel boot, could it be the same issue as T216240 ? Maybe, even if it is not, it wouldn't hurt to get t...
[15:31:29] DBA, Operations, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (MoritzMuehlenhoff) >>! In T229209#5496266, @Marostegui wrote: >>>! In T229209#5496239, @jcrespo wrote: >> Got stuck at kernel boot, could it be the same issue as T216240 ?...
[15:39:02] DBA, Operations, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) In other order of things, the RAID controller I think now has a random device id, so the boot installer failed. I am not sure we will be able to install it without...
[15:43:56] DBA, Operations, serviceops, Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) Sadly, I cannot set up the RAID remotely, because the server no longer boots and mgmt interface says: ` Unified Server Configurator does not support console redir...
[16:19:26] Amir1: lots of lock wait timeouts on INSERT INTO `wbt_item_terms`, I assume it is your script
[16:27:36] yup, if it's too much, let me know
[17:15:25] DBA, CPT Initiatives (Core REST API in PHP), Core Platform Team Workboards (Green): Compose Count Queries - https://phabricator.wikimedia.org/T231598 (Anomie) Note that's the wrong version of the query. But the correct version uses the same plan: ` wikiadmin@db1114(enwiki)> explain SELECT COUNT(*) F...
[19:17:22] DBA, CheckUser, Core Platform Team, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Reedy)
[19:21:41] DBA, CheckUser, Core Platform Team, Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Reedy) Open→Stalled Needs gerrit patch creating and merging before DBAs will action...
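The two throttling knobs raised earlier for the `wbt_item_terms` inserts (a smaller batch size or a longer sleep between batches) can be sketched roughly as follows. This is not the actual migration script: `insert_in_batches`, its parameters, and the two-column table are hypothetical, and SQLite stands in for the production database so the example is self-contained.

```python
import sqlite3
import time


def insert_in_batches(conn, rows, batch_size=100, sleep_seconds=0.0):
    """Insert `rows` in short transactions, pausing between batches.

    Smaller batches hold locks for less time; a longer sleep leaves more
    room for other writers. Both make the whole run take longer.
    Returns the number of batches committed.
    """
    batches = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # `with conn` commits on success, so each batch is its own transaction.
        with conn:
            conn.executemany(
                "INSERT INTO wbt_item_terms (item_id, term_id) VALUES (?, ?)",
                chunk,
            )
        batches += 1
        if start + batch_size < len(rows):
            time.sleep(sleep_seconds)  # throttle only between batches
    return batches


conn = sqlite3.connect(":memory:")
# Assumed, simplified table for illustration only.
conn.execute("CREATE TABLE wbt_item_terms (item_id INTEGER, term_id INTEGER)")

rows = [(i, i * 2) for i in range(250)]
batches = insert_in_batches(conn, rows, batch_size=100, sleep_seconds=0.0)
print(batches)
```

As noted in the discussion, this only addresses lock contention such as "Lock wait timeout exceeded"; it would not help with the "found writes pending" errors, which were described as a commit-logic problem.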