[06:02:05] DBA: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 (Marostegui)
[06:14:03] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0): Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (Marostegui) p:Triage→Normal
[08:31:47] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0): Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (Marostegui) This schema change can probably be done directly on the master, as the tables are quite small. Maybe `metawiki` will be done on the replicas (96M on...
[09:08:55] DBA, Data-Services: Reimport wikidatawiki.pagelinks on labsdb1010 - https://phabricator.wikimedia.org/T238399 (Marostegui)
[09:10:28] DBA, Data-Services: Reimport wikidatawiki.pagelinks on labsdb1010 - https://phabricator.wikimedia.org/T238399 (Marostegui) p:Triage→Normal a:Marostegui I will try to do this next week using mydumper (the table is 91GB) - I will have labsdb1010 depooled during the re-import time. We will proba...
[10:46:22] Blocked-on-schema-change, DBA, Core Platform Team: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 (Marostegui)
[10:56:43] this may be causing the spikes of errors for database replication check from mw https://phabricator.wikimedia.org/T231011
[10:58:07] but wouldn't that show euwiki on top of the wikis giving errors too?
[10:58:46] it is not euwiki
[10:58:51] it is wikidata
[10:59:01] but it affects app servers
[10:59:03] not dbs
[10:59:23] however, app server overload may make lag measurements unreliable
[10:59:53] because mw load balancer checks real-world clock lag, not transactional time lag
[11:01:17] * marostegui subscribed to the task
[11:02:03] my bet is that when that is solved, the regular "all dbs are lagged" will go away
[11:02:16] as we didn't see that from the db point of view
[11:02:20] yeah
[11:02:42] we'll see
[11:06:30] moritzm: what should we do with dbproxy2001, 2002 and db2004? do you want me to reboot them?
[11:11:19] yes, I haven't found the time yet, that would be great
[11:11:25] ok!
[11:11:42] No problem!
[11:11:48] I noticed though that 2001-2003 use mariadb::proxy::master role and 2004 is spare
[11:11:53] yep
[11:12:03] ok, just wanted to mention :-)
[11:12:08] maybe upgrade everything first?
[11:12:14] yep, that's the idea
[11:12:20] sounds good, thanks
[11:12:25] as there may be kernel updates and stuff
[11:12:50] if the alert about the microcode isn't silenced after reboot I'll have a closer look, we had a few servers which needed firmware/BIOS updates
[11:13:03] we'll see
[11:13:09] I am going to do dbproxy2001 first
[11:13:18] ack, thx
[11:17:29] moritzm: I have rebooted dbproxy2001 and 2004 (I have downtimed them) we'll see if the alert clears
[13:22:38] DBA, Data-Services: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399 (Marostegui)
[14:08:39] marostegui: btw. I improved the reads on the database for s8 by putting a cache on top of one of the modules (see https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms?panelId=2&fullscreen&orgId=1&from=now-48h&to=now and click on "select.SqlEntityInfoBuilder_collectTermsForEntities") but the spikes are still there
[14:16:32] oh wow, nice one
[14:17:34] I was trying to see if that is also noticeable on the slaves graphs
[14:45:40] DBA, Data-Services, Operations, cloud-services-team (Kanban): Prepare and check storage layer for gcrwiki - https://phabricator.wikimedia.org/T238114 (Dzahn) p:Triage→Normal
[14:45:57] DBA, Data-Services, Operations, cloud-services-team (Kanban): Prepare and check storage layer for shywiktionary - https://phabricator.wikimedia.org/T238115 (Dzahn) p:Triage→Normal
[15:10:23] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0), Patch-For-Review: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (Anomie) >>! In T238370#5666300, @Marostegui wrote: > The table being created: `oauth2_access_tokens` I assume it will be private, right?...
[16:10:28] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0), Patch-For-Review: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (Marostegui) >>! In T238370#5666914, @Anomie wrote: >>>! In T238370#5666300, @Marostegui wrote: >> The table being created: `oauth2_access_t...
[16:36:56] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0), Patch-For-Review: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (Anomie) For the record, ` anomie@mwmaint1002:~$ sql labswiki Reading table information for completion of table and column names You can tur...
[16:55:53] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0), Patch-For-Review: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (jcrespo) I doubt labtestwiki has replicas...
[17:02:11] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0), Patch-For-Review: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (Marostegui) labtestwiki lives somewhere within cloud land, I don't remember exactly the hostname, let's wait for @Andrew. I guess `sql` to...
[17:45:35] marostegui, jeh, is it time to start rebuilding our views?
[17:48:00] andrewbogott: anytime works for me
[17:48:25] andrewbogott: green light from my side
[17:51:06] ok, I'll narrate what I'm doing. First, going to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550878/
[17:51:24] um, wait...
[17:51:28] * andrewbogott tries to find the first patch in that set
[17:51:51] that would be https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550888/
[17:54:22] applying those patches on labsdb1009-1012
[17:55:42] now running `systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio` on dbproxy1011
[17:56:30] I'm not sure what "wait for depool to take effect (check with socat /run/haproxy/haproxy.sock stdio)" means
[17:56:53] jeh, do you know what the 'check with' means in this context?
[17:57:37] yeah, you should see that the host being depooled is no longer a backend host
[17:58:26] did you run puppet agent on dbproxy1011?
[17:58:37] hm...
[17:58:39] maybe not, will retry
[17:59:08] IIRC it should remove labsdb1009 from /etc/haproxy/conf.d/db-replicas.cfg
[17:59:37] my main point of confusion is that 'socat /run/haproxy/haproxy.sock stdio' just returns without any output
[17:59:55] ah, try `echo "show stat" | socat /run/haproxy/haproxy.sock stdio`
[18:00:16] that looks more useful
[18:00:30] ugly paste ahead:
[18:00:32] https://www.irccloud.com/pastebin/kF8UqYQG/
[18:00:35] Looks like it worked though
[18:00:45] yep, labsdb1009 is out
[18:00:46] * andrewbogott updates docs
[18:02:08] ok, now on labsdb1009… I see some tools with connections in 'sleep' state.
[18:02:14] I will kill them!
[18:02:38] um… am I doing a normal bash 'kill' or something specific to mysql?
[18:04:03] no, within mysql you'll use `kill `
[18:04:54] ok
[18:05:14] and now, in a screen, going to run "# maintain-views --all-databases --all-tables --clean"
[18:05:20] and then wait many hours, presumably
[18:05:55] yep, I'll attach to the screen session too
[18:06:04] hm, I guess there's no --all-tables
[18:06:14] so just maintain-views --all-databases --clean
[18:06:29] uhoh, is this going to prompt me to replace every view?
[18:06:41] maybe `--replace-all` too?
[18:06:43] * jeh is not sure
[18:06:54] yep, that's better
[18:06:59] * andrewbogott thought that --clean would cover that
[18:08:07] now, lunchtime I guess
[18:09:30] looks like clean is for `Clean out views from _p db that are no longer specified`, which is part of what we wanted here
[18:09:47] makes sense
[18:47:50] jeh: the run finished and I see "Table 'enwiki_p.globalblocks' doesn't exist"
[18:47:54] is there anything else you would check?
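The two manual steps discussed above (confirming in the `show stat` output that the depooled host is no longer a backend server, then killing connections left in 'sleep' state) could be sketched roughly as below. This is only an illustrative sketch, not the team's tooling: the function names, the backend name `mariadb`, the account prefix `s5`, and the sample data are all hypothetical. It assumes the `show stat` text is HAProxy's CSV output, whose header line starts with `# pxname,svname`, and that processlist rows are dicts keyed like the columns of `SHOW PROCESSLIST`.

```python
import csv
import io


def parse_show_stat(output):
    """Parse the CSV text printed by HAProxy's `show stat` into a list of dicts."""
    text = output.lstrip()
    # The header line starts with "# "; strip it so DictReader sees plain names.
    if text.startswith("# "):
        text = text[2:]
    return list(csv.DictReader(io.StringIO(text)))


def pooled_servers(output, backend):
    """Return server names listed under `backend`, skipping the aggregate rows."""
    return [
        row["svname"]
        for row in parse_show_stat(output)
        if row["pxname"] == backend
        and row["svname"] not in ("FRONTEND", "BACKEND")
    ]


def kill_statements(processlist, user_prefix=""):
    """Build one KILL statement per sleeping connection in the given rows."""
    return [
        "KILL {};".format(int(row["Id"]))
        for row in processlist
        if row["Command"] == "Sleep" and row["User"].startswith(user_prefix)
    ]


# Hypothetical sample data, trimmed to the columns used here.
SAMPLE_STAT = (
    "# pxname,svname,status\n"
    "mariadb,FRONTEND,OPEN\n"
    "mariadb,labsdb1010,UP\n"
    "mariadb,labsdb1011,UP\n"
    "mariadb,BACKEND,UP\n"
)
SAMPLE_PROCESSLIST = [
    {"Id": 101, "User": "s51234", "Command": "Sleep"},
    {"Id": 102, "User": "s51234", "Command": "Query"},
    {"Id": 103, "User": "root", "Command": "Sleep"},
]

if __name__ == "__main__":
    servers = pooled_servers(SAMPLE_STAT, "mariadb")
    print("labsdb1009 depooled:", "labsdb1009" not in servers)
    for statement in kill_statements(SAMPLE_PROCESSLIST, user_prefix="s5"):
        print(statement)
```

The statements are only generated, not executed, so they can be reviewed before being pasted into a mysql session on the depooled host.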
[18:49:18] I didn't expect that message since we removed that table from maintain_views.yaml
[18:49:32] that's in response to select * from globalblocks;
[18:49:37] just making sure it's gone
[18:49:44] oh ok cool
[18:49:55] I'm repooling 1009 and will depool 1010
[18:50:27] I'm not sure of what other checks would be useful offhand
[18:54:50] ok, I'm going to try to stop pinging you but I'll log my progress in -cloud
[18:55:22] ok
[19:39:55] DBA, Operations, SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn)
[19:44:28] DBA, Operations, SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn) @DBA fyi. I suggest I can puppetize that Andre gets a my.cnf written to his home dir somewhere with the existing "metrics_user" f...
[19:52:45] DBA, Operations, SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn) a:Dzahn
[19:52:53] DBA, Operations, SRE-Access-Requests: Read access for aklapper to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn) p:Triage→Normal
[20:57:38] DBA, Operations, SRE-Access-Requests: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Dzahn)
[22:57:09] Blocked-on-schema-change, DBA, CPT Initiatives (OAuth 2.0), Patch-For-Review: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 (Andrew) >>! In T238370#5667323, @Marostegui wrote: > labtestwiki lives somewhere within cloud land, I don't remember exactly the hostname,...
[23:01:07] DBA, Phabricator, Release-Engineering-Team-TODO, Documentation, and 2 others: Prepare a disaster recovery plan for failing over Phabricator - https://phabricator.wikimedia.org/T190572 (mmodell) Some scenarios that we should describe and test: 1. A simple failure of the phabricator server, e.g. a...
[23:28:02] DBA, Operations, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Agusbou2015) There is a typo in the date: it should be "Tue 26th Nov", not "Tue 24th Nov", although the correct date is shown in the title.
[23:51:54] DBA, Operations, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (JJMC89)