[05:30:33] DBA, Goal, Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (Marostegui)
[05:30:51] DBA, Operations, ops-eqiad, Goal, User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui)
[05:30:53] DBA, Goal, Patch-For-Review: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 (Marostegui) Open→Resolved All these hosts are now provisioned
[05:37:59] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[05:38:11] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui)
[05:47:02] DBA, Beta-Cluster-Infrastructure, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326 (Marostegui) I have dropped this table from s3 (testwikidatawiki) which wasn't w...
[05:47:31] DBA, Beta-Cluster-Infrastructure, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326 (Marostegui)
[05:47:42] DBA, Beta-Cluster-Infrastructure, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326 (Marostegui)
[05:48:17] DBA, Beta-Cluster-Infrastructure, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326 (Marostegui)
[05:56:58] DBA, Beta-Cluster-Infrastructure, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326 (Marostegui) Deletion process for s8 (wikidata). The table is 6GB there. Not wri...
[06:09:40] DBA, Beta-Cluster-Infrastructure, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326 (Marostegui)
[06:09:51] DBA, Epic, Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (Marostegui)
[06:09:54] DBA, Beta-Cluster-Infrastructure, Reading-Infrastructure-Team-Backlog, WikimediaEditorTasks, and 2 others: Drop the `wikimedia_editor_tasks_entity_description_exists` table - https://phabricator.wikimedia.org/T226326 (Marostegui) Open→Resolved All done
[06:24:40] DBA, MediaWiki-Database, Operations, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (Marostegui) @jcrespo you ok if I copy dewiki.logging into db1114? I would like to see the...
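For context, large table drops like the T226326 one above are typically run host by host so the DDL does not replicate down the chain; the truncated comment ("Not wri...") suggests the statement was kept out of the binlog. A minimal SQL sketch of that pattern, assuming direct access to each replica and privileges to disable binary logging; the rename-first step and the `_old` suffix are illustrative additions, not the commands actually used:

  -- Sketch only: drop the table on a single host without writing the
  -- statement to the binary log, so replicas below are not affected.
  SET SESSION sql_log_bin = 0;

  -- Hypothetical safety step: rename first, drop once nothing references it.
  RENAME TABLE wikidatawiki.wikimedia_editor_tasks_entity_description_exists
            TO wikidatawiki.wikimedia_editor_tasks_entity_description_exists_old;
  DROP TABLE wikidatawiki.wikimedia_editor_tasks_entity_description_exists_old;

  SET SESSION sql_log_bin = 1;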
[06:50:40] DBA, MediaWiki-Database, Operations, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (jcrespo) > @jcrespo you ok if I copy dewiki.logging into db1114 Sure, if you do it in it...
[06:52:04] DBA, MediaWiki-Database, Operations, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (Marostegui) >>! In T193224#5285040, @jcrespo wrote: >> @jcrespo you ok if I copy dewiki.l...
[07:07:17] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui) I had a chat with @mark and we are considering this Q4 goal done: * The 13 eqiad hosts were racked, installed and provisioned {T211613} *...
[07:07:22] DBA, MediaWiki-Database, Operations, Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (jcrespo) Ping @Anomie We have temporarily set up db1114 with MariaDB 10.3 and load it with...
[09:04:00] I may need help with a review
[09:04:08] although it is not time-sensitive
[09:05:20] sure, send the patch!
[09:05:29] "it's complicated"
[09:05:44] https://phabricator.wikimedia.org/P8658
[09:06:05] I will check in a bit
[09:06:15] let me generate pretty files first
[09:06:18] ok
[09:06:22] will ping you for a review
[09:06:33] good, thanks!
[09:57:13] https://phabricator.wikimedia.org/P8658 and https://phabricator.wikimedia.org/P8659 can be checked
[09:57:33] there are no groups yet, I wonder if we should add those to zarcillo
[09:57:50] as groups don't really correlate perfectly with sections
[09:58:17] e.g. an s1 host could be on "labs (sanitarium)"
[09:58:25] or on dbstore
[10:01:40] can you give me some more context?
[10:01:59] so prometheus has several "groups"
[10:02:09] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1
[10:02:30] core, dbstore, labs, misc, parsercache
[10:02:39] yep
[10:02:42] those are useful for discovery reasons (aggregation)
[10:02:53] maybe those should also be on the database
[10:04:16] e.g. assuming groups are exclusive, the instance table should have a group property
[10:04:36] so we can differentiate core, misc, labsdb hosts
[10:07:46] And you want to use that script to generate those?
[10:08:31] I want to generate prometheus config files
[10:08:43] ah the ones we have to edit manually
[10:08:45] nice
[10:09:21] the thing is I've seen some problems
[10:09:26] sections without a master
[10:09:29] and some with 2 masters
[10:09:43] I need help to double check that the data is the same as the current one, and true
[10:11:07] which one has two masters? I cannot see that line
[10:11:40] DBA, Operations, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (jcrespo) Stalled→Open a: jcrespo
[10:12:13] https://phabricator.wikimedia.org/P8659$125
[10:12:19] no master on s4 ^
[10:12:24] Ah I was checking eqiad
[10:12:41] https://phabricator.wikimedia.org/P8659$165 ^ 2 masters on s2
[10:12:47] interesting, s4 eqiad works but s4 codfw fails
[10:12:58] ?
[10:13:04] s4 eqiad does have a master
[10:13:16] what do you mean?
[10:13:26] that the generation for eqiad worked for s4
[10:13:29] I don't think it is a programming error
[10:13:32] and failed for codfw
[10:13:34] it is a data error
[10:13:36] I am not saying it is
[10:13:49] zarcillo is likely to be wrong
[10:14:20] So I failed over s4 codfw a few weeks ago
[10:14:39] oh, so you mean switchover.py failed?
[10:14:43] let me see
[10:14:52] No, s4 codfw wasn't done with the script
[10:14:54] | s4 | codfw | db2051 |
[10:15:07] db2051 is no longer the s4 master
[10:15:15] I forgot to update zarcillo for codfw
[10:15:43] np, that is 1) why I want your help to review it
[10:15:50] 2) automate it soon
[10:16:06] let me review all codfw masters just in case
[10:16:23] we could also do a diff
[10:16:32] if I implement the group thing on the database
[10:19:25] if you are ok, I will implement the grouping
[10:19:36] yeah
[10:19:38] that sounds good
[10:19:43] Let me fix the masters in codfw though
[10:19:48] sure
[10:19:53] doesn't have to be now
[10:20:02] I am not in a hurry
[10:20:07] shouldn't take long, I have all the moves noted :)
[10:20:31] but if we do this, we can consider zarcillo as canonical from now on
[10:21:00] and stop maintaining it in 30 different places
[10:21:08] that'd be awesome
[10:22:35] masters in codfw should be good now
[10:22:38] can you re-run it?
[10:22:43] sure
[10:23:32] reload the paste
[10:24:36] that looks better
[10:24:44] cool, thanks!
[10:25:06] thank you!
[10:25:17] maybe I can add quick scripts for both switchover and adding a host
[10:25:39] GET ALL THE AUTOMATION
[10:25:55] haha
[10:26:29] will continue working on this, may request help and reviews at a later time
[10:26:38] sure! happy to do so
[10:58:36] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Ladsgroup) Regarding Cognate going read-only, I want to point to T187960#4998807 (I can run the maintenance script after it's d...
[11:28:10] another difference, we should check db1077 on prometheus/zarcillo
[11:29:15] db1121 is missing from prometheus, I think
[11:30:19] and should not be on s5
[11:34:58] and for some reason db2101:13320 appears on eqiad, but not sure why
[11:36:20] ah, because server == db2101.eqiad.wmnet
[11:43:18] tools dbs are missing from prometheus
[11:45:04] db1135 for m5 is wrong on prometheus
[12:05:41] DBA, Epic: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[12:05:48] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (jcrespo) Open→Resolved a: jcrespo 4th time in a row with 0 failures, this is done.
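The zarcillo review above (missing or duplicate masters, the proposed exclusive group per instance, the stale s4/codfw row, and the db2101.eqiad.wmnet mismatch) boils down to a few queries. The sketches below assume hypothetical `sections(section, dc)`, `masters(section, dc, instance)`, `instances` and `servers` tables, since the real zarcillo schema is not shown in the log; `<new_master>` is a placeholder because the host that replaced db2051 is never named:

  -- Sanity check for the config generator: flag any (section, dc) pair
  -- that does not have exactly one master row.
  SELECT s.section, s.dc, COUNT(m.instance) AS n_masters
  FROM sections s
  LEFT JOIN masters m USING (section, dc)
  GROUP BY s.section, s.dc
  HAVING n_masters <> 1;      -- catches both missing (0) and duplicate (2+) masters

  -- Fix the stale s4/codfw row once the real master is known.
  UPDATE masters
  SET instance = '<new_master>'     -- placeholder, not named in the log
  WHERE section = 's4' AND dc = 'codfw' AND instance = 'db2051';

  -- The proposed group property, mirroring the prometheus aggregation
  -- groups, so target files could be generated per group.
  ALTER TABLE instances
    ADD COLUMN `group` ENUM('core', 'dbstore', 'labs', 'misc', 'parsercache');

  -- Catch db2101.eqiad.wmnet-style rows on the servers table: the numeric
  -- prefix of a host name (db1xxx = eqiad, db2xxx = codfw) should match
  -- the datacenter in the FQDN.
  SELECT server FROM servers
  WHERE (server LIKE 'db1%' AND server NOT LIKE '%.eqiad.wmnet')
     OR (server LIKE 'db2%' AND server NOT LIKE '%.codfw.wmnet');

With a group column like that, zarcillo could indeed be treated as canonical and the hand-edited prometheus files derived from it, as discussed above.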
[12:06:43] DBA, Operations, ops-codfw: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (jcrespo) a: jcrespo
[12:30:34] I will fix those issues
[12:31:04] the toolsdb hosts, I don't think we ever moved them to zarcillo after labs1005 and 1004 were decommed
[13:08:29] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519223/
[13:11:50] fixed zarcillo for db2102
[13:12:01] (and checked for similar ones)
[13:12:04] on the servers table
[13:16:01] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) @Ladsgroup I believe that last time it wasn't necessary, but I am not 100% sure
[13:39:20] just FYI, I think I caused a minor traffic spike on the s8 master: https://w.wiki/5KP
[13:39:29] feel free to raise the priority of https://phabricator.wikimedia.org/T226635 if you think it's important
[13:40:19] to me it looks harmless for now
[13:40:40] Lucas_WMDE: yeah, it doesn't look like a big deal
[13:41:02] ok good :)
[13:41:11] Lucas_WMDE: it is supposed not to happen again, right?
[13:41:16] I meant it was just that one
[13:41:18] no?
[13:41:35] we have no concrete plans for running that maintenance script again
[13:41:40] might become necessary at some point
[13:41:51] but we can leave the improvement until then
[13:42:20] I think the last time we ran it was over a year ago: https://tools.wmflabs.org/sal/production?p=0&q=ImportConstraintStatements.php&d=
[13:42:23] "The maintenance script currently imports constraint statements for all properties in a simple loop, with no offset, limit, batching, or sleeping" -> that would be nice to get fixed before it runs again
[13:44:42] DBA, MediaWiki-API, Performance: list=logevents slow for users with last log action long time ago - https://phabricator.wikimedia.org/T71222 (Marostegui) I wanted to test this issue with 10.3 on db1114. I copied the `logging`, `page` and `user` tables from `dewiki` from one of the hosts that have the weir...
[16:00:17] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Ladsgroup) >>! In T226358#5286014, @Marostegui wrote: > @Ladsgroup I believe that last time it wasn't necessary, but I am not 100%...
[16:02:10] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) >>! In T226358#5286495, @Ladsgroup wrote: >>>! In T226358#5286014, @Marostegui wrote: >> @Ladsgroup I believe that last...
[17:19:33] jynus / marostegui Hello. How's it going? Is that DB change already applied?
[17:20:45] we need to ask anomie, would you mind pinging him on the ticket?
[17:20:57] sure
[17:20:59] to check if it is already done or there is a blocker
[17:21:17] I'll log in to phab
[17:56:53] Seems labsdb1010 is running at a load avg of over 30, and replag is crazy on the wiki replicas. Is anything going on... or is this just a matter of wondering how to stop folks from thrashing the web replicas? The analytics replica is barely being touched.
[17:57:37] I don't know, I know manuel may be running maintenance on some host
[17:57:40] bstorm: 1011 is depooled for maintenance
[17:57:47] I was planning to pool it back tomorrow
[17:57:48] That explains why it is quiet :)
[17:57:59] Perhaps that is why 1010 is so busy as well
[17:58:03] yeah
[17:58:09] most likely
[17:58:12] the problem is there is no way to avoid it
[17:58:17] either death by load
[17:58:18] Ok, cool. Any thoughts on the 40 hours or so of lag?
[17:58:23] but we have to get that maintenance done, or else we will hit disk space problems :(
[17:58:27] or death by disk space
[17:58:37] Yeah :)
[17:58:44] That's a hard one to fix
[17:58:46] death everywhere
[17:58:55] 🧟‍♀️
[17:59:01] the 40h lag is probably because it is getting hammered heavily
[17:59:03] bstorm_: start killing random long processes
[17:59:13] or lower the long query time
[17:59:28] once I pool 1011 it should also decrease I guess
[17:59:40] or move quarry away?
[17:59:48] It's only s4 and s8
[17:59:50] those are some suggestions
[17:59:53] lol
[18:00:02] Fair
[18:00:06] no, the last one is reasonable
[18:00:14] quarry is bound to web or analytics
[18:00:23] Ah ok
[18:00:25] I see
[18:00:28] if the one used for quarry is overloaded
[18:00:29] bstorm_: yeah, keep in mind that s1, s4 and s8 are the heaviest sections
[18:00:35] it can be a way to play with it
[18:00:51] the second one is reasonable too
[18:01:14] if we can only attend to some queries, temporarily lower the cutoff for the long-running ones
[18:01:27] Ok :)
[18:01:37] Thanks!
[18:06:07] you can also switch s4 and s8 only on the config
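Of the mitigations floated above, killing long-running queries by hand can be done straight from the processlist. A minimal sketch, assuming SQL access to labsdb1010 with sufficient privileges; the one-hour cutoff and the thread id are placeholders, not values from the log:

  -- List user queries that have been running for over an hour.
  SELECT id, user, time, LEFT(info, 80) AS query_prefix
  FROM information_schema.processlist
  WHERE command = 'Query'
    AND time > 3600                        -- arbitrary cutoff for this sketch
    AND user NOT IN ('root', 'system user')
  ORDER BY time DESC;

  -- Kill the statement (but keep the client connection) for an offending id.
  KILL QUERY 12345;                        -- placeholder thread id

Lowering the long query time would be the automated equivalent: whatever query-killer threshold applies on the replicas gets tightened for the duration of the maintenance, so the remaining host only has to serve the shorter queries.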