[05:42:49] DBA, Phabricator: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596 (Marostegui) Pre restart steps done
[06:06:11] DBA, Phabricator: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596 (Marostegui) This was done. RO time for phabricator was around: 06:01:53 ON 06:03:12 OFF Thanks @mmodell for the help! ` root@db1132.eqiad.wmnet[(none)]> select @@report_host; +---------------...
[06:06:13] DBA, Orchestrator: Add m* sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui)
[06:06:15] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[06:06:35] DBA, Phabricator: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596 (Marostegui) Open→Resolved a:Marostegui
[06:06:57] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[06:10:35] DBA, Orchestrator: Add m* sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui) m3 is now in orchestrator
[06:10:42] DBA, Orchestrator: Add m* sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui)
[06:36:07] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui) m5 cleaned
[06:38:32] DBA, wikitech.wikimedia.org, User-notice, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui) Procedure: Pre restart [] Silence m5 hosts [] buffer pool dump + disablement in advance to make the restart faster Restart [] `!log m5 ma...
[06:48:46] DBA, Platform Engineering Roadmap Decision Making, SRE, Performance-Team (Radar), User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Marostegui) Thanks @Krinkle - I will probably start first with s6 codfw (frwiki,jawiki,ruwiki), and using wikimediadebug to...
[07:16:41] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[07:16:52] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui) clouddb1019:3316 moved under db1155:3316
[07:34:03] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:12:09] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[08:13:50] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui) clouddb1019:3314 moved under db1155:3314 All the new clouddb hosts are moved under the new 10.4 sanitariums. This task is now stalled - waiting on...
[10:17:55] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:23:42] DBA, decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (Marostegui)
[10:24:09] DBA, decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (Marostegui) Not ready until monday
[10:24:23] DBA, decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (Marostegui)
[10:24:26] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:24:42] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:39:33] hey, kormat any strong thoughts about https://gerrit.wikimedia.org/r/c/operations/puppet/+/657820 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/657801 to vote keep or remove?
[10:40:16] re: 657820, nuke away
[10:42:00] re: 657801, seems useful
[10:43:59] so I don't feel also strongly about it either but 2 reasoning happened: if we go back to use lvm backups, we should implement them as part of wmfbackups, and we cannot really implement it without modifying our partitioning
[10:44:27] so more of a model than a refactoring issue
[10:44:47] I will wait anyway, comment on it with any suggestion on patch
[10:45:35] i'm by default in favour of removing stuff from puppet that we're not using. it's always available in the git history if we need to dig it up again
[10:45:49] yeah, that was manuel's thought too
[10:46:31] I think my, (not very strong) compass here was "how likely we are to use it again"
[10:51:50] (ah - i think you linked to the wrong CR above.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/657821 is the one for removing mylvmbackup. i've +1'd it)
[10:52:01] oh, sorry
[10:52:06] what did I link
[10:52:13] the backup grants
[10:52:22] oh, sorry
[10:52:26] indeed I meant the other
[11:00:55] I found another obsolete thing on the mariadb package, sending patch soon
[11:07:00] I think today is going to be cleanup day, first this, now the bacula one
[11:12:11] Data-Persistence-Backup, DC-Ops, SRE, Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (jcrespo) a:jcrespo
[11:58:46] Data-Persistence-Backup, decommission-hardware: decommission helium.eqiad.wmnet - https://phabricator.wikimedia.org/T273049 (jcrespo)
[11:59:16] Data-Persistence-Backup, decommission-hardware: decommission helium.eqiad.wmnet - https://phabricator.wikimedia.org/T273049 (jcrespo)
[11:59:18] Data-Persistence-Backup, DC-Ops, SRE, Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (jcrespo)
[12:00:14] Data-Persistence-Backup, decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (jcrespo)
[12:01:16] Data-Persistence-Backup, decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (jcrespo) @robh This is not yet ready for dc-ops processing, but do we need a separate checklist for the system and the attached array, or one is enough?
[12:01:32] Data-Persistence-Backup, decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (jcrespo)
[12:05:38] Data-Persistence-Backup, decommission-hardware: decommission heze and its attached array - https://phabricator.wikimedia.org/T273051 (jcrespo)
[12:06:09] Data-Persistence-Backup, decommission-hardware: decommission heze and its attached array - https://phabricator.wikimedia.org/T273051 (jcrespo)
[12:06:12] Data-Persistence-Backup, DC-Ops, SRE, Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (jcrespo)
[12:09:42] Data-Persistence-Backup, decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (jcrespo) a:jcrespo
[12:10:20] Data-Persistence-Backup, decommission-hardware: decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (jcrespo)
[12:46:11] DBA: Investigate using PMM (Percona Monitoring and Management) for slow-query analysis - https://phabricator.wikimedia.org/T273054 (Kormat)
[12:46:33] marostegui: so we don't forget about it ^
[12:48:37] DBA: Investigate using PMM (Percona Monitoring and Management) for slow-query analysis - https://phabricator.wikimedia.org/T273054 (Marostegui) p:Triage→Medium
[13:00:39] DBA: Investigate using PMM (Percona Monitoring and Management) for slow-query analysis - https://phabricator.wikimedia.org/T273054 (jcrespo) {icon thumbs-up color=green} For history, we enabled a similar solution to this (through grafana + prometheus_mysqld_exporter- not sure if PMM uses that for queries or...
[13:02:17] DBA, observability, Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (jcrespo)
[13:05:59] DBA, SRE, User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (jcrespo) This looks very related to T242571, but not merging because it is a topic very likely to evolve.
[13:08:04] kormat, is T256845 and T257822 the same or are they just similar?
[13:08:05] T256845: Add monitoring to ensure that puppet/tendril/zarcillo all agree on the set of sections that exist - https://phabricator.wikimedia.org/T256845
[13:08:05] T257822: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822
[13:09:02] and T257814 and T242571 seem also very similar
[13:09:05] T242571: Automatically populate tendril/zarcillo with the list of databases in the infrastructure - https://phabricator.wikimedia.org/T242571
[13:09:06] T257814: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814
[13:11:53] not sure if there is a tendril-related epic, but we could use one to track all issues
[13:13:42] DBA: Investigate using PMM (Percona Monitoring and Management) for slow-query analysis - https://phabricator.wikimedia.org/T273054 (jcrespo)
[13:13:46] DBA, SRE, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (jcrespo)
[13:15:13] DBA: Investigate using PMM (Percona Monitoring and Management) for slow-query analysis - https://phabricator.wikimedia.org/T273054 (jcrespo) Adding T143896 epic, even if one can argue that "query monitoring" is metrics or not, but to link it to an epic where this need was mentioned.
[13:18:56] DBA, SRE, observability, Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (jcrespo)
[13:20:20] DBA, SRE, User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (Kormat) Open→Resolved a:Kormat Resolving this as tendril is going away.
[13:20:26] DBA, SRE, Epic, User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (Kormat)
[13:23:52] feh. phab's approach for marking a task as depending on another leads to really confusing hierarchies.
[13:24:01] +100000
[13:24:10] it is both a dependency AND a subtask
[13:24:20] not clear at all
[13:24:43] DBA, SRE, Epic, User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (Kormat)
[13:25:01] I am trying to generate a few epics for organization
[13:25:17] T143896 for metrics monitoring related tasks
[13:25:17] jynus: can you not, please? :)
[13:25:19] T143896: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896
[13:25:21] ok
[13:25:34] not new ones
[13:25:36] existing ones
[13:25:50] i mean, for your own area, by all means, go with whatever suits you
[13:26:03] but i prefer to keep epics fairly constrained for my stuff
[13:26:10] ok
[13:26:16] e.g.
https://phabricator.wikimedia.org/T257814 is an epic, with a small number of subtasks
[13:27:40] I just pointed that T257821 and T257822 looked very similar
[13:27:40] T257821: Add monitoring to ensure consistency between puppet and zarcillo - https://phabricator.wikimedia.org/T257821
[13:27:40] T257822: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822
[13:28:23] those particular 2 were intended to be like that
[13:28:28] ok
[13:28:35] i filed them at the same time
[15:59:37] Data-Persistence-Backup, decommission-hardware, Patch-For-Review: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (RobH) >>! In T273049#6780019, @jcrespo wrote: > @robh This is not yet ready for dc-ops processing, but do we need a separate checklist for th...
[16:00:30] Data-Persistence-Backup, decommission-hardware, Patch-For-Review: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (RobH)
[16:00:45] Data-Persistence-Backup, decommission-hardware, ops-eqiad, Patch-For-Review: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (RobH)
[16:01:49] Data-Persistence-Backup, decommission-hardware, Patch-For-Review: decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (jcrespo)
[16:02:04] Data-Persistence-Backup, decommission-hardware, ops-codfw, Patch-For-Review: decommission heze and heze-array1 - https://phabricator.wikimedia.org/T273051 (jcrespo)
[18:18:55] Data-Persistence-Backup, SRE: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (jcrespo) Apparently, there is the following code on backup::set: ` $motd_content = "#!/bin/sh\necho \"Backed up on this host: ${name}\"" @motd::scrip...
[21:58:11] DBA, MediaWiki-Cache: insert ignore into objectcache ignores stuff bigger than mediumblob - https://phabricator.wikimedia.org/T273117 (Physikerwelt)