[06:16:02] 10Blocked-on-schema-change: Schema change for renaming two indexes of site_identifiers - https://phabricator.wikimedia.org/T273361 (10Marostegui) Running schema change on s3 - will take around 15h
[06:51:56] 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10Marostegui)
[09:10:03] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[09:24:50] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui)
[09:58:32] 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1078.eqiad.wmnet` - db1078.eqiad.wmnet (**PASS**) - Downtimed host on I...
[09:58:46] 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui) a:05Marostegui→03wiki_willy
[09:58:58] 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui) @wiki_willy this is ready for #dc-ops
[10:00:07] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[10:22:21] 10DBA, 10Orchestrator: Investigate a way to make the anonymized version of Orchestrator open to replace dbtree - https://phabricator.wikimedia.org/T273863 (10Marostegui)
[10:22:32] 10DBA, 10Orchestrator: Investigate a way to make the anonymized version of Orchestrator open to replace dbtree - https://phabricator.wikimedia.org/T273863 (10Marostegui) p:05Triage→03Medium
[11:08:49] 10DBA, 10Orchestrator, 10User-Kormat: Enable communication between orchestrator and clouddb hosts - https://phabricator.wikimedia.org/T273606 (10Marostegui)
[12:22:09] I found the source of the connection problems, and why they aren't logged in the error log
[12:22:22] which connection problem?
[12:22:39] the small rate of connection errors we get on prometheus on some servers
[12:22:48] ah, what is it?
[12:23:00] it is one of 2 options, both related to missing grants
[12:23:14] they are not logged because it only logs connection failures
[12:23:25] but not permission errors once logged in
[12:23:47] either lack of pt-heartbeat table grants (or complete lack of the table/db)
[12:23:57] or lack of the ops db (and its events/table)
[12:24:22] e.g. maybe on recovery, one adds the grants to heartbeat
[12:24:32] but if the table is missing, the grant will fail
[12:24:55] and that is why I think it may be happening on es* hosts - some of the read-only ones will lack a heartbeat table
[12:25:13] and some scripts will assume it is there (checks/monitoring)
[12:25:26] could be, so a dummy heartbeat table might work
[12:25:31] yeah
[12:25:36] and its grants
[12:25:48] see how it got solved here: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&from=1612437941285&to=1612441541285&var-server=db1171&var-port=13313
[12:26:04] after I added the heartbeat and ops dbs I had missed
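A minimal sketch of the fix described above - creating the (possibly dummy) heartbeat table first and only then adding grants - assuming the standard pt-heartbeat table layout; the account name and host pattern below are placeholders for illustration, not the real monitoring grantee:

    -- Create the schema and the pt-heartbeat table before granting on it,
    -- so checks/monitoring that expect it do not hit permission or
    -- missing-table errors that never reach the MySQL error log.
    CREATE DATABASE IF NOT EXISTS heartbeat;
    CREATE TABLE IF NOT EXISTS heartbeat.heartbeat (
      ts                    varchar(26)     NOT NULL,
      server_id             int unsigned    NOT NULL PRIMARY KEY,
      file                  varchar(255)    DEFAULT NULL,
      position              bigint unsigned DEFAULT NULL,
      relay_master_log_file varchar(255)    DEFAULT NULL,
      exec_master_log_pos   bigint unsigned DEFAULT NULL
    );
    -- Placeholder account; the real user/host pattern is managed elsewhere
    -- (puppet), this line only illustrates the missing-grant side.
    GRANT SELECT ON heartbeat.* TO 'monitoring_user'@'10.%';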
[12:26:58] are you working with db1171 at the moment?
[12:26:59] ah yes
[12:27:02] it is the replacement of db1095
[12:27:10] it showed up on icinga as lag and I got scared
[12:27:16] I am going to downtime it for 24h, is that ok?
[12:27:22] I just acked it
[12:27:25] ah ok
[12:27:25] downtime expired
[12:27:31] yep, no problem
[12:27:36] do you want me to downtime it for 24h?
[12:27:47] no need, I acked it until it gets fixed
[12:27:59] cool
[12:28:15] so regarding the connection problems, maybe it is heartbeat indeed. I wouldn't expect things to use the ops database
[12:28:32] well, not outside things, but events can cause permission problems
[12:28:41] but yeah, more likely to be heartbeat
[12:29:06] it is either icinga/prometheus or the events, given it happened every 10 seconds or so
[12:29:13] if you want to create a task for it, I can double check it tomorrow or next week
[12:29:40] the thing is, icinga is smart and if heartbeat is unavailable it uses replication status
[12:29:51] so that may obscure it
[12:30:17] I think the metrics being on the graph will make things more apparent, and knowing the causes will make it easy to fix
[12:30:29] I don't think it is a huge issue ATM
[12:31:52] although now that I see: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1&from=1612420286309&to=1612441886309&var-site=eqiad&var-group=core&var-shard=All&var-role=All maybe it is more generalized?
[12:32:49] I think for this point a ticket would be worth it: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1&from=1611849976206&to=1611888509170&var-site=eqiad&var-group=core&var-shard=All&var-role=All
[12:33:14] probably it is something like clouddb or something multi-instance recently set up
[12:33:31] but not everything was set up at the same time
[12:33:40] labsdb?
[12:34:01] not too worried, and now we have the tools (extra metric) to track it
[12:34:35] although I think the puppet change from brooke for the pt-heartbeat on clouddb hosts was merged around that time and maybe it was run manually
[12:36:01] marostegui, you seem to be right: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&var-server=clouddb1013&var-port=13311&from=1611850218345&to=1611886058345
[12:36:25] so I think it is ok, the metric I added was useful for debugging :-)
[12:36:34] I think that it might be the query-killer
[12:36:46] ah, true
[12:37:08] let me stop it on clouddb1013:3311 and let's see
[12:37:13] wait, then why is it classified on core hosts?
[12:37:31] mm, maybe clouddb hosts are set as "core" on zarcillo?
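One way that classification could be checked in zarcillo, sketched here with a hypothetical table and column names (the actual zarcillo schema may differ); the row pasted a few messages below is what such a query returned:

    -- Hypothetical query against the zarcillo metadata database; the
    -- "instances" table and its "group" column are assumptions inferred
    -- from the row shown below, not the verified schema.
    SELECT name, server, port, `group`
    FROM instances
    WHERE name LIKE 'clouddb%';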
[12:37:43] it shouldn't
[12:38:07] I have stopped the query killer on s1 clouddb1013
[12:38:09] we'll see
[12:38:09] but yeah, something to check, we can create a ticket with all of this
[12:38:24] not super-urgent, but to track it
[12:38:34] | clouddb1013:3311 | clouddb1013.eqiad.wmnet | 3311 | NULL | NULL | core |
[12:38:38] they are in core
[12:38:41] ah
[12:38:47] Let me fix that
[12:38:52] they should be on labsdb or whatever it is named
[12:41:11] https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&var-server=clouddb1013&var-port=13311&from=now-5m&to=now
[12:44:11] so it doesn't seem to be the query killer
[12:44:50] I have updated clouddb hosts to be part of labsdb in zarcillo btw
[12:45:04] I think that is more important^
[12:45:21] let me force-run the prometheus targets update
[12:46:33] looks good now
[12:47:24] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=9&orgId=1&var-site=eqiad&var-group=labs&var-shard=All&var-role=All&from=1612439241857&to=1612446441858
[12:47:36] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Technical-Debt, and 2 others: Make wb_changes_dispatch.chd_seen unsigned in production - https://phabricator.wikimedia.org/T273874 (10Lucas_Werkmeister_WMDE)
[12:48:34] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Technical-Debt, and 2 others: Make wb_changes_dispatch.chd_seen unsigned in production - https://phabricator.wikimedia.org/T273874 (10Lucas_Werkmeister_WMDE)
[12:49:02] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=clouddb1014&var-port=13312 that message is weird
[12:49:06] the one about the QPS
[12:49:28] indeed
[12:49:34] oh, I can see why
[12:49:56] for some period of time, there could be two clouddb1014s, the "old" and the "new"
[12:50:29] it should go away soon
[12:50:51] see how it changed colors on all graphs
[12:51:42] yeah
[12:51:46] 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Technical-Debt, and 2 others: Make wb_changes_dispatch.chd_seen unsigned in production - https://phabricator.wikimedia.org/T273874 (10Marostegui) p:05Triage→03Medium
[12:52:02] QPS now works fine
[12:52:51] is core aggregated fine now?
[12:53:13] yeah, seems so, almost 0 errors
[12:53:31] so I propose something - I need to do some changes on the read-only es1 grants for backups
[12:53:37] so I can research those
[12:53:50] and you can check/talk to cloud about clouddb, as you were helping them with those?
[12:54:47] at your own pace - I don't think this is high priority ofc
[12:55:29] and we can document my research of potential causes somewhere (heartbeat/ops)
[12:56:13] as another todo, I can add, if you think it is useful, clouddb to the Top % memory usage dashboard exclusion list
[12:57:23] I will also double check that Errors on the aggregated panel is right, there could be some bugs (I only checked the single-instance one)
[12:59:09] I am happy: new metrics => better debugging :-D
[13:00:08] yeah, I can talk to them but definitely not something that will happen soon, I have many things with higher priority. If you document what you found I can use that for clouddb and see if it might be the same thing
[13:00:14] of course!
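As part of documenting the potential causes mentioned above, a simple per-host check along these lines (a sketch using only information_schema; the object names are the ones discussed in the conversation) would show whether the heartbeat table, the ops database and its events actually exist before any grants are added:

    -- Does the pt-heartbeat table exist on this instance?
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_schema = 'heartbeat' AND table_name = 'heartbeat';

    -- Does the ops database exist, and which events does it define?
    SELECT schema_name
    FROM information_schema.schemata
    WHERE schema_name = 'ops';

    SELECT event_schema, event_name, status
    FROM information_schema.events
    WHERE event_schema = 'ops';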
[13:00:31] I will try to document what I find
[13:00:37] sounds good cheers
[13:16:46] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1157.eqiad.wmnet'] ` The log ca...
[13:38:03] 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1157.eqiad.wmnet'] ` and were **ALL** successful.
[13:47:19] marostegui: are you working on m1? i'm seeing a bunch of lag: https://orchestrator.wikimedia.org/web/cluster/alias/m1
[13:48:20] kormat: nope
[13:48:23] let's see
[13:48:40] there's a massive delete from bacula
[13:48:57] it's been running for 5 minutes
[13:49:24] https://phabricator.wikimedia.org/P14206
[13:56:42] kormat: so the top bar of the host box goes orange, that's how we know. And admittedly, it is a slightly different shade of orange :)
[13:57:07] sobanski: haha, right :)
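For reference, a long-running statement like the bacula delete above can be spotted with a generic processlist query such as the following (the paste linked in the conversation shows the actual output that was shared; the 300-second threshold here is an arbitrary illustration):

    -- List long-running, non-idle statements, longest first.
    SELECT id, user, db, time, LEFT(info, 120) AS query
    FROM information_schema.processlist
    WHERE command <> 'Sleep'
      AND time > 300
    ORDER BY time DESC;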