[06:16:02] <wikibugs>	 10Blocked-on-schema-change: Schema change for renaming two indexes of site_identifiers - https://phabricator.wikimedia.org/T273361 (10Marostegui) Running schema change on s3 - will take around 15h
[06:51:56] <wikibugs>	 10DBA, 10Cognate, 10ContentTranslation, 10Growth-Team, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10Marostegui)
[09:10:03] <wikibugs>	 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[09:24:50] <wikibugs>	 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui)
[09:58:32] <wikibugs>	 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1078.eqiad.wmnet` - db1078.eqiad.wmnet (**PASS**)   - Downtimed host on I...
[09:58:46] <wikibugs>	 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui) a:05Marostegui→03wiki_willy
[09:58:58] <wikibugs>	 10DBA, 10DC-Ops, 10decommission-hardware, 10ops-eqiad: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui) @wiki_willy this is ready for #dc-ops
[10:00:07] <wikibugs>	 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[10:22:21] <wikibugs>	 10DBA, 10Orchestrator: Investigate a way to make the anonymized version of Orchestrator open to replace dbtree - https://phabricator.wikimedia.org/T273863 (10Marostegui)
[10:22:32] <wikibugs>	 10DBA, 10Orchestrator: Investigate a way to make the anonymized version of Orchestrator open to replace dbtree - https://phabricator.wikimedia.org/T273863 (10Marostegui) p:05Triage→03Medium
[11:08:49] <wikibugs>	 10DBA, 10Orchestrator, 10User-Kormat: Enable communication between orchestrator and clouddb hosts - https://phabricator.wikimedia.org/T273606 (10Marostegui)
[12:22:09] <jynus>	 I found the source of connection problems, and why they aren't logged on error log
[12:22:22] <marostegui>	 which connection problem?
[12:22:39] <jynus>	 the small rate of connection errors we get on prometheus on some servers
[12:22:48] <marostegui>	 ah, what is it?
[12:23:00] <jynus>	 it is 2 options, all related to missing grants
[12:23:14] <jynus>	 they are not logged because it only logs connection failures
[12:23:25] <jynus>	 but not permission errors once logged in
[12:23:47] <jynus>	 either lack of pt-heartbeat table grants (or complete lack of the table/db
[12:23:57] <jynus>	 or lack of ops db (and its events/table)
[12:24:22] <jynus>	 e.g. maybe on recovery, one adds the grants to heartbeat
[12:24:32] <jynus>	 but if the table is missing, the grant will fail
[12:24:55] <jynus>	 and that is why I think it may be happening on es* hosts- some of the read only ones will lack a heartbeat table
[12:25:13] <jynus>	 and some scripts will assume it is there (checks/monitoring)
[12:25:26] <marostegui>	 could be, so a dummy heartbeat table might work
[12:25:31] <jynus>	 yeah
[12:25:36] <jynus>	 and its grants
[12:25:48] <jynus>	 see how it got solved here: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&from=1612437941285&to=1612441541285&var-server=db1171&var-port=13313
[12:26:04] <jynus>	 after I added heartbeat and ops dbs I missed
[12:26:58] <marostegui>	 are you working with db1171 at the moment?
[12:26:59] <marostegui>	 ah yes
[12:27:02] <marostegui>	 it is the replacement of db1095
[12:27:10] <marostegui>	 it showed up on icinga as lag and I got scared
[12:27:16] <marostegui>	 I am going to downtime it for 24h, is that ok?
[12:27:22] <jynus>	 I just acked it
[12:27:25] <marostegui>	 ah ok
[12:27:25] <jynus>	 downtime expired
[12:27:31] <marostegui>	 yep, no problem
[12:27:36] <marostegui>	 do you want me to downtime it 24h?
[12:27:47] <jynus>	 no need, I acked it until it gets fixed
[12:27:59] <marostegui>	 cool
[12:28:15] <marostegui>	 so regarding the connection problems, maybe it is heartbeat indeed. I wouldn't expect things to use ops database
[12:28:32] <jynus>	 well, not outside things, but events can cause permission problems
[12:28:41] <jynus>	 but yeah, more likely to be heartbeat
[12:29:06] <jynus>	 it is either icinga/prometheus or the events, given it happened every 10 seconds or so
[12:29:13] <marostegui>	 if you want to create a task for it, I can double check it tomorrow or next week
[12:29:40] <jynus>	 the thing is, icinga is smart and if heartbeat is unavailable it uses replication status
[12:29:51] <jynus>	 so that may obscure it
[12:30:17] <jynus>	 I think the metrics being on the graph will make things more apparent, and knowing the causes will be easy to fix
[12:30:29] <jynus>	 I don't think is a huge issue ATM
[12:31:52] <jynus>	 although now that I see: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1&from=1612420286309&to=1612441886309&var-site=eqiad&var-group=core&var-shard=All&var-role=All maybe it is more generalized?
[12:32:49] <jynus>	 I think for this point a ticket would be worth it: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1&from=1611849976206&to=1611888509170&var-site=eqiad&var-group=core&var-shard=All&var-role=All
[12:33:14] <jynus>	 probably it is something like clouddb or something multiinstance recently setup
[12:33:31] <marostegui>	 but not everything was set up at the same time
[12:33:40] <jynus>	 labsdb?
[12:34:01] <jynus>	 not too worried, and now we have the tools (extra metric) to track it
[12:34:35] <marostegui>	 although I think the puppet change from brooke for the pt-hearbeat on cloudb hosts was merged around that time and maybe it was run manually
[12:36:01] <jynus>	 marostegui, you seem to be right: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&var-server=clouddb1013&var-port=13311&from=1611850218345&to=1611886058345
[12:36:25] <jynus>	 so I think it is ok, the metric I added was useful for debugging :-)
[12:36:34] <marostegui>	 I think that it might be the query-killer
[12:36:46] <jynus>	 ah, true
[12:37:08] <marostegui>	 let me stop it on clouddb1013:3311 and let's see
[12:37:13] <jynus>	 wait, then why is it clasiffied on core hosts?
[12:37:31] <marostegui>	 mm, maybe clouddb hosts are as "core" on zarcillo?
[12:37:43] <jynus>	 it shouldn't
[12:38:07] <marostegui>	 I have stopped query killer on s1 clouddb1013
[12:38:09] <marostegui>	 we'll see
[12:38:09] <jynus>	 but yeah, something to check, we can create a ticket with all of this
[12:38:24] <jynus>	 not super-urgent, but to track it
[12:38:34] <marostegui>	 | clouddb1013:3311    | clouddb1013.eqiad.wmnet           | 3311 | NULL    | NULL                | core        |
[12:38:38] <marostegui>	 they are in core
[12:38:41] <jynus>	 ah
[12:38:47] <marostegui>	 Let me fix that
[12:38:52] <jynus>	 they should be on labsdb or whatever is named
[12:41:11] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=10&orgId=1&var-server=clouddb1013&var-port=13311&from=now-5m&to=now
[12:44:11] <marostegui>	 so it doesn't seem to be the query killer
[12:44:50] <marostegui>	 I have updated clouddb hosts to be part of labsdb in zarcillo btw
[12:45:04] <jynus>	 I think that is more important^
[12:45:21] <jynus>	 let me force-run prometheus targets update
[12:46:33] <jynus>	 looks good now
[12:47:24] <jynus>	 https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=9&orgId=1&var-site=eqiad&var-group=labs&var-shard=All&var-role=All&from=1612439241857&to=1612446441858
[12:47:36] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Technical-Debt, and 2 others: Make wb_changes_dispatch.chd_seen unsigned in production - https://phabricator.wikimedia.org/T273874 (10Lucas_Werkmeister_WMDE)
[12:48:34] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Technical-Debt, and 2 others: Make wb_changes_dispatch.chd_seen unsigned in production - https://phabricator.wikimedia.org/T273874 (10Lucas_Werkmeister_WMDE)
[12:49:02] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=clouddb1014&var-port=13312 that message is weird
[12:49:06] <marostegui>	 the one about the QPS
[12:49:28] <jynus>	 indeed
[12:49:34] <jynus>	 oh, I can see why
[12:49:56] <jynus>	 for some period of time, there could be to clouddbs1014, the "old" and the "new"
[12:50:08] <jynus>	 *two
[12:50:29] <jynus>	 should go away soon
[12:50:51] <jynus>	 see how it changed colors on all graphs
[12:51:42] <marostegui>	 yeah
[12:51:46] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Wikidata, 10Technical-Debt, and 2 others: Make wb_changes_dispatch.chd_seen unsigned in production - https://phabricator.wikimedia.org/T273874 (10Marostegui) p:05Triage→03Medium
[12:52:02] <marostegui>	 QPS now works fine
[12:52:51] <jynus>	 is core aggregated fine now?
[12:53:13] <jynus>	 yeah, seems so, almost 0 errors
[12:53:31] <jynus>	 so I propose you something- I need to do some changes on read only es1 grants for backups
[12:53:37] <jynus>	 so I can research those
[12:53:50] <jynus>	 and you can check/talk to cloud clouddb, as you were helping them with those?
[12:53:56] <jynus>	 *about
[12:54:47] <jynus>	 at your own pace- I don't thing this is high priority ofc
[12:55:03] <jynus>	 *think
[12:55:29] <jynus>	 and we can document my research of potential causes somewhere (heartbeat/ops)
[12:56:13] <jynus>	 as another todo, I can add, if you think is useful, clouddb to the Top % memory usage dashboard exclusion list
[12:57:23] <jynus>	 I will also double check Errors on the aggregated panel is right, there could be some bugs (I only checked the single-instance one)
[12:59:09] <jynus>	 I am happy new metrics => better debugging :-D
[13:00:08] <marostegui>	 yeah, I can talk to them but definitely not something that will happen soon, I have many things with higher priority. If you document what you found I can use that for clouddb and see if it might be the same thing
[13:00:14] <jynus>	 of course!
[13:00:31] <jynus>	 I will try to document what I find
[13:00:37] <marostegui>	 sounds good cheers
[13:16:46] <wikibugs>	 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1157.eqiad.wmnet'] ` The log ca...
[13:38:03] <wikibugs>	 10DBA, 10SRE, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1157.eqiad.wmnet'] `  and were **ALL** successful.
[13:47:19] <kormat>	 marostegui: are you working on m1? i'm seeing a bunch of lag: https://orchestrator.wikimedia.org/web/cluster/alias/m1
[13:48:20] <marostegui>	 kormat: nope
[13:48:23] <marostegui>	 let's see
[13:48:40] <marostegui>	 there's a massive delete from bacula
[13:48:57] <marostegui>	 it's been running for 5 minutes
[13:49:24] <marostegui>	 https://phabricator.wikimedia.org/P14206
[13:56:42] <sobanski>	 kormat: so the top bar of the host box goes orange, that's how we know. And admittedly, it is a slightly different shade of orange :)
[13:57:07] <kormat>	 sobanski: haha, right :)