[06:23:00] DBA, decommission-hardware, Patch-For-Review: decommission es1014.eqiad.wmnet - https://phabricator.wikimedia.org/T268102 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `es1014.eqiad.wmnet` - es1014.eqiad.wmnet (**FAIL**) - Downtimed host on Icinga...
[06:53:52] Blocked-on-schema-change, DBA: Drop default of protected_titles.pt_expiry - https://phabricator.wikimedia.org/T267335 (Marostegui)
[06:53:59] Blocked-on-schema-change, DBA: Drop default of ip_changes.ipc_rev_timestamp - https://phabricator.wikimedia.org/T267399 (Marostegui)
[07:38:09] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) >>! In T267090#6629470, @Marostegui wrote: > Attempting the same on s6 with: > > Running a check on s6 tables on db1125 > > clouddb1015:3316 `innodb_change_buffe...
[07:40:23] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[09:18:01] tendril acting weirdly again - checking
[09:39:52] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) >>! In T267090#6629364, @Marostegui wrote: > This is very bad news. > clouddb1013:3311 (s1) and clouddb1017:3311 (s1) crashed at the same time with the same error...
[09:50:29] DBA, Wikimedia-General-or-Unknown, Security: Move private wikis to a dedicated cluster - https://phabricator.wikimedia.org/T101915 (Aklapper) Thanks for the explanation! Does that mean this task should be declined? Or be open with lowest priority? (Asking as tasks shouldn't have "stalled" status for...
[09:52:17] DBA, Wikimedia-General-or-Unknown, Security: Move private wikis to a dedicated cluster - https://phabricator.wikimedia.org/T101915 (Marostegui) I would go for the decline, I don't think we have such an amount of private wikis that could justify having more hardware just for them (with the same produc...
[10:27:01] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) >>! In T267090#6632810, @Marostegui wrote: >>>! In T267090#6629470, @Marostegui wrote: >> Attempting the same on s6 with: >> >> Running a check on s6 tables on db...
[10:27:19] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[10:28:16] I am going to restart tendril
[10:29:59] +1
[10:31:16] so how far away do you think a read-only, internal-only, preliminary setup of orchestrator is? 0:-)
[10:32:30] I think it will be read-only for a long time, but we'll start using it maybe by the end of the next Q or so
[10:32:38] It depends on how many things we find along the way
[10:32:40] cool
[10:32:49] yeah, I can guess
[10:32:55] It also depends on how fast we can enable report_host
[10:33:04] We have decided to make it a hard blocker for adding new topologies
[10:33:25] So in order to add a topology, it has to have report_host enabled on 100% of its hosts
[10:33:56] wait, but if it will only affect replicas, one may be able to add them without the master?
[10:34:05] as in, without restarting masters?
[10:34:19] for read only, of course
[10:34:41] No, the whole way orchestrator handles DNS, FQDNs and whatnot is a pain, so we want it to be enabled everywhere, including masters
[10:35:02] We have to replace masters anyway during the next Q, for refresh
[10:35:18] Tendril is back
[10:35:20] report_host is responsible for providing the host name, right?
[10:35:33] So that we don't have to depend on DNS
[10:35:48] It is responsible for reporting the right host once we discover the hosts via SHOW SLAVE HOSTS, as in: to show the replication topology
[10:36:09] Thanks
[10:36:23] There are a few ways for orchestrator to discover and treat hostnames
[10:36:57] But we need it to use FQDNs, otherwise, when configuring replication, it will fail: it will make the slave connect to dbXXXX rather than dbXXXX.eqiad.wmnet and hence it won't start replication
[10:37:36] marostegui: sorry to insist, of course we need to enable it everywhere
[10:37:47] And then Stevie looked at the internals to see how to deal with DNS and had so much fun with it
[10:38:01] not against it, but my question is: wouldn't it work for masters not connecting somewhere?
[10:38:21] or does it check the variable, not only "show slave hosts"?
[10:38:36] jynus: Yes, I guess it would for the reporting slaves, but we don't want to start like that
[10:38:50] ok, I understand now
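The exchange above hinges on what report_host actually does, so a minimal sketch may help: it is a startup-only variable (changing it requires a mysqld restart, hence the question about restarting masters), and once set, SHOW SLAVE HOSTS on the master reports each replica under its configured name instead of depending on DNS resolution of the connection. The host names below are made up for illustration:

```
# In each instance's my.cnf (takes effect only after a mysqld restart):
#   report_host = db1234.eqiad.wmnet

# Confirm the variable on a replica (hypothetical host):
mysql -h db1234.eqiad.wmnet -e "SELECT @@report_host;"

# On the master, list replicas the way a discovery tool would (hypothetical host):
mysql -h db1100.eqiad.wmnet -e "SHOW SLAVE HOSTS;"
```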
[10:58:58] is the db1115 maintenance over? generate-mysqld-exporter-config.service on the prometheus* hosts failed with a traceback since it couldn't connect to db1115
[11:13:42] it was rebooted
[11:14:06] reset-failed it, moritzm, or I can do it; it should work now
[11:15:06] is it prometheus2003?
[11:16:58] earlier it was all four of them, but maybe there's a retry or so going on
[11:17:10] well, it is a cron
[11:17:10] right now it seems only 2003 is left, in fact
[11:17:15] ok
[11:17:54] we should only worry if it fails all the time - and it doesn't affect monitoring at all, just the changes to monitoring hosts
[11:18:18] plus see the SAL log for why it failed (tendril restart)
[11:18:28] DBA, Orchestrator: Investigate hostname/fqdn handling in orchestrator - https://phabricator.wikimedia.org/T267929 (Kormat)
[11:19:47] many of those systemd timers may need a review of their alerting - I would like to know ASAP when they fail, but a single failure is a non-issue that general ops should not worry about
[11:19:56] jynus: we need @@report_host regardless of where in a topology an instance is. without it orchestrator will use the bare hostname for the node. https://phabricator.wikimedia.org/T267929#6626839 has some details.
[11:20:43] thanks, kormat
[11:26:09] any suggestions about the systemd timer alerting management I mentioned?
[11:27:18] we may have to live with it until we have something better than icinga, I will mention it to the observability team
[11:27:21] i really have no idea about systemd timers
[11:27:35] well, it is not about that, it is mostly about failing crons now alerting
[11:27:44] when they didn't before
[11:27:48] which is both a win
[11:27:55] and a potential source of alert spam
[11:28:14] typically you deal with this by saying: don't alert until there hasn't been a successful run in X amount of time
[11:28:27] which doesn't really seem like an icinga approach, indeed
[11:28:32] yeah
[11:28:58] we could maybe modify the systemd checker to allow extra time, but that may not work with mixed errors
[11:29:16] e.g. if systemd sees mysql down, we want to alert ASAP
[11:29:36] but if "a backup failed", we already have monitoring on backup freshness, so we don't care too much
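For context on the failed-unit handling discussed here, a sketch of the commands an operator might run (the service name comes from the log; the matching timer name is an assumption):

```
# See when the timer last ran and when it fires next:
systemctl list-timers 'generate-mysqld-exporter-config*'

# Find out why the service failed:
journalctl -u generate-mysqld-exporter-config.service -n 50

# Clear the failed state (and the alert that keys off it) once the cause is gone:
sudo systemctl reset-failed generate-mysqld-exporter-config.service
```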
[12:00:04] DBA, decommission-hardware, Patch-For-Review: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `es1011.eqiad.wmnet` - es1011.eqiad.wmnet (**FAIL**) - Downtimed host on Icinga...
[12:01:08] DBA, decommission-hardware, Patch-For-Review: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (Marostegui)
[12:20:29] * sobanski stepping away to do some grocery shopping
[12:22:39] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) So, enwiki transfer from db1124:3311 to clouddb1013 and clouddb1017 finished. And as soon as I started mysql on them, they returned errors. So I am going to go the...
[12:45:39] DBA, Orchestrator: Orchestrator doesn't use FQDN when manipulating replicas - https://phabricator.wikimedia.org/T267389 (Kormat) Open→Resolved a: Kormat Tested a simple case in pontoon, works correctly ` kormat@mariadb104-test1:~$ sudo orchestrator -c relocate -i slave1 -d zarcillo1 2020-11-1...
[12:46:28] DBA, Orchestrator: Orchestrator doesn't use FQDN when manipulating replicas - https://phabricator.wikimedia.org/T267389 (Marostegui) \o/
[13:58:41] marostegui: for P13333 I've a couple of questions when you have a sec
[13:58:52] sure
[13:59:07] the last comment is the second time you ran the cookbook for es1014?
[13:59:47] volans: no, it is for a new host, es1011
[14:00:07] ok
[14:00:42] let me check a couple of things
[14:00:53] marostegui: and did you run it again for es1014?
[14:01:02] yes, it went fine!
[14:02:01] ahhh okok
[14:02:05] this is good
[14:06:27] Blocked-on-schema-change, DBA: Drop default of protected_titles.pt_expiry - https://phabricator.wikimedia.org/T267335 (Marostegui)
[14:06:34] Blocked-on-schema-change, DBA: Drop default of ip_changes.ipc_rev_timestamp - https://phabricator.wikimedia.org/T267399 (Marostegui)
[14:11:05] marostegui: can I re-run it on es1011? I'm having a hard time understanding why it failed, as the stacktrace is exactly the same as the issue I fixed yesterday, and it must be something else, at least slightly
[14:11:15] yeah, go for it
[14:11:21] volans: the host is powered off though
[14:11:26] like, the rest of the stuff went well
[14:11:56] volans: I still have another host to decommission I was planning on doing tomorrow
[14:12:55] ok
[14:43:19] marostegui: ok if I remove the records for es1011?
[14:43:29] the mgmt
[14:43:29] elukey: yep
[14:43:52] elukey: that's my fault, I tried to manually update the status from the API to see if the error could be reproduced
[14:43:55] and I couldn't
[14:43:57] DBA, decommission-hardware, Patch-For-Review: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `es1011.eqiad.wmnet` - es1011.eqiad.wmnet (**FAIL**) - Downtimed host on Icinga -...
[14:44:03] buuuuu
[14:44:07] :D
[14:44:07] I'm running the decom cookbook now but you beat me to it
[14:44:10] for the dns change
[14:44:14] sorryyy
[14:44:33] Two decoms are better than one?
[14:44:42] yes, being idempotent :D
[14:45:15] Two decoms are exactly the same as one, then, to be correct
[14:45:33] not if the first one fails as before :)
[14:46:04] the strange part is that I can't repro what failed in manuel's case, and that exact same error was fixed 2 days ago by a patch
[14:46:09] so I'm a bit puzzled at the moment
[14:46:19] volans: Any chance I was using something older?
[14:47:01] the stacktrace reports the correct line so I doubt it
[14:47:11] but luckily elukey has a bunch of hosts to decom
[14:47:19] so we'll know soon enough if it's a race or what
[14:47:50] Sure, and if you want to try tomorrow, we can decom the other one together
[14:51:13] sure, but no need to wait for me, you can do it anytime, I have all the logs on both sides
[14:51:16] fwiw
[14:51:28] volans: cool, can I then move the es1011 task to dcops to proceed?
[14:51:46] yep, all done on our side, decom re-run and the missing step completed
[14:51:50] from the previous run
[14:51:51] cool!
[14:52:01] sorry for the trouble
[14:52:46] not at all!
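The "being idempotent" quip above refers to the decommission cookbook being safe to re-run: already-completed steps become no-ops and only the missing ones are executed. A hedged sketch of the invocation from a cumin host; the exact flags are an assumption inferred from the bot messages rather than confirmed syntax:

```
# Re-running the same decommission is safe; it resumes where the failed run stopped.
sudo cookbook sre.hosts.decommission -t T268100 es1011.eqiad.wmnet
```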
[17:06:05] PROBLEM - MariaDB sustained replica lag on db1098 is CRITICAL: 21 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1098&var-port=13317
[17:07:09] RECOVERY - MariaDB sustained replica lag on db1098 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1098&var-port=13317
[20:21:14] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (Jclark-ctr) @jcrespo Replaced DIMMs per HP.
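The sustained replica lag that the alert above measures can also be spot-checked by hand on the replica. A minimal sketch, assuming the Prometheus port 13317 in the dashboard URL maps to a mysqld instance on port 3317 with a per-section socket (the socket path is an assumption):

```
# Seconds_Behind_Master is 0 when caught up and NULL if replication is broken.
sudo mysql -S /run/mysqld/mysqld.s7.sock -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'
```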