[00:08:34] FIRING: DiskSpace: Disk space thanos-be2006:9100:/ 1.746% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=thanos-be2006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:18:34] RESOLVED: DiskSpace: Disk space thanos-be2006:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=thanos-be2006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:19:00] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:00] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:51] Amir1, marostegui: when you have a second can we address https://phabricator.wikimedia.org/T422459#11832656 ?
[08:04:07] I'll check in a bit
[09:07:56] federico3: I don't see how that paste is related to that task.
[09:10:32] there was also a warning during the maintain-views run that then disappeared minutes ago and I was wondering if the issue during the auto schema run can then impact maintain-views - anyhow I can move the auto schema glitch to the auto schema task
[09:12:02] https://phabricator.wikimedia.org/T419635#11832765
[09:29:58] is it ok if I put section and role in each line in https://phabricator.wikimedia.org/T419961 ?
[09:56:17] federico3: works for me, but I think those are generated with a script Amir1 has somewhere so it may get deleted on the next run?
[09:58:17] I can try
[09:58:33] sure!
[10:37:27] I've created T423690 with some notes about the thanos disk-filling. I fear the answer is partman-fettling and reimage
[10:37:28] T423690: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690
[11:25:34] Amir1: can I start the schema change on s7 today?
[11:27:21] on a Friday?
[11:34:46] in the past we've been running schema changes
[12:41:33] federico3: yeah. I'm done on s7
[12:41:49] thanks
[12:57:49] Amir1: marostegui: do you have an estimated timeline for the s4/x4 split?
[12:58:14] Amir1: ^
[12:59:15] for the replicas we were planning to have x4 on clouddb102[45] and keep s4 on the current ones clouddb101[59]... but given 1019 is dead we're re-evaluating the plan
[12:59:23] T409557
[12:59:23] T409557: Productionize new clouddb* hosts (clouddb1022-1033) - https://phabricator.wikimedia.org/T409557
[13:01:14] Hi folks - gentle reminder to put any essential work updates for this week on our shared doc by 15:00 UTC (about 2 hours hence) please :) I know there's a bunch there already.
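(Aside on the clouddb replica question at 12:59, which continues below: a minimal sketch, using stock MariaDB commands rather than any WMF tooling, of how one might confirm which section master a clouddb replica is actually following before and after a move.)

    -- A minimal sketch, assuming a standard MariaDB replica; on MariaDB 10.5+
    -- "SHOW REPLICA STATUS" is an alias for the same statement.
    SHOW SLAVE STATUS\G
    -- Fields to compare against the intended layout:
    --   Master_Host            -- should point at the s4 (or, post-split, x4) source
    --   Seconds_Behind_Master  -- replication lag
    --   Replicate_Do_DB / Replicate_Wild_Do_Table  -- any section filtering in effect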
[13:02:03] dhinus: definitely before end of Q4
[13:02:34] Amir1: ack, so I think that will be _before_ we get the other clouddb hosts that we're waiting for
[13:02:50] But not sure exactly when
[13:04:49] marostegui: IIUC we need two copies of s4 in clouddbs, so that after the split one remains s4, and the other one can become x4
[13:05:18] dhinus: yes, but that's for later
[13:05:30] 1024 and 1025 were originally planned for x4
[13:05:45] But they are now s4 because x4 isn't a reality in puppet
[13:06:01] yes, my worry is that the split happens in prod let's say end of q4, we still don't have the new hosts... so 1024/5 will suddenly lose some tables
[13:06:35] dhinus: we aren't going to delete tables automatically
[13:06:47] true, but can they still replicate from prod?
[13:07:00] replicate what?
[13:07:22] the data that will be x4 in prod... how does it get to the replicas? can it go from x4 in prod -> s4 in clouddb?
[13:07:55] they will be moved to their x4 masters
[13:08:07] once moved tables will be deleted in s4, so they won't reach x4
[13:16:50] I'm not following... right now clouddb102[45] are replicating s4 from the sanitariums. after the split in prod, if we don't do anything I expect clouddb102[45] will keep replicating s4, but they won't replicate new data being written into x4
[13:18:18] dhinus: 1024 and 1025 are originally planned to be on x4, and thus, they'll be moved to x4 master, so they'll have x4 data
[13:18:51] but if we do that, we lose again the redundancy for s4... because s4 would be left only on 1015
[13:19:06] Well yes, of course, because clouddb1019 had HW issues
[13:19:11] But we don't have any other hosts
[13:19:13] unless we already have 1032 up and running, but it looks like it's taking longer than Q4?
[13:19:27] dhinus: yes that's the whole problem
[13:19:49] lag may happen again
[13:19:55] on bacula db
[13:20:23] marostegui: so you're saying we'll have to accept having no redundancy on s4 between the day of the split in prod, and the day we finally get 1032?
[13:20:33] dhinus: we have no more hardware
[13:21:58] I was wondering if we could temporarily have a host with 3 sections (e.g. s3, x3 and x4), until we get the new hardware...
[13:22:34] We may, for now let's address the current issue
[13:22:45] ok!
[13:53:34] o/, I want to clean up after myself in T422546 -- I created temporary tables for the (rejected) new ICU upgrade process. The `sql.php` user from deployment hosts (understandably) doesn't have DROP TABLE privileges. How should I do the cleanup?
[13:53:35] T422546: Clean up after the ICU 72 upgrade - https://phabricator.wikimedia.org/T422546
[13:53:59] (I don't necessarily have to drop tables on a Friday :D just wondering how to go about it)
[13:55:37] Raine: can I bash this? :P
[13:55:55] ihurbain: sure :D
[13:56:36] !bash (I don't necessarily have to drop tables on a Friday :D just wondering how to go about it)
[13:56:36] ihurbain: Stored quip at https://bash.toolforge.org/quip/XF-6m50B8tZ8Ohr0rtlA
[14:11:36] Raine: If you specify the wikis the tables are at and the name of the tables, I can do it for you next week
[14:12:25] marostegui: great, thanks, I'll plop it on the task and ping you
[14:12:35] you can assign the task to me if you want
[14:52:25] marostegui: can you please review this announcement? https://etherpad.wikimedia.org/p/FJfXKQLwuHX49XPPBRY7
[14:52:39] marostegui: assigned, thank you!
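(Aside on the 13:53 cleanup question: a hedged sketch of what the privileged cleanup might look like. The actual wiki databases and table names were left for T422546 and are not reproduced here; "testwiki" and "tmp_icu72_sortkeys" below are invented placeholders.)

    -- Sketch only: database and table names are hypothetical placeholders,
    -- not the real tables from T422546. Run as a privileged account, since
    -- the sql.php deployment user lacks DROP TABLE.
    USE testwiki;
    SHOW TABLES LIKE 'tmp\_icu72%';           -- confirm what exists before dropping
    DROP TABLE IF EXISTS tmp_icu72_sortkeys;  -- repeat per wiki/table listed on the task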
[14:53:07] dhinus: looks good, I'd add the task where clouddb1019 crashed
[14:53:10] Raine: thank you
[14:53:16] marostegui: good point let me add it
[14:53:43] dhinus: other than that, +1
[14:55:09] added, check now :)
[14:55:43] dhinus: looks good, thank you
[14:55:52] ack sending!
[15:00:13] (waiting for a +1 from wmcs as well, just in case...)