[08:41:55] FIRING: MaxConntrack: Max conntrack at 82.97% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:11:55] RESOLVED: MaxConntrack: Max conntrack at 81.2% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:17:25] FIRING: MaxConntrack: Max conntrack at 81.83% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:32:25] RESOLVED: MaxConntrack: Max conntrack at 80.41% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [11:25:18] jhathaway: o/ I was testing the disk hot swap on ms-be2088 (https://phabricator.wikimedia.org/T384003), I see that you have worked/reimaged it.. If you need the host I can use a different one, let's coordinate otherwise DCops will not like me :D [11:51:55] FIRING: MaxConntrack: Max conntrack at 80.02% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [11:56:55] RESOLVED: MaxConntrack: Max conntrack at 80.02% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [12:07:55] FIRING: MaxConntrack: Max conntrack at 80.59% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [12:12:55] RESOLVED: MaxConntrack: Max conntrack at 80.59% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:11:55] FIRING: MaxConntrack: Max conntrack at 80.26% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:16:55] RESOLVED: MaxConntrack: Max conntrack at 80.26% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [13:17:44] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [13:22:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [13:27:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [13:44:34] FIRING: DiskSpace: Disk space krb1001:9100:/ 4.266% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:03:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:56] FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:08:25] FIRING: [4x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:24] moritzm: o/ [14:12:43] something changed in the traffic handled by krb1001, or similar [14:12:46] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=krb1001&var-datasource=thanos&var-cluster=misc&from=now-2d&to=now [14:13:14] do you know if anything has changed? [14:13:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:51] nothing I'm aware of, usually some analytics service changed and that bumps usage quite a but [14:14:06] but i currenly need to look into some ganeti issue [14:14:52] I am going to check [14:17:43] sorry elukey, should have left a note last night, I have been using that host to try and reproduce the double d-i installer issue [14:17:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [14:18:28] jhathaway: np! <3 [14:18:52] happy to use another host, if you want to use that one, or leave it re-imaged at the end of the day [14:19:03] just ran out of time last evening [14:19:56] RESOLVED: MaxConntrack: Max conntrack at 99.99% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:20:08] nono please go ahead! [14:20:28] I have selected the debian os to boot because I connected via mgmt console [14:20:38] not sure if I have interfered with an ongoing test [14:22:17] not at all, it was in a half installed stated last night, before I gave up [14:25:28] FIRING: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:27:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [14:33:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:55] FIRING: MaxConntrack: Max conntrack at 81.27% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:40:28] RESOLVED: SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:42:56] RESOLVED: MaxConntrack: Max conntrack at 81.27% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [14:47:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [14:57:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [15:04:34] RESOLVED: DiskSpace: Disk space krb1001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [15:04:39] krb1001 good, a heavy analytics job was hitting presto hard [15:06:17] should we reduce the logging verbosity or rate-limit/throttle in the longer run? [15:08:03] could be good yes, not sure how to do it though [15:08:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:38] we looked into it in the past, but the problem is that Presto simply does an excessive amount of queries [15:09:41] instead of using Kerberos as it's designed for, i.e. obtaining a ticket and then using that for auth [15:11:03] but Ben is planning to switch the intra-cluster communication towards JWT, that address it as well: https://phabricator.wikimedia.org/T358196 [15:11:11] ack [15:11:24] (intra-cluster communication within Presto) [15:23:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:25] FIRING: [3x] SystemdUnitFailed: logrotate.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:25] RESOLVED: SystemdUnitFailed: logrotate.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:36:00] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10587904 (10cmooney) [20:57:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:02:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting [23:07:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting