[00:44:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[00:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:54:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[00:59:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:30:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[03:35:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[03:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:04:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:21:56] FIRING: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[06:16:56] FIRING: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:16:56] FIRING: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:01:19] cloud-services-team, Data-Services (Quota-requests): User has exceeded the 'max_user_connections' (10) on Toolforge DB replicas - https://phabricator.wikimedia.org/T384119#10516443 (MBH) Thank you. //at the moment we use utf8mb3 across all tables// - it isn't true at least for all wikiproject DBs, as...
[12:13:13] (CR) Andrew Bogott: [C:+2] restart_openstack: restart cinder-api [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1115509 (owner: Andrew Bogott)
[12:16:56] RESOLVED: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:16:56] RESOLVED: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:17:35] (Merged) jenkins-bot: restart_openstack: restart cinder-api [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1115509 (owner: Andrew Bogott)
[12:33:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[12:41:06] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:42:49] cloud-services-team, Data-Services (Quota-requests): User has exceeded the 'max_user_connections' (10) on Toolforge DB replicas - https://phabricator.wikimedia.org/T384119#10516581 (fnegri) > at the moment we use utf8mb3 across all tables - it isn't true at least for all wikiproject DBs, as far as I...
[12:43:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[12:45:02] cloud-services-team, Data-Services (Quota-requests): User has exceeded the 'max_user_connections' (10) on Toolforge DB replicas - https://phabricator.wikimedia.org/T384119#10516583 (MBH) So, maybe you'll convert it too?
[12:46:06] RESOLVED: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:50:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:55:06] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:00:29] cloud-services-team, Data-Persistence: meta_p: Don't use utf8mb3 charset and collation - https://phabricator.wikimedia.org/T385456 (fnegri) NEW
[13:05:08] cloud-services-team, Tool-openstack-browser: Openstack Browser missing many projects - https://phabricator.wikimedia.org/T385459 (Andrew) NEW
[13:09:48] FIRING: PuppetFailure: Puppet has failed on cloudcephosd2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:09:48] FIRING: [2x] PuppetFailure: Puppet has failed on cloudcephosd1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:09:53] cloud-services-team: PuppetFailure Puppet has failed on cloudcephosd2004-dev:9100 - https://phabricator.wikimedia.org/T385461 (phaultfinder) NEW
[13:09:53] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385462 (phaultfinder) NEW
[13:11:09] cloud-services-team, Data-Services (Quota-requests): User has exceeded the 'max_user_connections' (10) on Toolforge DB replicas - https://phabricator.wikimedia.org/T384119#10516766 (fnegri) I created {T385456} to discuss that with the #data-persistence team as they maintain the database schemas. I b...
[13:14:48] FIRING: PuppetFailure: Puppet has failed on testhost2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:14:48] FIRING: [6x] PuppetFailure: Puppet has failed on cloudcephmon1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:14:59] cloud-services-team: PuppetFailure Puppet has failed on testhost2001:9100 - https://phabricator.wikimedia.org/T385464 (phaultfinder) NEW
[13:15:00] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385462#10516789 (phaultfinder)
[13:15:42] cloud-services-team, Data-Persistence: meta_p: Don't use utf8mb3 charset and collation - https://phabricator.wikimedia.org/T385456#10516798 (fnegri) Related: {T194125}
[13:15:55] cloud-services-team, Data-Persistence: meta_p: Don't use utf8mb3 charset and collation - https://phabricator.wikimedia.org/T385456#10516803 (fnegri)
[13:19:48] FIRING: [3x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:19:48] FIRING: [23x] PuppetFailure: Puppet has failed on cloudcephmon1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:19:55] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385462#10516817 (phaultfinder)
[13:19:56] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385462#10516818 (phaultfinder)
[13:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:22:35] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385462#10516844 (jcrespo) →Duplicate dup:T380960
[13:22:38] cloud-services-team, Patch-For-Review: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10516839 (jcrespo)
[13:22:39] cloud-services-team: PuppetFailure Puppet has failed on cloudcephosd2004-dev:9100 - https://phabricator.wikimedia.org/T385461#10516846 (jcrespo) →Duplicate dup:T380960
[13:22:41] cloud-services-team: PuppetFailure Puppet has failed on testhost2001:9100 - https://phabricator.wikimedia.org/T385464#10516845 (jcrespo) →Duplicate dup:T380960
[13:22:53] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-7
[13:23:40] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-7
[13:23:45] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-8
[13:24:20] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-8
[13:24:48] FIRING: [5x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:24:48] FIRING: [47x] PuppetFailure: Puppet has failed on cloudcephmon1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:24:56] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385465 (phaultfinder) NEW
[13:24:58] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385466 (phaultfinder) NEW
[13:25:37] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-control-9, tools-k8s-ingress-7, tools-k8s-ingress-8, tools-k8s-ingress-9
[13:27:16] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385465#10516883 (aborrero) →Duplicate dup:T380960
[13:27:17] cloud-services-team, Patch-For-Review: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10516885 (aborrero)
[13:27:35] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385466#10516907 (aborrero) →Duplicate dup:T380960
[13:27:39] cloud-services-team, Patch-For-Review: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10516909 (aborrero)
[13:29:10] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-control-9, tools-k8s-ingress-7, tools-k8s-ingress-8, tools-k8s-ingress-9
[13:29:14] FIRING: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[13:29:48] FIRING: [6x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:29:48] FIRING: [75x] PuppetFailure: Puppet has failed on cloudbackup2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:29:52] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385469 (phaultfinder) NEW
[13:29:54] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385470 (phaultfinder) NEW
[13:31:39] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385470#10516988 (aborrero) →Duplicate dup:T380960
[13:31:45] cloud-services-team, Patch-For-Review: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10516990 (aborrero)
[13:31:52] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385469#10516993 (aborrero) →Duplicate dup:T380960
[13:31:55] cloud-services-team: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10516995 (aborrero)
[13:34:14] RESOLVED: [5x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[13:34:48] FIRING: [7x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:34:48] FIRING: [93x] PuppetFailure: Puppet has failed on cloudbackup1002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:34:53] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385471 (phaultfinder) NEW
[13:34:55] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385472 (phaultfinder) NEW
[13:35:25] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385472#10517034 (aborrero) →Duplicate dup:T380960
[13:35:28] cloud-services-team: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10517036 (aborrero)
[13:35:33] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385471#10517039 (aborrero) →Duplicate dup:T380960
[13:35:35] cloud-services-team: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10517041 (aborrero)
[13:35:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:39:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:39:48] FIRING: [109x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:39:55] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385474 (phaultfinder) NEW
[13:39:58] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385475 (phaultfinder) NEW
[13:47:27] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385474#10517090 (aborrero) →Duplicate dup:T380960
[13:47:33] cloud-services-team, Patch-For-Review: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10517092 (aborrero)
[13:47:39] cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T385475#10517095 (aborrero) →Duplicate dup:T380960
[13:47:44] cloud-services-team, Patch-For-Review: kernel error detector: have a way to ignore certain messages - https://phabricator.wikimedia.org/T380960#10517097 (aborrero)
[13:54:32] cloud-services-team, Tool-openstack-browser: Openstack Browser missing many projects - https://phabricator.wikimedia.org/T385459#10517129 (Andrew) Open→Invalid ...and now it works.
[13:54:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:54:48] FIRING: [109x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:55:56] FIRING: [6x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudcephosd1014. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:59:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:59:48] FIRING: [109x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:00:56] FIRING: [85x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:01:50] cloud-services-team, Toolforge, Tools: Flickr blocking image requests from Toolforge k8s, breaking multiple tools - https://phabricator.wikimedia.org/T384468#10517167 (AntiCompositeNumber) You should be using a User-Agent that identifies your particular bot.
[14:04:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:04:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:04:48] FIRING: [109x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:05:56] FIRING: [87x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:09:06] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:09:48] FIRING: [8x] PuppetFailure: Puppet has failed on cloudcontrol2007-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:09:48] FIRING: [109x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:10:45] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:10:56] FIRING: [90x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:12:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:14:48] FIRING: [5x] PuppetFailure: Puppet has failed on cloudcontrol2008-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:14:48] FIRING: [103x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:15:45] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:15:56] FIRING: [92x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:17:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:19:48] RESOLVED: [4x] PuppetFailure: Puppet has failed on cloudcontrol2008-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:19:49] FIRING: [85x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:22:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:24:48] FIRING: [59x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:25:56] FIRING: [85x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:29:48] FIRING: [35x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:30:56] FIRING: [85x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:34:48] FIRING: [26x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:35:56] FIRING: [95x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:40:33] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-haproxy-5, tools-k8s-haproxy-6 [14:40:39] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-haproxy-5, tools-k8s-haproxy-6 [14:40:56] FIRING: [95x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:45:56] FIRING: [94x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:47:04] 10Tool-openstack-browser: Openstack Browser lists a user a user that horizon doesn't - https://phabricator.wikimedia.org/T382152#10517390 (10Andrew) It looks to me like the role assignment for JBennet is dangling and the user itself has been deleted from ldap. Does that seem possible/likely? [14:50:56] FIRING: [95x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:56:58] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops, 07User-notice: Communication for Wikitech/Wikimedia Developer Account migration - https://phabricator.wikimedia.org/T373615#10517436 (10jijiki) 05Open→03Resolved a:03jijiki I think this task has served its purpose, closing [14:58:00] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops: Cleanup: Wikitech code leftovers - https://phabricator.wikimedia.org/T371378#10517446 (10jijiki) [14:58:10] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops: ☂ Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707#10517447 (10jijiki) [14:58:29] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops: Cleanup: Wikitech code leftovers - https://phabricator.wikimedia.org/T371378#10517461 (10jijiki) [14:58:39] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops: ☂ Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707#10517462 (10jijiki) [15:00:25] 10wikitech.wikimedia.org: OAuth consumers registered locally at Wikitech are no longer configured to be used - https://phabricator.wikimedia.org/T376188#10517467 (10jijiki) [15:00:26] 10wikitech.wikimedia.org, 10MW-on-K8s, 06serviceops: ☂ Migrate Wikitech to Kubernetes - 
https://phabricator.wikimedia.org/T292707#10517468 (jijiki) [15:00:32] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859#10517469 (jijiki) [15:00:56] FIRING: [95x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:02:00] wikitech.wikimedia.org, MW-on-K8s, serviceops: ☂ Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707#10517475 (jijiki) In progress→Resolved I think we can close this, kudos to everyone who worked on making this happen! [15:05:56] FIRING: [92x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:10:56] FIRING: [93x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:15:56] FIRING: [94x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:20:56] FIRING: [95x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:25:56] FIRING: [95x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:27:34] cloud-services-team, Cloud-VPS, IPv6, Patch-For-Review: horizon: enable the UI to select networks on VM creation panel - https://phabricator.wikimedia.org/T380081#10517610 (Andrew) a:Andrew→aborrero This panel is now enabled in codfw1dev; the panel takes a default so that interacting with... [15:30:56] FIRING: [95x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:40:56] FIRING: [93x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:45:56] FIRING: [92x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:46:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate pontoon-kafka-01.monitoring.eqiad.wmflabs is about to expire in 27d 23h 58m 16s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:50:56] FIRING: [90x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:50:56] FIRING: [3x] SystemdUnitDown: The systemd unit prometheus-node-kernel-messages.service on node cloudcephosd1014 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:51:01] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T385491 (phaultfinder) NEW [15:52:03] wikitech.wikimedia.org, MW-on-K8s, serviceops: Cleanup: Wikitech code leftovers - https://phabricator.wikimedia.org/T371378#10517702 (Aklapper) [15:55:56] FIRING: [48x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:55:56] FIRING: [53x] SystemdUnitDown: The systemd unit prometheus-node-kernel-messages.service on node cloudcephmon1005 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:56:01] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T385491#10517724 (phaultfinder) [16:00:56] FIRING: [54x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:00:56] FIRING: [53x] SystemdUnitDown: The systemd unit prometheus-node-kernel-messages.service on node cloudcephmon1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:05:56] FIRING: [68x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:05:56] FIRING: [53x] SystemdUnitDown: The systemd unit prometheus-node-kernel-messages.service on node cloudcephmon1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:10:56] RESOLVED: [62x] SystemdUnitDown: The service unit prometheus-node-kernel-messages.service is in failed status on host cloudcephmon1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:10:56] RESOLVED: [37x] SystemdUnitDown: The systemd unit prometheus-node-kernel-messages.service on node cloudcephmon1005 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:26:10] cloud-services-team, Tool-openstack-browser: Openstack Browser missing many projects - https://phabricator.wikimedia.org/T385459#10517800 (bd808) @Andrew A likely better bug to file is that when anything goes wrong in openstack browser the error that gets displayed is the `are you just guessing?` mes... [16:39:16] (merge) fnegri: Create single-node clusters by default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/223 (https://phabricator.wikimedia.org/T385082) [16:42:48] cloud-services-team, Data-Persistence: meta_p: Don't use utf8mb3 charset and collation - https://phabricator.wikimedia.org/T385456#10517909 (bd808) I would guess the encodings in question were the default for the instance at the point in time the tables were created. The code in `maintain-meta_p.py` does... [16:43:12] cloud-services-team, Toolforge: [lima-kilo] when using "--ha", some containers are not restarting after restarting the VM - https://phabricator.wikimedia.org/T385082#10517912 (fnegri) p:High→Low [16:44:07] cloud-services-team, Data-Services, Data-Persistence: meta_p: Don't use utf8mb3 charset and collation - https://phabricator.wikimedia.org/T385456#10517917 (taavi) [16:44:10] cloud-services-team, Toolforge: [lima-kilo] when using "--ha", some containers are not restarting after restarting the VM - https://phabricator.wikimedia.org/T385082#10517918 (fnegri) In progress→Resolved With the patch above, non-HA becomes the default setup, so this issue becomes less urgen... 
[16:44:18] cloud-services-team, Toolforge: [lima-kilo] when using "--ha", some containers are not restarting after restarting the VM - https://phabricator.wikimedia.org/T385082#10517921 (fnegri) Resolved→Open [16:48:59] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up - https://phabricator.wikimedia.org/T385262#10517945 (fnegri) In progress→Resolved I can now ping the same IP successfully, and the alert is gone. Maybe rela... [17:01:46] cloud-services-team (Hardware), Cloud-VPS, DC-Ops, ops-eqiad, SRE: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10518010 (aborrero) >>! In T382412#10512402, @cmooney wrote: > I'm guessing you're gonna migrate by removing on... [17:03:41] cloud-services-team, Data-Services, Data-Persistence: meta_p: Don't use utf8mb3 charset and collation - https://phabricator.wikimedia.org/T385456#10518015 (fnegri) > the meta_p.wiki tables only contain ASCII compatible characters The `name` column contains many UTF characters, so I can see potential... 
[17:14:48] FIRING: [2x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:19:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on cloudbackup1001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:22:35] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518134 (Fabfur) ` |**Wikitech account/LDAP:**| fabfur| |**SUL account**| Fabfur-WMF| |**Account linked on [[ https://idm.wikimedia.org/ | IDM ]]** |Y| |**I have visited [[ https://wikitec... [18:01:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [18:03:10] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518438 (Reedy) >>! In T376267#10518134, @Fabfur wrote: > > |**Wikitech account/LDAP:**| fabfur| > |**SUL account**| Fabfur-WMF| > |**Account linked on [[ https://idm.wikimedia.org/ | IDM... 
[18:06:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [18:25:08] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518535 (CDobbins) |**Wikitech account/LDAP:**| cdobbins | |**SUL account**| CDobbins-WMF| |**Account linked on [[ https://idm.wikimedia.org/ | IDM ]]** |Y| |**I have visited [[ https://w... [18:30:04] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518556 (Aca) >>! In T376267#10518438, @Reedy wrote: > Would you like your wikitech account renaming and attaching? Yes, please! :) [18:31:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [18:42:58] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518620 (Fabfur) >>! In T376267#10518438, @Reedy wrote: >>>! In T376267#10518134, @Fabfur wrote: >> >> |**Wikitech account/LDAP:**| fabfur| >> |**SUL account**| Fabfur-WMF| >> |**Account... [18:54:46] cloud-services-team, Horizon: Horizon: obsessive redirects during logins - https://phabricator.wikimedia.org/T383370#10518651 (Andrew) When I reduce the number of keystone agents down to one, everything seems to work dependably. So I think this is some kind of split-brain between keystone agents; I suspe... 
[18:56:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [18:56:41] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [20:01:45] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518837 (Reedy) >>! In T376267#10518620, @Fabfur wrote: >>>! In T376267#10518438, @Reedy wrote: >>>>! In T376267#10518134, @Fabfur wrote: >>> >>> |**Wikitech account/LDAP:**| fabfur| >>>... [20:02:47] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518840 (Reedy) >>! In T376267#10518556, @Aca wrote: >>>! In T376267#10518438, @Reedy wrote: >> Would you like your wikitech account renaming and attaching? > Yes, please! :) Done! reedy... [20:03:24] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10518842 (Reedy) >>! In T376267#10518535, @CDobbins wrote: > > |**Wikitech account/LDAP:**| cdobbins | > |**SUL account**| CDobbins-WMF| > |**Account linked on [[ https://idm.wikimedia.org... [20:38:10] cloud-services-team, Horizon: Horizon: obsessive redirects during logins - https://phabricator.wikimedia.org/T383370#10518992 (Andrew) Forcing all keystone clients to use a shared memcached doesn't seem to help. 
[20:52:39] cloud-services-team, Horizon: Horizon: obsessive redirects during logins - https://phabricator.wikimedia.org/T383370#10519017 (Andrew) I thought that the keystone server was largely stateless, thanks to fernet keys... treating it like it's stateful resolves this issue but I wish I understood better what'... [21:11:26] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:16:26] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:18:32] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10519084 (CDobbins) >>! In T376267#10518842, @Reedy wrote: >>>! In T376267#10518535, @CDobbins wrote: >> >> |**Wikitech account/LDAP:**| cdobbins | >> |**SUL account**| CDobbins-WMF| >> |*... [21:22:45] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10519111 (Reedy) >>! In T376267#10519084, @CDobbins wrote: >>>! In T376267#10518842, @Reedy wrote: >>>>! In T376267#10518535, @CDobbins wrote: >>> >>> |**Wikitech account/LDAP:**| cdobbins... 
[21:44:21] cloud-services-team, Horizon: Horizon: requires access to openstack.eqiad1.wikimediacloud.org port 25000 - https://phabricator.wikimedia.org/T385527 (Andrew) NEW [21:47:35] cloud-services-team, Horizon: Horizon: requires access to openstack.eqiad1.wikimediacloud.org port 25000 - https://phabricator.wikimedia.org/T385527#10519171 (Andrew) Correction, further testing suggest that @AntiCompositeNumber can access that port after all, and just had latent bad effects from T383370... [22:41:26] cloud-services-team, Cloud-VPS: Unable to persistently set fs.inotify.max_user_instances and fs.inotify.max_user_watches - https://phabricator.wikimedia.org/T385530 (SDunlap) NEW [22:42:33] cloud-services-team, Cloud-VPS: Unable to persistently set fs.inotify.max_user_instances and fs.inotify.max_user_watches - https://phabricator.wikimedia.org/T385530#10519260 (SDunlap) [22:44:25] (open) raymond-ndibe: test commit [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/142 [22:48:19] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10519275 (Danmichaelo) |**Wikitech account/LDAP:**| Danmichaelo| |**SUL account**| Danmichaelo| |**Account linked on [[ https://idm.wikimedia.org/ | IDM ]]** |N| |**I have visited [[ https:... [23:00:21] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10519303 (Reedy) ` reedy@deploy2002:~$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=labswiki --userlist ~/usernames.txt DEPRECATION WARNING: Maintenance scripts are... 
[23:19:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks