[00:00:05] 06SRE-OnFire, 10Incident Tooling: Corto: Incident responder workflow automation (MVP) - https://phabricator.wikimedia.org/T356790#10328704 (10Eevans) p:05Triage→03Medium [00:18:45] PROBLEM - Disk space on Hadoop worker on an-worker1088 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [00:25:13] 06SRE-OnFire, 10Incident Tooling: Harden corto systemd service - https://phabricator.wikimedia.org/T372437#10328739 (10BCornwall) a:05BCornwall→03None [00:30:07] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1091475 (owner: 10TrainBranchBot) [00:30:22] (03CR) 10BCornwall: [C:03+1] "Baller." [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [00:38:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1091834 [00:38:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1091834 (owner: 10TrainBranchBot) [00:44:40] !log removing 103 files for legal compliance [00:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:45] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [00:57:09] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10328760 (10Platonides) >>! 
In T380009#10326074, @eoghan wrote: > So the issue is coming from the vrts_aliases.py cron job.... [01:08:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1091834 (owner: 10TrainBranchBot) [01:08:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1091835 [01:08:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1091835 (owner: 10TrainBranchBot) [01:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 934.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:36:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 877.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:37:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 800.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:38:13] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1091835 (owner: 10TrainBranchBot) [01:42:45] RESOLVED: MediaWikiLatencyExceeded: p75 latency 
high: codfw mw-parsoid (k8s) 848.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:57:11] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/7bc012c19bd1cc85b9b647637aba543df81b9adba0ed7a91e9277687094797e4/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:17:11] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:49:54] (03PS1) 10Bartosz Dziewoński: Rename config settings and functions referring to SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091839 (https://phabricator.wikimedia.org/T379811) [02:57:09] (03PS2) 10Bartosz Dziewoński: Rename config settings and functions referring to SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091839 (https://phabricator.wikimedia.org/T379811) [03:19:15] (03PS3) 10Bartosz Dziewoński: Rename everything referring to "SSO domain" to use "shared domain" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091839 (https://phabricator.wikimedia.org/T379811) [03:19:15] (03PS1) 10Bartosz Dziewoński: Rename shared domain sso.wikimedia.org to auth.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091841 (https://phabricator.wikimedia.org/T379811) [03:19:17] (03PS1) 10Bartosz Dziewoński: Use DB name rather than server name in shared domain path prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091842 
(https://phabricator.wikimedia.org/T379811) [03:22:18] (03PS1) 10Bartosz Dziewoński: Rename sso.wikimedia.beta.wmflabs.org to auth.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1091843 (https://phabricator.wikimedia.org/T379811) [03:24:28] (03PS2) 10Bartosz Dziewoński: Rename shared domain sso.wikimedia.org to auth.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091841 (https://phabricator.wikimedia.org/T379811) [03:24:50] (03PS2) 10Bartosz Dziewoński: Rename sso.wikimedia.beta.wmflabs.org to auth.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1091843 (https://phabricator.wikimedia.org/T379811) [03:25:01] (03PS2) 10Bartosz Dziewoński: Use DB name rather than server name in shared domain path prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091842 (https://phabricator.wikimedia.org/T379811) [03:32:43] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:35:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:50:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:43] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:31:05] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:31:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:31:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:40:59] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:41:01] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 4.616 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:41:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:31] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:55:02] (03PS4) 10Majavah: keepalived: Split failover config template to new class [puppet] - 10https://gerrit.wikimedia.org/r/1091732 (https://phabricator.wikimedia.org/T380057) [07:55:02] (03PS6) 10Majavah: keepalived::failover: Support IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091733 
(https://phabricator.wikimedia.org/T380057) [07:55:02] (03PS4) 10Majavah: dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) [07:55:03] (03PS9) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [07:55:04] (03PS8) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [07:55:05] (03PS1) 10Majavah: dynamicproxy: Run Redis update in app context [puppet] - 10https://gerrit.wikimedia.org/r/1091848 (https://phabricator.wikimedia.org/T379175) [08:16:18] (03PS5) 10Majavah: dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) [08:16:18] (03PS2) 10Majavah: dynamicproxy: Run Redis update in app context [puppet] - 10https://gerrit.wikimedia.org/r/1091848 (https://phabricator.wikimedia.org/T379175) [08:16:18] (03PS10) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [08:16:18] (03PS9) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [08:16:19] (03PS1) 10Majavah: dynamicproxy: Bind Redis on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091849 (https://phabricator.wikimedia.org/T379175) [08:31:59] (03CR) 10Majavah: [C:04-2] "needs to wait before eqiad1 has v6-capable hosts deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [08:33:49] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:34:31] 
RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:42:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1290:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1290 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:37:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1290:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1290 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:42:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1290:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1290 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:01:51] (03PS1) 10Majavah: opnestack: keystone: Do not provision DNS zones for service projects [puppet] - 10https://gerrit.wikimedia.org/r/1091850 (https://phabricator.wikimedia.org/T380095) [10:02:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1290:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1290 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:21:53] (03Abandoned) 10Zabe: Drop WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/1077345 (https://phabricator.wikimedia.org/T371592) (owner: 10Zabe) [11:16:42] 06SRE: [ceph] osd daemon died - 2024-11-06 - https://phabricator.wikimedia.org/T379142#10328905 (10Peachey88) 
[12:03:16] (PS1) Majavah: hieradata: Bump codfw1dev Horizon to 2024-11-16-115718 [puppet] - https://gerrit.wikimedia.org/r/1091854
[12:03:24] (CR) Majavah: [C:+2] hieradata: Bump codfw1dev Horizon to 2024-11-16-115718 [puppet] - https://gerrit.wikimedia.org/r/1091854 (owner: Majavah)
[12:37:54] Puppet, cloud-services-team, Cloud-VPS: Preserve formatting and comments etc. in ENC Hiera - https://phabricator.wikimedia.org/T250622#10328967 (taavi)
[12:38:03] Puppet, cloud-services-team, Cloud-VPS: Preserve formatting and comments etc. in ENC Hiera - https://phabricator.wikimedia.org/T250622#10328969 (taavi)
[13:39:30] (PS1) Majavah: Bump Horizon globally to 2024-11-16-131149 [puppet] - https://gerrit.wikimedia.org/r/1091860
[13:49:31] (CR) Majavah: [C:+2] Bump Horizon globally to 2024-11-16-131149 [puppet] - https://gerrit.wikimedia.org/r/1091860 (owner: Majavah)
[14:37:58] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:38:34] SRE, collaboration-services, Infrastructure-Foundations, Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10329055 (revi) >>! In T380009#10328760, @Platonides wrote: >>>! In T380009#10326074, @eoghan wrote: >> So the issue is c...
[16:17:38] here
[16:18:47] I'm fixing it right now
[16:18:49] don't depool
[16:19:01] TY
[16:19:23] fixed now
[16:19:37] another one of the "index is corrupt"
[16:20:01] I'll create a task for tracking
[16:20:09] thanks
[16:20:55] PROBLEM - MariaDB Replica SQL: s7 #page on db2150 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: metawiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:21:51] Interesting that we only get this here now
[16:24:33] i'll depool
[16:26:46] SRE, collaboration-services, Infrastructure-Foundations, Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10329074 (LSobanski) We're in touch with ITS and will be doing follow up testing with them next week.
[16:27:36] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[16:30:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002"
[16:30:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002"
[16:30:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:31:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1319.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:31:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1314.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:31:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host
wikikube-worker1315.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:32:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1316.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:32:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1317.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:32:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1318.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:35:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1325.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:36:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1320.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:36:23] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1321.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:36:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1322.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:36:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1324.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:36:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1323.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:42:20] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1327.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [16:42:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1326.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:49:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1314.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:49:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1317.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:50:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1315.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:51:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1318.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:52:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1316.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:53:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1319.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:53:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1325.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:54:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1320.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:54:17] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host wikikube-worker1322.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:55:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1324.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:57:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1321.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:01:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1326.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:05:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1327.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:05:57] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:08:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1313.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:09:19] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [17:09:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [17:09:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:11:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1327.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:11:36] !log jclark@cumin1002 END (FAIL) 
- Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1327.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:14:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1323.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:16:17] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install wikikube-worker13[13-28] - https://phabricator.wikimedia.org/T378185#10329089 (10Jclark-ctr) [17:27:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:28:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:28:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:45:50] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:50:36] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [17:50:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [17:50:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:52:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:52:28] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:52:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1313.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:53:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-jumbo1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:55:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-jumbo1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:55:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-jumbo1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:56:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [17:56:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [17:56:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:56:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-jumbo1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:57:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [17:59:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-jumbo1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:59:25] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [18:00:52] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10329144 (10Jclark-ctr) [18:01:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:05:32] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [18:06:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1183.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:06:38] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10329146 (10Jclark-ctr) [18:08:59] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [18:09:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for wikikube-worker - jclark@cumin1002" [18:09:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:47:21] PROBLEM - MariaDB Replica SQL: s7 on db1171 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: metawiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:55:11] PROBLEM - MariaDB Replica Lag: s7 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:24:44] (03CR) 10Gergő Tisza: [C:03+1] Rename sso.wikimedia.beta.wmflabs.org to auth.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1091843 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [19:29:52] (03CR) 10Gergő Tisza: [C:03+1] Rename everything referring to "SSO domain" to use "shared domain" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091839 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [19:33:36] (03CR) 10Gergő Tisza: [C:03+1] Rename shared domain sso.wikimedia.org to auth.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091841 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [19:41:41] (03CR) 10Gergő Tisza: [C:03+1] Use DB name rather than server name in shared domain path prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091842 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [20:10:21] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:15:21] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:27:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:29:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:29:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:29:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:15:08] (03PS1) 10NMW03: Add contact form for U4C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) [23:15:48] (03CR) 10CI reject: [V:04-1] Add contact form for U4C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: 10NMW03) [23:17:01] (03CR) 10NMW03: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: 10NMW03) 
[23:17:36] (PS2) NMW03: Add contact form for U4C [mediawiki-config] - https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317)