[00:00:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:05:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:37] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:14:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P37533 and previous config saved to /var/cache/conftool/dbconfig/20221102-001401-ladsgroup.json [00:24:15] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:24:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T318605)', diff saved to https://phabricator.wikimedia.org/P37534 and previous config saved to /var/cache/conftool/dbconfig/20221102-002433-ladsgroup.json [00:24:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [00:29:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37535 and previous config saved to 
/var/cache/conftool/dbconfig/20221102-002913-ladsgroup.json [00:29:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [00:29:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [00:29:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T318605)', diff saved to https://phabricator.wikimedia.org/P37536 and previous config saved to /var/cache/conftool/dbconfig/20221102-002936-ladsgroup.json [00:29:42] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [00:39:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P37537 and previous config saved to /var/cache/conftool/dbconfig/20221102-003941-ladsgroup.json [00:45:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P37538 and previous config saved to /var/cache/conftool/dbconfig/20221102-005451-ladsgroup.json [01:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T318605)', diff saved to https://phabricator.wikimedia.org/P37539 and previous config saved to /var/cache/conftool/dbconfig/20221102-010430-ladsgroup.json [01:04:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:09:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2181 (T318605)', diff saved to https://phabricator.wikimedia.org/P37540 and previous config saved to /var/cache/conftool/dbconfig/20221102-010958-ladsgroup.json [01:10:03] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:15:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P37541 and previous config saved to /var/cache/conftool/dbconfig/20221102-011937-ladsgroup.json [01:21:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:49] (Traffic bill over quota) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [01:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P37542 and previous config saved to /var/cache/conftool/dbconfig/20221102-013447-ladsgroup.json [01:38:45] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:45] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:55] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T318605)', diff saved to https://phabricator.wikimedia.org/P37543 and previous config saved to /var/cache/conftool/dbconfig/20221102-014955-ladsgroup.json [01:49:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [01:50:00] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:50:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [01:50:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T318605)', diff saved to https://phabricator.wikimedia.org/P37544 and previous config saved to /var/cache/conftool/dbconfig/20221102-015019-ladsgroup.json [01:52:49] (Traffic bill over quota) resolved: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:07] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh2001 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed (within 3600 
seconds). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [02:24:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T318605)', diff saved to https://phabricator.wikimedia.org/P37545 and previous config saved to /var/cache/conftool/dbconfig/20221102-022408-ladsgroup.json [02:24:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:30:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:39:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P37546 and previous config saved to /var/cache/conftool/dbconfig/20221102-023919-ladsgroup.json [02:54:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P37547 and previous config saved to /var/cache/conftool/dbconfig/20221102-025427-ladsgroup.json [02:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:00:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[03:06:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T318605)', diff saved to https://phabricator.wikimedia.org/P37548 and previous config saved to /var/cache/conftool/dbconfig/20221102-030934-ladsgroup.json [03:09:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [03:09:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:09:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [03:09:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T318605)', diff saved to https://phabricator.wikimedia.org/P37549 and previous config saved to /var/cache/conftool/dbconfig/20221102-030959-ladsgroup.json [03:18:45] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:24:47] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [03:36:55] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:43:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T318605)', diff saved to https://phabricator.wikimedia.org/P37550 and previous config saved to /var/cache/conftool/dbconfig/20221102-034312-ladsgroup.json [03:43:17] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:55:17] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.17 ms [03:58:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P37551 and previous config saved 
to /var/cache/conftool/dbconfig/20221102-035821-ladsgroup.json [04:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [04:13:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P37552 and previous config saved to /var/cache/conftool/dbconfig/20221102-041330-ladsgroup.json [04:15:41] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:24:15] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:28:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T318605)', diff saved to https://phabricator.wikimedia.org/P37553 and previous config saved to /var/cache/conftool/dbconfig/20221102-042838-ladsgroup.json [04:28:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [04:28:42] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [04:28:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [04:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T318605)', diff saved to https://phabricator.wikimedia.org/P37554 and previous config saved to /var/cache/conftool/dbconfig/20221102-042904-ladsgroup.json [04:34:05] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [04:45:31] 
RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:23] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:48:54] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (phaultfinder) [04:51:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T318605)', diff saved to https://phabricator.wikimedia.org/P37555 and previous config saved to /var/cache/conftool/dbconfig/20221102-050222-ladsgroup.json [05:02:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:06:15] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - free space: / 1979 MB (2% inode=82%): /tmp 1979 MB (2% inode=82%): /var/tmp 1979 MB (2% inode=82%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [05:06:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:07:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:07:59] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:13:47] (Primary inbound port utilisation over 80% 
#page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:13:47] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:15:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:16:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:16:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P37556 and previous config saved to /var/cache/conftool/dbconfig/20221102-051730-ladsgroup.json [05:19:07] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [05:21:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:21:46] 
(Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [05:23:47] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:23:47] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [05:27:11] RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [05:30:43] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), No backups: 1 (dispatch-be1001), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:32:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P37557 and previous config saved to /var/cache/conftool/dbconfig/20221102-053238-ladsgroup.json [05:37:53] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:39:25] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [05:39:31] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:45:27] RECOVERY - 
Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.00 ms [05:47:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T318605)', diff saved to https://phabricator.wikimedia.org/P37558 and previous config saved to /var/cache/conftool/dbconfig/20221102-054747-ladsgroup.json [05:47:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:47:51] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [05:48:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:59:53] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:00:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:05:55] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [06:06:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:49] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:15:29] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:28:47] PROBLEM - Cxserver LVS codfw on 
cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [06:30:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:30:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:30:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37559 and previous config saved to /var/cache/conftool/dbconfig/20221102-063038-marostegui.json [06:30:51] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [06:31:51] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:31] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:36:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37560 and previous config saved to /var/cache/conftool/dbconfig/20221102-063653-marostegui.json [06:37:03] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [06:38:51] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [06:41:57] (PS1) Marostegui: db1143: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/852036 
(https://phabricator.wikimedia.org/T320773) [06:43:00] (CR) Marostegui: [C: +2] db1143: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/852036 (https://phabricator.wikimedia.org/T320773) (owner: Marostegui) [06:43:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37561 and previous config saved to /var/cache/conftool/dbconfig/20221102-064357-root.json [06:52:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P37562 and previous config saved to /var/cache/conftool/dbconfig/20221102-065203-marostegui.json [06:59:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 3%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37563 and previous config saved to /var/cache/conftool/dbconfig/20221102-065903-root.json [06:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:00:04] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T0700). Please do the needful. [07:00:05] Urbanecm: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:09] i'll self-serve [07:00:29] (PS3) Urbanecm: Deploy GrowthExperiments to 100% users at all wikis but dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) [07:00:35] (CR) Urbanecm: [C: +2] Deploy GrowthExperiments to 100% users at all wikis but dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm) [07:00:56] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm) [07:01:22] (Merged) jenkins-bot: Deploy GrowthExperiments to 100% users at all wikis but dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm) [07:01:56] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851604|Deploy GrowthExperiments to 100% users at all wikis but dewiki (T320876)]] [07:02:08] T320876: Ensure all newcomers receive Growth features (end the 20% control group) - https://phabricator.wikimedia.org/T320876 [07:02:23] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:851604|Deploy GrowthExperiments to 100% users at all wikis but dewiki (T320876)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:03:31] (PS1) Marostegui: pc2012,pc2014: Promote pc2014 to master [puppet] - https://gerrit.wikimedia.org/r/852038 [07:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:04:45] (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - 
https://gerrit.wikimedia.org/r/852039 [07:04:53] (CR) Marostegui: [C: +2] pc2012,pc2014: Promote pc2014 to master [puppet] - https://gerrit.wikimedia.org/r/852038 (owner: Marostegui) [07:05:21] urbanecm: are you done? [07:05:28] marostegui: no, a sync is in progress [07:05:31] I'll ping you when finished! [07:05:41] (should be few more seconds) [07:05:55] urbanecm: Excellent, thanks, no rush :) [07:06:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851604|Deploy GrowthExperiments to 100% users at all wikis but dewiki (T320876)]] (duration: 04m 38s) [07:07:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P37564 and previous config saved to /var/cache/conftool/dbconfig/20221102-070712-marostegui.json [07:07:21] marostegui: I'm done, over to you. [07:08:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [07:09:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [07:09:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:10:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:13:50] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (phaultfinder) [07:14:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37565 and previous config saved to /var/cache/conftool/dbconfig/20221102-071410-root.json [07:14:51] urbanecm: thanks!! 
[07:16:25] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37566 and previous config saved to /var/cache/conftool/dbconfig/20221102-072220-marostegui.json [07:22:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance [07:22:26] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [07:22:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance [07:22:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T321123)', diff saved to https://phabricator.wikimedia.org/P37567 and previous config saved to /var/cache/conftool/dbconfig/20221102-072254-marostegui.json [07:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321123)', diff saved to https://phabricator.wikimedia.org/P37568 and previous config saved to /var/cache/conftool/dbconfig/20221102-072508-marostegui.json [07:29:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37569 and previous config saved to /var/cache/conftool/dbconfig/20221102-072916-root.json [07:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:40:53] !log 
marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:852039|ProductionServices.php: Promote pc2014 to pc2 master]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [07:42:17] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (3) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [07:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37571 and previous config saved to /var/cache/conftool/dbconfig/20221102-074422-root.json [07:44:42] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:852039|ProductionServices.php: Promote pc2014 to pc2 master]] (duration: 04m 13s) [07:46:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [07:47:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [07:47:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:48:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:49:43] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851022 [07:49:57] (03PS1) 10Marostegui: Revert "pc2012,pc2014: Promote pc2014 to master" [puppet] - 10https://gerrit.wikimedia.org/r/851023 [07:55:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P37572 and previous config saved to /var/cache/conftool/dbconfig/20221102-075527-marostegui.json 
[07:55:39] (PS2) Muehlenhoff: Remove access for ejoseph [puppet] - https://gerrit.wikimedia.org/r/851641 [07:55:49] (CR) Marostegui: [C: +2] Revert "pc2012,pc2014: Promote pc2014 to master" [puppet] - https://gerrit.wikimedia.org/r/851023 (owner: Marostegui) [07:57:12] (CR) Muehlenhoff: [C: +2] Remove access for ejoseph [puppet] - https://gerrit.wikimedia.org/r/851641 (owner: Muehlenhoff) [07:59:29] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/851022 (owner: Marostegui) [07:59:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37573 and previous config saved to /var/cache/conftool/dbconfig/20221102-075930-root.json [08:00:04] jeena and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T0800). 
[08:00:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:00:15] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/851022 (owner: Marostegui) [08:00:26] (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851022 (owner: Marostegui) [08:00:47] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:851022|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] [08:01:10] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:851022|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:02:50] (PS1) Marostegui: pc2014: Move it to pc3 [puppet] - https://gerrit.wikimedia.org/r/852124 [08:03:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:04:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:04:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:04:52] (PS5) Slyngshede: Bitu IDM, initial checkin [software/bitu] - https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) [08:04:56] (CR) Marostegui: [C: +2] pc2014: Move it to pc3 [puppet] - https://gerrit.wikimedia.org/r/852124 (owner: Marostegui) [08:05:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:05:48] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:851022|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] (duration: 05m 01s) 
[08:09:26] (03PS1) 10Marostegui: pc1012,pc1014: Promote pc1014 to master [puppet] - 10https://gerrit.wikimedia.org/r/852125 [08:10:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:10:31] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852126 [08:10:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321123)', diff saved to https://phabricator.wikimedia.org/P37574 and previous config saved to /var/cache/conftool/dbconfig/20221102-081034-marostegui.json [08:10:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:10:41] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [08:10:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:10:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T321123)', diff saved to https://phabricator.wikimedia.org/P37575 and previous config saved to /var/cache/conftool/dbconfig/20221102-081059-marostegui.json [08:11:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:11:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:11:35] (03PS2) 10Marostegui: pc1012,pc1014: Promote pc1014 to master [puppet] - 10https://gerrit.wikimedia.org/r/852125 [08:11:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:12:14] (03CR) 10Marostegui: [C: 03+2] pc1012,pc1014: Promote pc1014 to master [puppet] - 10https://gerrit.wikimedia.org/r/852125 (owner: 10Marostegui) [08:12:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (3) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory 
pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [08:12:24] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852126 (owner: 10Marostegui) [08:13:08] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852126 (owner: 10Marostegui) [08:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321123)', diff saved to https://phabricator.wikimedia.org/P37576 and previous config saved to /var/cache/conftool/dbconfig/20221102-081313-marostegui.json [08:13:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852126 (owner: 10Marostegui) [08:13:45] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:852126|ProductionServices.php: Promote pc1014 to pc2 master]] [08:13:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:14:08] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:852126|ProductionServices.php: Promote pc1014 to pc2 master]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:14:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37577 and previous config saved to /var/cache/conftool/dbconfig/20221102-081437-root.json [08:15:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host 
ganeti1016.eqiad.wmnet with OS bullseye [08:16:01] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1016.eqiad.wmnet with OS bullseye [08:16:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:17:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:17:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:18:29] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:852126|ProductionServices.php: Promote pc1014 to pc2 master]] (duration: 04m 43s) [08:18:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:20:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [08:22:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (3) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [08:23:34] (03PS1) 10Marostegui: Revert "pc1012,pc1014: Promote pc1014 to master" [puppet] - 10https://gerrit.wikimedia.org/r/851024 [08:23:48] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851025 [08:23:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:24:15] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:24:51] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:24:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:25:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:28:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P37578 and previous config saved to /var/cache/conftool/dbconfig/20221102-082822-marostegui.json [08:28:45] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37902/console" [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:29:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After upgrade an incident', diff saved to https://phabricator.wikimedia.org/P37579 and previous config saved to /var/cache/conftool/dbconfig/20221102-082942-root.json [08:30:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1016.eqiad.wmnet with reason: host reimage [08:33:29] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1028.eqiad.wmnet [08:33:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1016.eqiad.wmnet with reason: host reimage [08:34:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [08:36:40] !log draining ganeti1020 for eventual reimage T311687 [08:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:56] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [08:41:29] (03CR) 10Marostegui: [C: 03+2] Revert "pc1012,pc1014: Promote pc1014 to master" [puppet] - 10https://gerrit.wikimedia.org/r/851024 (owner: 10Marostegui) [08:41:39] 
(CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/851025 (owner: Marostegui) [08:42:10] (PS1) Muehlenhoff: Point profile::contacts::role_contacts for clouddumps to WMCS SREs [puppet] - https://gerrit.wikimedia.org/r/852127 [08:42:22] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/851025 (owner: Marostegui) [08:42:28] (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851025 (owner: Marostegui) [08:42:49] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:851025|Revert "ProductionServices.php: Promote pc1014 to pc2 master"]] [08:43:13] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:851025|Revert "ProductionServices.php: Promote pc1014 to pc2 master"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:43:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P37580 and previous config saved to /var/cache/conftool/dbconfig/20221102-084330-marostegui.json [08:45:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:45:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:45:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:45:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T318605)', diff saved to https://phabricator.wikimedia.org/P37581 and previous config saved to 
/var/cache/conftool/dbconfig/20221102-084540-ladsgroup.json [08:45:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:45:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:46:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:46:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:46:04] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:46:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [08:46:18] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] smokeping: add ensure parameter, set to present [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:46:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [08:46:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [08:46:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [08:46:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [08:46:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1028.eqiad.wmnet [08:46:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37582 and previous config saved to /var/cache/conftool/dbconfig/20221102-084643-ladsgroup.json 
[08:46:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [08:46:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [08:46:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T318955)', diff saved to https://phabricator.wikimedia.org/P37583 and previous config saved to /var/cache/conftool/dbconfig/20221102-084653-ladsgroup.json [08:47:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:47:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:47:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [08:47:05] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:851025|Revert "ProductionServices.php: Promote pc1014 to pc2 master"]] (duration: 04m 16s) [08:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T318950)', diff saved to https://phabricator.wikimedia.org/P37584 and previous config saved to /var/cache/conftool/dbconfig/20221102-084713-ladsgroup.json [08:47:18] PROBLEM - Check systemd state on ganeti1028 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:35] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: absent smokeping [puppet] - 10https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:47:47] (03PS4) 10Filippo Giunchedi: profile: absent smokeping [puppet] - 10https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860) [08:47:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:48:52] (03PS1) 10Marostegui: pc1014: Move it to pc1 [puppet] - 
10https://gerrit.wikimedia.org/r/852128 [08:48:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37585 and previous config saved to /var/cache/conftool/dbconfig/20221102-084853-ladsgroup.json [08:49:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318955)', diff saved to https://phabricator.wikimedia.org/P37586 and previous config saved to /var/cache/conftool/dbconfig/20221102-084910-ladsgroup.json [08:49:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318950)', diff saved to https://phabricator.wikimedia.org/P37587 and previous config saved to /var/cache/conftool/dbconfig/20221102-084927-ladsgroup.json [08:49:29] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [08:49:54] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/852128 (owner: 10Marostegui) [08:50:02] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [08:50:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1016.eqiad.wmnet with OS bullseye [08:50:08] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1016.eqiad.wmnet with OS bullseye completed: - ganeti1016 (**PASS**) - Downtimed on... 
[08:53:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:53:46] (CR) Phuedx: [C: +1] testwiki: Add mediawiki.visual_editor_feature_use stream [mediawiki-config] - https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming) [08:54:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:54:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:54:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:55:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [08:55:10] (CR) Filippo Giunchedi: [C: +2] smokeping: remove ancillary data [puppet] - https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:55:21] (CR) Filippo Giunchedi: [C: +2] smokeping: remove module and profile [puppet] - https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:55:27] (PS4) Filippo Giunchedi: smokeping: remove module and profile [puppet] - https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860) [08:57:35] (PS1) Filippo Giunchedi: acme_chief: use force to absent cert directory [puppet] - https://gerrit.wikimedia.org/r/852130 [08:58:03] (PS4) Filippo Giunchedi: smokeping: remove ancillary data [puppet] - https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860) [08:58:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321123)', diff saved to https://phabricator.wikimedia.org/P37588 and previous config saved to /var/cache/conftool/dbconfig/20221102-085838-marostegui.json [08:58:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: 
Maintenance [08:58:44] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [08:58:46] !log repool ms-fe10-12 [08:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:59:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37589 and previous config saved to /var/cache/conftool/dbconfig/20221102-085903-marostegui.json [08:59:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:59:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:00:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37590 and previous config saved to /var/cache/conftool/dbconfig/20221102-090007-ladsgroup.json [09:00:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:00:41] RECOVERY - Check systemd state on ganeti1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37591 and previous config saved to /var/cache/conftool/dbconfig/20221102-090119-marostegui.json [09:01:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [09:02:32] (03PS1) 10Filippo Giunchedi: wikimedia.org: remove smokeping.w.o [dns] - 10https://gerrit.wikimedia.org/r/852132 
(https://phabricator.wikimedia.org/T169860) [09:03:16] !log mvernon@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad [09:04:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P37592 and previous config saved to /var/cache/conftool/dbconfig/20221102-090404-ladsgroup.json [09:04:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37593 and previous config saved to /var/cache/conftool/dbconfig/20221102-090418-ladsgroup.json [09:04:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37594 and previous config saved to /var/cache/conftool/dbconfig/20221102-090435-ladsgroup.json [09:05:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1028.eqiad.wmnet to cluster eqiad and group C [09:06:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1028.eqiad.wmnet to cluster eqiad and group C [09:06:48] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] docs: Remove outdated github/travis badges [debs/pybal] - 10https://gerrit.wikimedia.org/r/817918 (owner: 10Krinkle) [09:12:46] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:16:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37595 and previous config saved to /var/cache/conftool/dbconfig/20221102-091606-ladsgroup.json [09:16:15] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:16:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to 
https://phabricator.wikimedia.org/P37596 and previous config saved to /var/cache/conftool/dbconfig/20221102-091628-marostegui.json [09:17:33] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:19:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P37597 and previous config saved to /var/cache/conftool/dbconfig/20221102-091912-ladsgroup.json [09:19:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37598 and previous config saved to /var/cache/conftool/dbconfig/20221102-091928-ladsgroup.json [09:19:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37599 and previous config saved to /var/cache/conftool/dbconfig/20221102-091942-ladsgroup.json [09:28:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet [09:29:15] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [09:31:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37600 and previous config saved to /var/cache/conftool/dbconfig/20221102-093115-ladsgroup.json [09:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P37601 and previous config saved to /var/cache/conftool/dbconfig/20221102-093135-marostegui.json [09:34:17] (03CR) 10Elukey: [C: 03+1] "Checked all the differences for ml-serve1002 (comparing also the new kubelet_config), and the config seems exactly the same as before (sem" [puppet] - 
10https://gerrit.wikimedia.org/r/851621 (owner: 10JMeybohm) [09:34:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37602 and previous config saved to /var/cache/conftool/dbconfig/20221102-093420-ladsgroup.json [09:34:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:34:31] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [09:34:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:34:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318955)', diff saved to https://phabricator.wikimedia.org/P37603 and previous config saved to /var/cache/conftool/dbconfig/20221102-093436-ladsgroup.json [09:34:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [09:34:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37604 and previous config saved to /var/cache/conftool/dbconfig/20221102-093443-ladsgroup.json [09:34:45] (03CR) 10Vgutierrez: [C: 03+2] aptrepo: Add thirdparty/haproxy26 [puppet] - 10https://gerrit.wikimedia.org/r/850416 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [09:34:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [09:34:53] (03PS2) 10Vgutierrez: aptrepo: Add thirdparty/haproxy26 [puppet] - 10https://gerrit.wikimedia.org/r/850416 (https://phabricator.wikimedia.org/T321775) [09:34:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318950)', diff saved to 
https://phabricator.wikimedia.org/P37605 and previous config saved to /var/cache/conftool/dbconfig/20221102-093453-ladsgroup.json [09:34:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [09:35:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T318955)', diff saved to https://phabricator.wikimedia.org/P37606 and previous config saved to /var/cache/conftool/dbconfig/20221102-093459-ladsgroup.json [09:35:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:35:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [09:35:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:35:17] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [09:35:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T318950)', diff saved to https://phabricator.wikimedia.org/P37607 and previous config saved to /var/cache/conftool/dbconfig/20221102-093517-ladsgroup.json [09:36:03] (03Abandoned) 10Vgutierrez: smokeping: Use asw1-b12-drmrs instead of lvs6001 [puppet] - 10https://gerrit.wikimedia.org/r/822373 (owner: 10Vgutierrez) [09:36:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet [09:36:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37608 and previous config saved to /var/cache/conftool/dbconfig/20221102-093657-ladsgroup.json [09:37:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [09:37:18] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2125 (T318955)', diff saved to https://phabricator.wikimedia.org/P37609 and previous config saved to /var/cache/conftool/dbconfig/20221102-093717-ladsgroup.json [09:37:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [09:37:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318950)', diff saved to https://phabricator.wikimedia.org/P37610 and previous config saved to /var/cache/conftool/dbconfig/20221102-093730-ladsgroup.json [09:40:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1016.eqiad.wmnet to cluster eqiad and group B [09:41:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1016.eqiad.wmnet to cluster eqiad and group B [09:45:15] (03PS3) 10Vgutierrez: cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - 10https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) [09:46:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37611 and previous config saved to /var/cache/conftool/dbconfig/20221102-094622-ladsgroup.json [09:46:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37612 and previous config saved to /var/cache/conftool/dbconfig/20221102-094644-marostegui.json [09:46:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:46:55] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:46:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance 
[09:47:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37613 and previous config saved to /var/cache/conftool/dbconfig/20221102-094709-marostegui.json [09:48:05] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - 10https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [09:48:16] (03CR) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:48:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] docs: Remove outdated github/travis badges [debs/pybal] - 10https://gerrit.wikimedia.org/r/817918 (owner: 10Krinkle) [09:49:22] (03Merged) 10jenkins-bot: docs: Remove outdated github/travis badges [debs/pybal] - 10https://gerrit.wikimedia.org/r/817918 (owner: 10Krinkle) [09:49:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37614 and previous config saved to /var/cache/conftool/dbconfig/20221102-094928-marostegui.json [09:49:43] (03CR) 10Vgutierrez: [C: 03+1] tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - 10https://gerrit.wikimedia.org/r/828404 (owner: 10Muehlenhoff) [09:50:00] (03PS2) 10Vgutierrez: cache::haproxy: Switch to HAProxy 2.6 on concurrency tracking instances [puppet] - 10https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) [09:50:57] (03PS4) 10Muehlenhoff: tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - 10https://gerrit.wikimedia.org/r/828404 [09:51:53] !log installing exim4 security updates [09:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to 
https://phabricator.wikimedia.org/P37615 and previous config saved to /var/cache/conftool/dbconfig/20221102-095205-ladsgroup.json [09:52:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37616 and previous config saved to /var/cache/conftool/dbconfig/20221102-095225-ladsgroup.json [09:52:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37617 and previous config saved to /var/cache/conftool/dbconfig/20221102-095237-ladsgroup.json [09:55:07] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert) [09:55:47] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Create mw-videoscaler helmfile deployment - https://phabricator.wikimedia.org/T321899 (Clement_Goubert) Open→Stalled Following https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/850095/comment/af29135f_66a53696/ We still hav... 
[09:56:23] (03PS1) 10Majavah: Drop toolschecker checks for the wikilabels database [puppet] - 10https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) [09:57:38] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:10] (03CR) 10CI reject: [V: 04-1] Drop toolschecker checks for the wikilabels database [puppet] - 10https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) (owner: 10Majavah) [09:59:50] (03PS2) 10Majavah: Drop toolschecker checks for the wikilabels database [puppet] - 10https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) [10:01:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37618 and previous config saved to /var/cache/conftool/dbconfig/20221102-100133-ladsgroup.json [10:01:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:01:46] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:01:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:01:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37619 and previous config saved to /var/cache/conftool/dbconfig/20221102-100156-ladsgroup.json [10:04:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37620 and previous config saved to /var/cache/conftool/dbconfig/20221102-100408-ladsgroup.json [10:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P37621 and previous config saved to /var/cache/conftool/dbconfig/20221102-100438-marostegui.json [10:06:28] (03PS1) 10Muehlenhoff: Set profile::contacts::role_contacts for three additional dumps roles [puppet] - 10https://gerrit.wikimedia.org/r/852137 [10:06:46] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10Vgutierrez) Current config isn't valid for HAProxy 2.6.6: `vgutierrez@deployment-cache-text07:~$ sudo -i haproxy -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/conf.d -c [NOTICE] (1505... [10:07:10] (03PS3) 10Majavah: Drop toolschecker checks for the wikilabels database [puppet] - 10https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) [10:07:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P37622 and previous config saved to /var/cache/conftool/dbconfig/20221102-100713-ladsgroup.json [10:07:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37623 and previous config saved to /var/cache/conftool/dbconfig/20221102-100736-ladsgroup.json [10:07:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37624 and previous config saved to /var/cache/conftool/dbconfig/20221102-100746-ladsgroup.json [10:07:59] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37904/console" [puppet] - 10https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) (owner: 10Majavah) [10:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:06] (CR) FNegri: [C: +1] "'git grep wikilabels' shows a few more matches... perhaps some of those can be dropped as well? e.g. modules/role/manifests/wmcs/db/wikila" [puppet] - https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) (owner: Majavah) [10:10:47] (CR) Majavah: [V: +1] Drop toolschecker checks for the wikilabels database (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) (owner: Majavah) [10:10:58] (PS2) Clément Goubert: admin: add mw on kubernetes namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) [10:12:59] (CR) JMeybohm: [V: +1 C: +2] Move kubelet configuration to file [puppet] - https://gerrit.wikimedia.org/r/851621 (owner: JMeybohm) [10:13:25] (CR) FNegri: [C: +2] Drop toolschecker checks for the wikilabels database [puppet] - https://gerrit.wikimedia.org/r/852135 (https://phabricator.wikimedia.org/T307389) (owner: Majavah) [10:13:47] SRE, Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (Ladsgroup) hmm, it can be that it's only available to super-admins. I can't say for sure. Anyway, doesn't matter. If you send the link to me, I can delete them for you. 
[10:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:21] (CR) Clément Goubert: admin: add mw on kubernetes namespaces (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: Clément Goubert) [10:16:30] (PS2) Clément Goubert: hieradata: Add usernames for mw on k8s services [puppet] - https://gerrit.wikimedia.org/r/850094 (https://phabricator.wikimedia.org/T321786) [10:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37625 and previous config saved to /var/cache/conftool/dbconfig/20221102-101916-ladsgroup.json [10:19:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P37626 and previous config saved to /var/cache/conftool/dbconfig/20221102-101946-marostegui.json [10:21:49] (CR) ArielGlenn: [C: +1] "Thumbs up from me." 
[puppet] - 10https://gerrit.wikimedia.org/r/852137 (owner: 10Muehlenhoff) [10:22:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37627 and previous config saved to /var/cache/conftool/dbconfig/20221102-102221-ladsgroup.json [10:22:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:22:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:22:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37628 and previous config saved to /var/cache/conftool/dbconfig/20221102-102233-ladsgroup.json [10:22:36] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [10:22:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318955)', diff saved to https://phabricator.wikimedia.org/P37629 and previous config saved to /var/cache/conftool/dbconfig/20221102-102243-ladsgroup.json [10:22:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [10:22:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318950)', diff saved to https://phabricator.wikimedia.org/P37630 and previous config saved to /var/cache/conftool/dbconfig/20221102-102256-ladsgroup.json [10:22:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:22:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [10:23:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with 
reason: Maintenance [10:23:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [10:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T318955)', diff saved to https://phabricator.wikimedia.org/P37631 and previous config saved to /var/cache/conftool/dbconfig/20221102-102310-ladsgroup.json [10:23:11] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:23:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37632 and previous config saved to /var/cache/conftool/dbconfig/20221102-102320-ladsgroup.json [10:23:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37633 and previous config saved to /var/cache/conftool/dbconfig/20221102-102342-ladsgroup.json [10:23:45] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:58] (03CR) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850173 (owner: 10Jbond) [10:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318955)', diff saved to https://phabricator.wikimedia.org/P37634 and previous config saved to /var/cache/conftool/dbconfig/20221102-102527-ladsgroup.json [10:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318950)', diff 
saved to https://phabricator.wikimedia.org/P37635 and previous config saved to /var/cache/conftool/dbconfig/20221102-102533-ladsgroup.json [10:28:45] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:30:04] 10SRE, 10Infrastructure-Foundations: Design and implement async LDAP operations - https://phabricator.wikimedia.org/T320427 (10SLyngshede-WMF) 05Open→03In progress a:03SLyngshede-WMF [10:30:07] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [10:31:11] (03PS1) 10Vgutierrez: haproxy: Produce valid configs for HAProxy 2.6.x [puppet] - 10https://gerrit.wikimedia.org/r/852141 (https://phabricator.wikimedia.org/T321775) [10:32:39] (03Abandoned) 10Jbond: break sretest [puppet] - 10https://gerrit.wikimedia.org/r/851107 (owner: 10Jbond) [10:32:57] (03PS3) 10Jbond: aptrepo: Add component pyall [puppet] - 10https://gerrit.wikimedia.org/r/850093 [10:33:02] (03PS2) 10Vgutierrez: haproxy: Produce valid configs for HAProxy 2.6.x [puppet] - 10https://gerrit.wikimedia.org/r/852141 (https://phabricator.wikimedia.org/T321775) [10:33:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [10:33:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [10:33:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [10:33:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:33:53] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:34:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T318605)', diff saved to https://phabricator.wikimedia.org/P37636 and previous config saved to /var/cache/conftool/dbconfig/20221102-103400-ladsgroup.json [10:34:14] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:34:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37637 and previous config saved to /var/cache/conftool/dbconfig/20221102-103424-ladsgroup.json [10:34:40] (03PS3) 10Vgutierrez: haproxy: Produce valid configs for HAProxy 2.6.x [puppet] - 10https://gerrit.wikimedia.org/r/852141 (https://phabricator.wikimedia.org/T321775) [10:34:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37638 and previous config saved to /var/cache/conftool/dbconfig/20221102-103453-marostegui.json [10:34:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:34:59] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:35:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:35:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:35:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:35:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 
clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:35:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:35:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T321123)', diff saved to https://phabricator.wikimedia.org/P37639 and previous config saved to /var/cache/conftool/dbconfig/20221102-103555-marostegui.json [10:36:23] (03CR) 10Jbond: "This change will affect most if not all spec tests in the puppet repo so please make sure to run `bundle exec rake global:parallel_spec` t" [puppet] - 10https://gerrit.wikimedia.org/r/851126 (owner: 10Andrew Bogott) [10:36:29] (03CR) 10Jbond: [C: 04-1] wmf spec tests: Update to test Bullseye/Xena [puppet] - 10https://gerrit.wikimedia.org/r/851126 (owner: 10Andrew Bogott) [10:36:50] (03CR) 10CI reject: [V: 04-1] haproxy: Produce valid configs for HAProxy 2.6.x [puppet] - 10https://gerrit.wikimedia.org/r/852141 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [10:37:19] (03CR) 10Jbond: [C: 04-1] wmf spec tests: Update to test Bullseye/Xena (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851126 (owner: 10Andrew Bogott) [10:38:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P37640 and previous config saved to /var/cache/conftool/dbconfig/20221102-103851-ladsgroup.json [10:39:21] (03CR) 10Jbond: [C: 04-1] "Sorry for splitting the comments over there posts, this is probably the one you want ;)" [puppet] - 10https://gerrit.wikimedia.org/r/851126 (owner: 10Andrew Bogott) [10:40:17] (03PS4) 10Vgutierrez: haproxy: Produce valid configs for HAProxy 2.6.x [puppet] - 10https://gerrit.wikimedia.org/r/852141 (https://phabricator.wikimedia.org/T321775) [10:40:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2126', diff saved to https://phabricator.wikimedia.org/P37641 and previous config saved to /var/cache/conftool/dbconfig/20221102-104034-ladsgroup.json [10:40:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37642 and previous config saved to /var/cache/conftool/dbconfig/20221102-104042-ladsgroup.json [10:40:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [10:42:59] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/851634 (owner: 10Zabe) [10:44:29] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37905/console" [puppet] - 10https://gerrit.wikimedia.org/r/852141 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [10:44:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/852127 (owner: 10Muehlenhoff) [10:44:43] (03CR) 10Jbond: [C: 03+2] aptrepo: Add component pyall [puppet] - 10https://gerrit.wikimedia.org/r/850093 (owner: 10Jbond) [10:45:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:18] 10SRE, 10Infrastructure-Foundations: Evaluate Striker codebase - https://phabricator.wikimedia.org/T319415 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [10:46:21] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [10:48:45] (JobUnavailable) firing: (13) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:48:59] (03CR) 10Hnowlan: [C: 03+2] Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [10:49:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37643 and previous config saved to /var/cache/conftool/dbconfig/20221102-104932-ladsgroup.json [10:49:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:49:38] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:49:38] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Produce valid configs for HAProxy 2.6.x [puppet] - 10https://gerrit.wikimedia.org/r/852141 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [10:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318605)', diff saved to https://phabricator.wikimedia.org/P37644 and previous config saved to /var/cache/conftool/dbconfig/20221102-104942-ladsgroup.json [10:49:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:49:55] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:49:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (10TAndic) Commenting approval as @HShaath-WMF 's direct manager. 
[10:53:45] (JobUnavailable) firing: (15) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:53:54] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37906/console" [puppet] - 10https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [10:54:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P37645 and previous config saved to /var/cache/conftool/dbconfig/20221102-105400-ladsgroup.json [10:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P37646 and previous config saved to /var/cache/conftool/dbconfig/20221102-105544-ladsgroup.json [10:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37647 and previous config saved to /var/cache/conftool/dbconfig/20221102-105551-ladsgroup.json [10:57:20] !log depool cp1075, cp2027 and cp3050 prior to HAProxy 2.6 upgrade - T321775 [10:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:27] 10SRE, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10SLyngshede-WMF) We need two databases, one for production and one for staging. 
[10:57:35] T321775: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 [10:58:24] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:45] (JobUnavailable) firing: (17) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:00:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:37] (03Merged) 10jenkins-bot: Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:00:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Switch to HAProxy 2.6 on concurrency tracking instances [puppet] - 10https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [11:02:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:03:06] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:03:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:03:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37648 and previous config saved to /var/cache/conftool/dbconfig/20221102-110314-ladsgroup.json [11:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:04:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin: add mw on kubernetes namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [11:04:29] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:04:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P37649 and previous config saved to /var/cache/conftool/dbconfig/20221102-110451-ladsgroup.json [11:05:06] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:06:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:24] (03PS1) 10Jbond: statistics::rsyncd: use nogroup for gid instead of nobody [puppet] - 10https://gerrit.wikimedia.org/r/852145 (https://phabricator.wikimedia.org/T322149) [11:09:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37650 and previous config saved to /var/cache/conftool/dbconfig/20221102-110909-ladsgroup.json [11:09:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:09:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:09:30] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:09:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T318950)', diff saved to https://phabricator.wikimedia.org/P37651 and previous config saved to /var/cache/conftool/dbconfig/20221102-110931-ladsgroup.json [11:10:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318955)', diff saved to https://phabricator.wikimedia.org/P37652 and previous config saved to /var/cache/conftool/dbconfig/20221102-111051-ladsgroup.json [11:10:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [11:10:59] T318955: Drop fr_comment and fr_text from 
flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:11:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37653 and previous config saved to /var/cache/conftool/dbconfig/20221102-111059-ladsgroup.json [11:11:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [11:11:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [11:11:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37654 and previous config saved to /var/cache/conftool/dbconfig/20221102-111113-ladsgroup.json [11:11:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [11:11:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [11:11:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [11:11:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [11:11:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [11:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318950)', diff saved to https://phabricator.wikimedia.org/P37655 and previous config saved to /var/cache/conftool/dbconfig/20221102-111141-ladsgroup.json [11:11:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T318950)', diff saved to https://phabricator.wikimedia.org/P37656 and previous config saved to 
/var/cache/conftool/dbconfig/20221102-111147-ladsgroup.json [11:13:28] !log pool cp1075, cp2027 and cp3050 running HAProxy 2.6.6 - T321775 [11:13:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37657 and previous config saved to /var/cache/conftool/dbconfig/20221102-111331-ladsgroup.json [11:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:00] T321775: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 [11:14:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318950)', diff saved to https://phabricator.wikimedia.org/P37658 and previous config saved to /var/cache/conftool/dbconfig/20221102-111400-ladsgroup.json [11:15:15] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:16:02] (03PS1) 10Jbond: nodegen: fix title parsing used by auto [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852147 [11:17:33] (03CR) 10Jbond: [C: 03+2] statistics::rsyncd: use nogroup for gid instead of nobody [puppet] - 10https://gerrit.wikimedia.org/r/852145 (https://phabricator.wikimedia.org/T322149) (owner: 10Jbond) [11:19:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37659 and previous config saved to /var/cache/conftool/dbconfig/20221102-111911-ladsgroup.json [11:19:23] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P37660 and previous config saved to /var/cache/conftool/dbconfig/20221102-111958-ladsgroup.json [11:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P37661 and previous config saved to /var/cache/conftool/dbconfig/20221102-112008-ladsgroup.json [11:22:14] (03CR) 10Jbond: [C: 03+2] nodegen: fix title parsing used by auto [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852147 (owner: 10Jbond) [11:22:17] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/852148 [11:22:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321123)', diff saved to https://phabricator.wikimedia.org/P37662 and previous config saved to /var/cache/conftool/dbconfig/20221102-112217-marostegui.json [11:22:34] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:23:59] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:24:12] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/852148 (owner: 10Jbond) [11:24:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:24:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[11:26:43] (PS1) JMeybohm: kubelet: Re-enable readOnlyPort (tcp/10255) for prometheus [puppet] - https://gerrit.wikimedia.org/r/852150 (https://phabricator.wikimedia.org/T300499) [11:26:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P37663 and previous config saved to /var/cache/conftool/dbconfig/20221102-112648-ladsgroup.json [11:27:03] (CR) Vgutierrez: [C: +1] wikimedia.org: remove smokeping.w.o [dns] - https://gerrit.wikimedia.org/r/852132 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [11:27:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 111 probes of 692 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:27:40] (CR) Filippo Giunchedi: [C: +1] kubelet: Re-enable readOnlyPort (tcp/10255) for prometheus [puppet] - https://gerrit.wikimedia.org/r/852150 (https://phabricator.wikimedia.org/T300499) (owner: JMeybohm) [11:28:32] (CR) JMeybohm: [C: +2] kubelet: Re-enable readOnlyPort (tcp/10255) for prometheus [puppet] - https://gerrit.wikimedia.org/r/852150 (https://phabricator.wikimedia.org/T300499) (owner: JMeybohm) [11:28:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37664 and previous config saved to /var/cache/conftool/dbconfig/20221102-112839-ladsgroup.json [11:28:46] (PS1) Marostegui: db-production: Disable writes one es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/852151 (https://phabricator.wikimedia.org/T322181) [11:29:08] (CR) Jbond: "sorry missed this but applied the same change ill abandon this one" [puppet] - https://gerrit.wikimedia.org/r/851661 (https://phabricator.wikimedia.org/T322149) (owner: Andrew
Bogott) [11:29:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37665 and previous config saved to /var/cache/conftool/dbconfig/20221102-112909-ladsgroup.json [11:29:12] (Abandoned) Jbond: rsyncd.pp: use gid 'nogroup' rather than 'nobody' [puppet] - https://gerrit.wikimedia.org/r/851661 (https://phabricator.wikimedia.org/T322149) (owner: Andrew Bogott) [11:30:12] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37908/console" [puppet] - https://gerrit.wikimedia.org/r/852130 (owner: Filippo Giunchedi) [11:31:09] SRE, Traffic, serviceops, Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Vgutierrez) DCs using the Let's Encrypt cert have the wikifunctions...
[11:31:36] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37909/console" [puppet] - https://gerrit.wikimedia.org/r/852130 (owner: Filippo Giunchedi) [11:33:10] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 61 probes of 692 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:33:45] (JobUnavailable) firing: (17) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:34:13] (CR) Ladsgroup: [C: +1] db-production: Disable writes one es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/852151 (https://phabricator.wikimedia.org/T322181) (owner: Marostegui) [11:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37666 and previous config saved to /var/cache/conftool/dbconfig/20221102-113419-ladsgroup.json [11:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318605)', diff saved to https://phabricator.wikimedia.org/P37667 and previous config saved to /var/cache/conftool/dbconfig/20221102-113506-ladsgroup.json [11:35:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [11:35:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P37668 and previous config saved to /var/cache/conftool/dbconfig/20221102-113515-ladsgroup.json [11:35:23] T318605: Deploy new externallinks fields to production - 
https://phabricator.wikimedia.org/T318605 [11:35:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [11:35:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T318605)', diff saved to https://phabricator.wikimedia.org/P37669 and previous config saved to /var/cache/conftool/dbconfig/20221102-113542-ladsgroup.json [11:37:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P37670 and previous config saved to /var/cache/conftool/dbconfig/20221102-113726-marostegui.json [11:38:07] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:38:27] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: use force to absent cert directory [puppet] - 10https://gerrit.wikimedia.org/r/852130 (owner: 10Filippo Giunchedi) [11:38:45] (JobUnavailable) firing: (17) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:39:00] (JobUnavailable) firing: (17) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:40:00] jouncebot: next [11:40:00] In 1 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T1300) [11:40:10] (03CR) 10Marostegui: [C: 03+2] db-production: Disable writes one es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852151 (https://phabricator.wikimedia.org/T322181) (owner: 10Marostegui) 
[11:40:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T322181 [11:40:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T322181 [11:40:52] (03Merged) 10jenkins-bot: db-production: Disable writes one es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852151 (https://phabricator.wikimedia.org/T322181) (owner: 10Marostegui) [11:41:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1021 with weight 0 T322181', diff saved to https://phabricator.wikimedia.org/P37671 and previous config saved to /var/cache/conftool/dbconfig/20221102-114107-root.json [11:41:58] T322181: Switchover es4 master (es1020 -> es1021) - https://phabricator.wikimedia.org/T322181 [11:41:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P37672 and previous config saved to /var/cache/conftool/dbconfig/20221102-114157-ladsgroup.json [11:42:09] (03PS3) 10Vgutierrez: Add wikimediaenteprise.com as a ncredir domain [dns] - 10https://gerrit.wikimedia.org/r/850167 (https://phabricator.wikimedia.org/T321804) [11:42:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852151 (https://phabricator.wikimedia.org/T322181) (owner: 10Marostegui) [11:42:39] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:852151|db-production: Disable writes one es4 (T322181)]] [11:43:03] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:852151|db-production: Disable writes one es4 (T322181)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [11:43:27] (03PS1) 10Marostegui: mariadb: Promote es1021 to es4 master [puppet] - 
10https://gerrit.wikimedia.org/r/852157 (https://phabricator.wikimedia.org/T322181) [11:43:45] (JobUnavailable) firing: (17) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:43:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37673 and previous config saved to /var/cache/conftool/dbconfig/20221102-114347-ladsgroup.json [11:43:49] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10serviceops, 10ARM support: SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" - https://phabricator.wikimedia.org/T320811 (10jbond) [11:44:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37674 and previous config saved to /var/cache/conftool/dbconfig/20221102-114416-ladsgroup.json [11:45:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:02] (03CR) 10Vgutierrez: [C: 03+2] Add wikimediaenteprise.com as a ncredir domain [dns] - 10https://gerrit.wikimedia.org/r/850167 (https://phabricator.wikimedia.org/T321804) (owner: 10Vgutierrez) [11:47:22] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:852151|db-production: Disable writes one es4 (T322181)]] (duration: 04m 43s) [11:47:31] T322181: Switchover es4 master (es1020 -> es1021) - https://phabricator.wikimedia.org/T322181 [11:47:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:48:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:48:47] !log mwdebug-deploy@deploy1002 
helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:48:57] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es1021 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/852157 (https://phabricator.wikimedia.org/T322181) (owner: 10Marostegui) [11:49:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37675 and previous config saved to /var/cache/conftool/dbconfig/20221102-114927-ladsgroup.json [11:49:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:50:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T318605)', diff saved to https://phabricator.wikimedia.org/P37676 and previous config saved to /var/cache/conftool/dbconfig/20221102-115023-ladsgroup.json [11:50:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:50:32] (03PS1) 10JMeybohm: Rename ml_k8s staging roles to match naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/852158 [11:50:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:50:46] (03PS1) 10Vgutierrez: Rename wikimediaenteprise.com zone to wikimediaenterprise.com [dns] - 10https://gerrit.wikimedia.org/r/852159 (https://phabricator.wikimedia.org/T321804) [11:51:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:51:47] (03CR) 10Vgutierrez: [C: 03+2] Rename wikimediaenteprise.com zone to wikimediaenterprise.com [dns] - 10https://gerrit.wikimedia.org/r/852159 (https://phabricator.wikimedia.org/T321804) 
(owner: 10Vgutierrez) [11:52:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P37677 and previous config saved to /var/cache/conftool/dbconfig/20221102-115233-marostegui.json [11:52:36] !log Starting es4 eqiad failover from es1020 to es1021 - T322181 [11:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:46] T322181: Switchover es4 master (es1020 -> es1021) - https://phabricator.wikimedia.org/T322181 [11:53:04] (03CR) 10Vgutierrez: [C: 03+1] "VTCs are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [11:53:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1021 to es4 primary T322181', diff saved to https://phabricator.wikimedia.org/P37678 and previous config saved to /var/cache/conftool/dbconfig/20221102-115313-root.json [11:54:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020', diff saved to https://phabricator.wikimedia.org/P37679 and previous config saved to /var/cache/conftool/dbconfig/20221102-115448-root.json [11:57:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318950)', diff saved to https://phabricator.wikimedia.org/P37680 and previous config saved to /var/cache/conftool/dbconfig/20221102-115705-ladsgroup.json [11:57:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:57:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:57:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [11:57:26] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:57:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 
on db1165.eqiad.wmnet with reason: Maintenance [11:57:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:57:51] (03CR) 10Vgutierrez: [C: 04-1] prometheus: Handle inactive trafficserver service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [11:57:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:58:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T318950)', diff saved to https://phabricator.wikimedia.org/P37681 and previous config saved to /var/cache/conftool/dbconfig/20221102-115802-ladsgroup.json [11:58:15] (03PS1) 10Marostegui: wmnet: Update es4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/852160 (https://phabricator.wikimedia.org/T322181) [11:58:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37682 and previous config saved to /var/cache/conftool/dbconfig/20221102-115855-ladsgroup.json [11:58:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [11:59:22] (03CR) 10Marostegui: [C: 03+2] wmnet: Update es4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/852160 (https://phabricator.wikimedia.org/T322181) (owner: 10Marostegui) [11:59:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318950)', diff saved to https://phabricator.wikimedia.org/P37683 and previous config saved to /var/cache/conftool/dbconfig/20221102-115925-ladsgroup.json [11:59:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [11:59:33] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [11:59:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T318955)', diff saved to https://phabricator.wikimedia.org/P37684 and previous config saved to /var/cache/conftool/dbconfig/20221102-115940-ladsgroup.json [11:59:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [11:59:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:59:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37685 and previous config saved to /var/cache/conftool/dbconfig/20221102-115948-ladsgroup.json [12:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318950)', diff saved to https://phabricator.wikimedia.org/P37686 and previous config saved to /var/cache/conftool/dbconfig/20221102-120013-ladsgroup.json [12:00:44] (03CR) 10Vgutierrez: [C: 04-1] prometheus: Rename ats_ metrics to trafficserver_ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [12:01:36] (03PS1) 10Marostegui: Revert "db-production: Disable writes one es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852168 [12:01:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318955)', diff saved to https://phabricator.wikimedia.org/P37687 and previous config saved to /var/cache/conftool/dbconfig/20221102-120157-ladsgroup.json [12:02:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37688 and previous config saved to 
/var/cache/conftool/dbconfig/20221102-120209-ladsgroup.json [12:02:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add some weight to es4 master', diff saved to https://phabricator.wikimedia.org/P37689 and previous config saved to /var/cache/conftool/dbconfig/20221102-120233-marostegui.json [12:02:48] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:03:17] (03CR) 10Marostegui: [C: 03+2] Revert "db-production: Disable writes one es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852168 (owner: 10Marostegui) [12:03:19] (03PS1) 10Marostegui: es1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/852164 [12:03:34] (03PS1) 10Hnowlan: api-gateway: expose restbase /api/ endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) [12:04:15] (03Merged) 10jenkins-bot: Revert "db-production: Disable writes one es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852168 (owner: 10Marostegui) [12:04:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37690 and previous config saved to /var/cache/conftool/dbconfig/20221102-120436-ladsgroup.json [12:04:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:04:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:04:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:04:55] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:04:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:05:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T318955)', diff saved to https://phabricator.wikimedia.org/P37691 and previous config saved to /var/cache/conftool/dbconfig/20221102-120505-ladsgroup.json [12:05:09] (03CR) 10Marostegui: [C: 03+2] es1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/852164 (owner: 10Marostegui) [12:05:45] !log marostegui@deploy1002 Backport cancelled. [12:05:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852168 (owner: 10Marostegui) [12:06:16] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:852168|Revert "db-production: Disable writes one es4"]] [12:06:40] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:852168|Revert "db-production: Disable writes one es4"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [12:07:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321123)', diff saved to https://phabricator.wikimedia.org/P37692 and previous config saved to /var/cache/conftool/dbconfig/20221102-120742-marostegui.json [12:07:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [12:07:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1185.eqiad.wmnet with reason: Maintenance [12:08:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T321123)', diff saved to https://phabricator.wikimedia.org/P37693 and previous config saved to /var/cache/conftool/dbconfig/20221102-120805-marostegui.json [12:08:07] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - 
https://phabricator.wikimedia.org/T320955 (10Volans) >>! In T320955#8351639, @wiki_willy wrote: > For everything that gets deleted in Netbox, is there any feature or anything that could pull that information upon delet... [12:09:17] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:10:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:10:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321123)', diff saved to https://phabricator.wikimedia.org/P37694 and previous config saved to /var/cache/conftool/dbconfig/20221102-121020-marostegui.json [12:10:54] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:852168|Revert "db-production: Disable writes one es4"]] (duration: 04m 37s) [12:11:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:11:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:11:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:15:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P37695 and previous config saved to /var/cache/conftool/dbconfig/20221102-121521-ladsgroup.json [12:15:44] (03PS2) 10Hnowlan: thumbor: don't manage thumbor.key within Helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/851609 (https://phabricator.wikimedia.org/T233196) [12:15:52] 10SRE, 10Domains: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10jbond) 05Open→03Resolved a:03jbond This task dosen't seem actionable and there have been no updates for some time, as such im going to close this but please feel free to update if there is any updat... 
[12:16:12] SRE, DBA, Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (Marostegui) p:Triage→Medium Do you have any expected size/growth for those databases? Amount of writes/reads (roughly)? Also, the hosts should be able to connect... [12:16:50] (PS1) Marostegui: Revert "es1020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/852169 [12:17:04] (CR) Muehlenhoff: [C: +2] Set profile::contacts::role_contacts for three additional dumps roles [puppet] - https://gerrit.wikimedia.org/r/852137 (owner: Muehlenhoff) [12:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37696 and previous config saved to /var/cache/conftool/dbconfig/20221102-121704-ladsgroup.json [12:17:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37697 and previous config saved to /var/cache/conftool/dbconfig/20221102-121716-ladsgroup.json [12:17:54] (CR) Marostegui: [C: +2] Revert "es1020: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/852169 (owner: Marostegui) [12:18:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P37698 and previous config saved to /var/cache/conftool/dbconfig/20221102-121812-root.json [12:18:49] moritzm: ok to merge? [12:19:01] please do! [12:19:06] done! [12:21:56] SRE, DBA, Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (SLyngshede-WMF) Growth is expected to be very low, as in 10 - 20 MB per month at the most. Similarly writes/reads will be pretty low as well as users will only need to in...
[12:22:38] SRE, DBA, Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (Marostegui) That seems fine indeed. How many users would you need? One with all privileges? one for writes and another one for reads? [12:23:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318955)', diff saved to https://phabricator.wikimedia.org/P37699 and previous config saved to /var/cache/conftool/dbconfig/20221102-122356-ladsgroup.json [12:24:04] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:24:39] SRE, DBA, Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (SLyngshede-WMF) Just one user with all privileges. The application is based on Django and will need to be able to manage schema migration internally. I don't think it mak... [12:24:50] (PS1) Clément Goubert: admin: Remove stale mwdebug stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/852186 (https://phabricator.wikimedia.org/T321201) [12:25:03] SRE, DBA, Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (SLyngshede-WMF) But different users for the production and stage databases. [12:25:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P37700 and previous config saved to /var/cache/conftool/dbconfig/20221102-122529-marostegui.json [12:25:57] SRE, DBA, Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (Marostegui) >>! In T320426#8362747, @SLyngshede-WMF wrote: > But different users for the production and stage databases. That makes sense. >>! In T320426#8362744, @...
[12:27:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (3) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P37701 and previous config saved to /var/cache/conftool/dbconfig/20221102-123029-ladsgroup.json [12:30:41] 10SRE, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10jbond) 05Open→03Resolved a:03jbond >>! In T200690#8265133, @dancy wrote: > @Tgr Can you confirm that this is still a problem? As there has been no response and the fact this task is now ~4 years old im g... 
[12:32:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37704 and previous config saved to /var/cache/conftool/dbconfig/20221102-123213-ladsgroup.json [12:32:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37705 and previous config saved to /var/cache/conftool/dbconfig/20221102-123224-ladsgroup.json [12:33:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 3%: After reboot', diff saved to https://phabricator.wikimedia.org/P37706 and previous config saved to /var/cache/conftool/dbconfig/20221102-123319-root.json [12:36:02] 10SRE, 10DBA, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10SLyngshede-WMF) The documentation mostly says "all", but GRANT CREATE, ALTER, INDEX, SELECT, UPDATE, INSERT, DELETE, REFERENCES should be the minimum. Django do prefer... [12:36:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:37] 10SRE: Data retention: revise audit bash scripts - https://phabricator.wikimedia.org/T111021 (10LSobanski) Resolving after a conversation with @MoritzMuehlenhoff and @ArielGlenn. This task is old enough to not be actionable in its current state. If you feel like this work still needs to happen please reach out t... [12:36:54] 10SRE: Retention auditing: clean up rules db contents and use - https://phabricator.wikimedia.org/T111020 (10LSobanski) Resolving after a conversation with @MoritzMuehlenhoff and @ArielGlenn. This task is old enough to not be actionable in its current state. If you feel like this work still needs to happen pleas... 
[12:37:07] 10SRE: Update Server Access Responsibilities document for Data Retention policy - https://phabricator.wikimedia.org/T83525 (10LSobanski) Resolving after a conversation with @MoritzMuehlenhoff and @ArielGlenn. This task is old enough to not be actionable in its current state. If you feel like this work still need... [12:37:11] 10SRE: Implement MOTD warning for handling private data for shell users on (all?) systems - https://phabricator.wikimedia.org/T83527 (10LSobanski) Resolving after a conversation with @MoritzMuehlenhoff and @ArielGlenn. This task is old enough to not be actionable in its current state. If you feel like this work... [12:37:25] 10SRE: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066 (10LSobanski) Resolving after a conversation with @MoritzMuehlenhoff and @ArielGlenn. This task is old enough to not be actionable in its current state. If you feel like this work still needs to happen please reach out... [12:37:27] 10SRE, 10audits-data-retention: Implement Data Retention Guidelines - https://phabricator.wikimedia.org/T83531 (10LSobanski) Resolving after a conversation with @MoritzMuehlenhoff and @ArielGlenn. This task is old enough to not be actionable in its current state. If you feel like this work still needs to happe... [12:37:31] 10SRE, 10audits-data-retention: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839 (10LSobanski) Resolving after a conversation with @MoritzMuehlenhoff and @ArielGlenn. This task is old enough to not be actionable in its current state. If you feel like this work sti... 
[12:38:23] 10SRE: Data retention: revise audit bash scripts - https://phabricator.wikimedia.org/T111021 (10LSobanski) 05Open→03Resolved [12:38:25] 10SRE: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066 (10LSobanski) [12:38:33] 10SRE: Retention auditing: clean up rules db contents and use - https://phabricator.wikimedia.org/T111020 (10LSobanski) 05Open→03Resolved [12:38:35] 10SRE: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066 (10LSobanski) [12:38:42] 10SRE: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066 (10LSobanski) 05Open→03Resolved [12:38:44] 10SRE, 10audits-data-retention: Implement Data Retention Guidelines - https://phabricator.wikimedia.org/T83531 (10LSobanski) [12:38:52] 10SRE, 10audits-data-retention: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839 (10LSobanski) 05Open→03Resolved [12:38:59] 10SRE: Update Server Access Responsibilities document for Data Retention policy - https://phabricator.wikimedia.org/T83525 (10LSobanski) 05Open→03Resolved [12:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37707 and previous config saved to /var/cache/conftool/dbconfig/20221102-123906-ladsgroup.json [12:39:08] 10SRE: Implement MOTD warning for handling private data for shell users on (all?) 
systems - https://phabricator.wikimedia.org/T83527 (10LSobanski) 05Open→03Resolved [12:39:10] 10SRE, 10audits-data-retention: Implement Data Retention Guidelines - https://phabricator.wikimedia.org/T83531 (10LSobanski) [12:39:31] 10SRE, 10audits-data-retention: Implement Data Retention Guidelines - https://phabricator.wikimedia.org/T83531 (10LSobanski) 05Open→03Resolved [12:40:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P37708 and previous config saved to /var/cache/conftool/dbconfig/20221102-124037-marostegui.json [12:45:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318950)', diff saved to https://phabricator.wikimedia.org/P37709 and previous config saved to /var/cache/conftool/dbconfig/20221102-124537-ladsgroup.json [12:45:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:45:44] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:45:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:46:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T318950)', diff saved to https://phabricator.wikimedia.org/P37710 and previous config saved to /var/cache/conftool/dbconfig/20221102-124602-ladsgroup.json [12:46:17] 10SRE, 10DBA, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10Marostegui) That's ok - any preferred database(s) and user(s) name? 
[12:47:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318955)', diff saved to https://phabricator.wikimedia.org/P37711 and previous config saved to /var/cache/conftool/dbconfig/20221102-124720-ladsgroup.json [12:47:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:47:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37712 and previous config saved to /var/cache/conftool/dbconfig/20221102-124732-ladsgroup.json [12:47:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:47:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:47:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37713 and previous config saved to /var/cache/conftool/dbconfig/20221102-124743-ladsgroup.json [12:47:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:47:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:47:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37714 and previous config saved to /var/cache/conftool/dbconfig/20221102-124754-ladsgroup.json [12:48:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318950)', diff saved to https://phabricator.wikimedia.org/P37715 and previous config saved to /var/cache/conftool/dbconfig/20221102-124812-ladsgroup.json [12:48:23] (03PS2) 10Ssingh: Release 9.1.3-1wm3 [debs/trafficserver] - 
10https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) [12:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P37716 and previous config saved to /var/cache/conftool/dbconfig/20221102-124824-root.json [12:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37717 and previous config saved to /var/cache/conftool/dbconfig/20221102-125001-ladsgroup.json [12:50:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37718 and previous config saved to /var/cache/conftool/dbconfig/20221102-125009-ladsgroup.json [12:51:27] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:51:35] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:54:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37719 and previous config saved to /var/cache/conftool/dbconfig/20221102-125415-ladsgroup.json [12:54:30] 10SRE-swift-storage, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 4 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245 (10jbond) [12:54:54] 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Kelson) If you can delete the whole thread 
https://lists.wikimedia.org/hyperkitty/list/wikifr-l@lists.wikimedia.org/thread/J2FU23D4C5ERWIK2LWBYUBTYK3O6KP6Y/, then problem would be solved. It's a pity... [12:55:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321123)', diff saved to https://phabricator.wikimedia.org/P37720 and previous config saved to /var/cache/conftool/dbconfig/20221102-125544-marostegui.json [12:55:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:55:50] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:56:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:56:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T321123)', diff saved to https://phabricator.wikimedia.org/P37721 and previous config saved to /var/cache/conftool/dbconfig/20221102-125607-marostegui.json [12:57:32] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10jbond) [12:58:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321123)', diff saved to https://phabricator.wikimedia.org/P37722 and previous config saved to /var/cache/conftool/dbconfig/20221102-125840-marostegui.json [12:59:14] 10Puppet, 10Infrastructure-Foundations, 10WMF-General-or-Unknown, 10WMF-Legal, and 3 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10jbond) [12:59:22] !log draining ganeti1025 for eventual reimage T311687 [12:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:26] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [12:59:35] (03PS4) 10Clare Ming: testwiki: Add mediawiki.visual_editor_feature_use 
stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T1300) [13:00:05] cjming: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:35] I'm happy to self-deploy o/ [13:02:16] cjming: go ahead [13:02:30] cool - thanks [13:02:31] (I’m making lunch, so I’m not in a good position to deploy myself ^^) [13:03:05] 10SRE, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10jbond) Is there a more specific tag we can use for this instead of SRE? perhaps `serviceops-collab`? 
[13:03:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P37723 and previous config saved to /var/cache/conftool/dbconfig/20221102-130322-ladsgroup.json [13:03:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P37724 and previous config saved to /var/cache/conftool/dbconfig/20221102-130331-root.json [13:03:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [13:04:26] (03Merged) 10jenkins-bot: testwiki: Add mediawiki.visual_editor_feature_use stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [13:04:47] !log cjming@deploy1002 Started scap: Backport for [[gerrit:851723|testwiki: Add mediawiki.visual_editor_feature_use stream (T309602)]] [13:04:53] T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602 [13:05:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37725 and previous config saved to /var/cache/conftool/dbconfig/20221102-130509-ladsgroup.json [13:05:12] !log cjming@deploy1002 cjming and cjming: Backport for [[gerrit:851723|testwiki: Add mediawiki.visual_editor_feature_use stream (T309602)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:05:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37726 and previous config saved to /var/cache/conftool/dbconfig/20221102-130518-ladsgroup.json [13:06:15] PROBLEM - Check systemd state on 
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:19] (03PS2) 10Muehlenhoff: kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842761 (https://phabricator.wikimedia.org/T308013) [13:07:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:08:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:08:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:09:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318955)', diff saved to https://phabricator.wikimedia.org/P37727 and previous config saved to /var/cache/conftool/dbconfig/20221102-130923-ladsgroup.json [13:09:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:09:36] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:09:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:09:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:09:44] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851723|testwiki: Add mediawiki.visual_editor_feature_use stream (T309602)]] (duration: 04m 56s) [13:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T318955)', diff saved to https://phabricator.wikimedia.org/P37728 and previous config saved to /var/cache/conftool/dbconfig/20221102-130948-ladsgroup.json [13:11:16] T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602 [13:12:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 
(T318955)', diff saved to https://phabricator.wikimedia.org/P37729 and previous config saved to /var/cache/conftool/dbconfig/20221102-131159-ladsgroup.json [13:12:31] all done with my patch - no others in the queue [13:12:33] (03PS3) 10Vgutierrez: trafficserver: Clean up after ATS 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) [13:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P37730 and previous config saved to /var/cache/conftool/dbconfig/20221102-131348-marostegui.json [13:14:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:14:54] !log UTC afternoon backport+config window done [13:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:58] thanks cjming :) [13:15:10] np! [13:15:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:15:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:16:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:16:33] Hmm I hope the cronjob isn't unmaking the scap deployments for mw-debug k8s [13:18:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P37731 and previous config saved to /var/cache/conftool/dbconfig/20221102-131830-ladsgroup.json [13:18:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P37732 and previous config saved to /var/cache/conftool/dbconfig/20221102-131837-root.json [13:20:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37733 and previous config saved to 
/var/cache/conftool/dbconfig/20221102-132017-ladsgroup.json [13:20:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37734 and previous config saved to /var/cache/conftool/dbconfig/20221102-132025-ladsgroup.json [13:20:29] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:22:04] 10SRE, 10DBA, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10SLyngshede-WMF) I'm thinking something like "idm" or "identitymanager" [13:22:45] !log disable puppet on A:cp before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/850087 - T321776 [13:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:49] T321776: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 [13:22:51] (03PS1) 10Muehlenhoff: Add component/puppetdb7 for bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/852190 (https://phabricator.wikimedia.org/T321783) [13:23:53] 10SRE, 10DBA, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10Marostegui) We should probably go for something like: `idm` `idm_staging` And same for the database name. 
[13:24:21] 10SRE, 10DBA, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10SLyngshede-WMF) Seems ideal :-) [13:27:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37735 and previous config saved to /var/cache/conftool/dbconfig/20221102-132707-ladsgroup.json [13:27:48] 10SRE, 10DBA, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10Marostegui) Cool, I will try to get it in place tomorrow :) [13:27:55] 10SRE, 10DBA, 10Infrastructure-Foundations: Figure out where/how to store IDM internal data - https://phabricator.wikimedia.org/T320426 (10Marostegui) a:03Marostegui [13:28:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P37736 and previous config saved to /var/cache/conftool/dbconfig/20221102-132855-marostegui.json [13:29:01] (03CR) 10Muehlenhoff: [C: 03+2] kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842761 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:29:29] (03PS2) 10Muehlenhoff: Add component/puppetdb7 for bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/852190 (https://phabricator.wikimedia.org/T321783) [13:30:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:38] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Clean up after ATS 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) (owner: 10Vgutierrez) [13:31:25] (03PS1) 10Volans: kafkatee::webrequests::ops: install stats script [puppet] - 10https://gerrit.wikimedia.org/r/852192 [13:33:23] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: 
OK - failed 17 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:33:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318950)', diff saved to https://phabricator.wikimedia.org/P37737 and previous config saved to /var/cache/conftool/dbconfig/20221102-133338-ladsgroup.json [13:33:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:33:40] (03CR) 10CI reject: [V: 04-1] kafkatee::webrequests::ops: install stats script [puppet] - 10https://gerrit.wikimedia.org/r/852192 (owner: 10Volans) [13:33:43] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:33:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P37738 and previous config saved to /var/cache/conftool/dbconfig/20221102-133343-root.json [13:33:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:34:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T318950)', diff saved to https://phabricator.wikimedia.org/P37739 and previous config saved to /var/cache/conftool/dbconfig/20221102-133402-ladsgroup.json [13:34:21] !log vgutierrez@apt1001:~$ sudo -i reprepro clearvanished - T321776 [13:34:47] !log vgutierrez@apt1001:~$ sudo -i reprepro --delete clearvanished - T321776 [13:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:15] T321776: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 [13:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:18] that's some lag stashbot [13:35:26] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37740 and previous config saved to /var/cache/conftool/dbconfig/20221102-133526-ladsgroup.json [13:35:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [13:35:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318950)', diff saved to https://phabricator.wikimedia.org/P37741 and previous config saved to /var/cache/conftool/dbconfig/20221102-133533-ladsgroup.json [13:35:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [13:35:42] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:35:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [13:35:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [13:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T318955)', diff saved to https://phabricator.wikimedia.org/P37742 and previous config saved to /var/cache/conftool/dbconfig/20221102-133549-ladsgroup.json [13:35:53] (03PS2) 10Volans: kafkatee::webrequests::ops: install stats script [puppet] - 10https://gerrit.wikimedia.org/r/852192 [13:35:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T318950)', diff saved to https://phabricator.wikimedia.org/P37743 and previous config saved to /var/cache/conftool/dbconfig/20221102-133559-ladsgroup.json [13:36:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:36:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:36:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T318605)', diff saved to https://phabricator.wikimedia.org/P37744 and previous config saved to /var/cache/conftool/dbconfig/20221102-133637-ladsgroup.json [13:37:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318605)', diff saved to https://phabricator.wikimedia.org/P37745 and previous config saved to /var/cache/conftool/dbconfig/20221102-133733-ladsgroup.json [13:37:34] !log uploaded trafficserver 9.1.3-1wm2 to apt.wm.o (buster-wikimedia) - T321776 [13:38:01] (03CR) 10CI reject: [V: 04-1] kafkatee::webrequests::ops: install stats script [puppet] - 10https://gerrit.wikimedia.org/r/852192 (owner: 10Volans) [13:38:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318955)', diff saved to https://phabricator.wikimedia.org/P37746 and previous config saved to /var/cache/conftool/dbconfig/20221102-133807-ladsgroup.json [13:38:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318950)', diff saved to https://phabricator.wikimedia.org/P37747 and previous config saved to /var/cache/conftool/dbconfig/20221102-133819-ladsgroup.json [13:38:31] vgutierrez: be nice to the bot! :P [13:38:42] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:38:45] it didn't log my last one... 
[13:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:50] or at least not yet [13:39:05] dunno if Amir1 is flooding it with all those log messages [13:39:08] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:39:21] (03CR) 10Herron: [C: 03+1] dispatch: move frontend to its own module [puppet] - 10https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [13:39:52] (03PS3) 10Muehlenhoff: Add component/puppetdb7 for bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/852190 (https://phabricator.wikimedia.org/T321783) [13:40:08] (03PS3) 10Volans: kafkatee::webrequests::ops: install stats script [puppet] - 10https://gerrit.wikimedia.org/r/852192 [13:40:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318950)', diff saved to https://phabricator.wikimedia.org/P37748 and previous config saved to /var/cache/conftool/dbconfig/20221102-134012-ladsgroup.json [13:42:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37749 and previous config saved to /var/cache/conftool/dbconfig/20221102-134216-ladsgroup.json [13:42:17] (03CR) 10CI reject: [V: 04-1] kafkatee::webrequests::ops: install stats script [puppet] - 10https://gerrit.wikimedia.org/r/852192 (owner: 10Volans) [13:43:14] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] acme_chief: use force to absent cert directory [puppet] - 10https://gerrit.wikimedia.org/r/852130 (owner: 10Filippo Giunchedi) [13:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321123)', diff saved to https://phabricator.wikimedia.org/P37750 and previous config saved to /var/cache/conftool/dbconfig/20221102-134404-marostegui.json [13:44:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 
dbstore1003.eqiad.wmnet with reason: Maintenance [13:44:20] (03CR) 10Filippo Giunchedi: [C: 03+2] wikimedia.org: remove smokeping.w.o [dns] - 10https://gerrit.wikimedia.org/r/852132 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:44:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:44:24] (03PS2) 10Filippo Giunchedi: wikimedia.org: remove smokeping.w.o [dns] - 10https://gerrit.wikimedia.org/r/852132 (https://phabricator.wikimedia.org/T169860) [13:44:24] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:44:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:44:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:44:44] (03PS4) 10Volans: kafkatee::webrequests::ops: install stats script [puppet] - 10https://gerrit.wikimedia.org/r/852192 [13:45:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:45:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T321123)', diff saved to https://phabricator.wikimedia.org/P37751 and previous config saved to /var/cache/conftool/dbconfig/20221102-134527-marostegui.json [13:47:10] (03CR) 10Muehlenhoff: [C: 03+2] Add component/puppetdb7 for bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/852190 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [13:47:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P37752 and previous config saved to /var/cache/conftool/dbconfig/20221102-134758-marostegui.json [13:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P37753 and previous config saved to /var/cache/conftool/dbconfig/20221102-134849-root.json [13:49:30] (PS1) JMeybohm: Rename ml_k8s staging roles to match naming scheme [labs/private] - https://gerrit.wikimedia.org/r/852196 [13:50:12] (CR) Volans: "This is a small script that uses gjson (the python version) to aggregate some data from sampled JSON logs." [puppet] - https://gerrit.wikimedia.org/r/852192 (owner: Volans) [13:50:28] (CR) JMeybohm: "Depends on a private change like https://gerrit.wikimedia.org/r/c/labs/private/+/852196" [puppet] - https://gerrit.wikimedia.org/r/852158 (owner: JMeybohm) [13:50:47] (CR) Ssingh: "recheck" [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [13:51:56] (CR) Filippo Giunchedi: prometheus: probe SSH on mgmt network (1 comment) [puppet] - https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [13:52:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P37754 and previous config saved to /var/cache/conftool/dbconfig/20221102-135240-ladsgroup.json [13:53:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37755 and previous config saved to /var/cache/conftool/dbconfig/20221102-135315-ladsgroup.json [13:53:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P37756 and previous config saved to
/var/cache/conftool/dbconfig/20221102-135328-ladsgroup.json [13:53:59] !log import puppetdb 7.11.2-1 to component/puppetdb7 for bookworm-wikimedia T321783 [13:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:05] T321783: Setup an initial bookworm host with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [13:54:45] !log re-enabled puppet in A:cp - T321776 [13:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:58] T321776: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 [13:55:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P37757 and previous config saved to /var/cache/conftool/dbconfig/20221102-135521-ladsgroup.json [13:56:00] (PS1) Milimetric: aqs: bump mediawiki history [puppet] - https://gerrit.wikimedia.org/r/852197 [13:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T318955)', diff saved to https://phabricator.wikimedia.org/P37758 and previous config saved to /var/cache/conftool/dbconfig/20221102-135723-ladsgroup.json [13:57:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:57:31] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:57:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:57:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37759 and previous config saved to /var/cache/conftool/dbconfig/20221102-135746-ladsgroup.json [14:01:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318605)', diff saved to https://phabricator.wikimedia.org/P37760
and previous config saved to /var/cache/conftool/dbconfig/20221102-140133-ladsgroup.json [14:01:47] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:01:57] SRE, Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (Ladsgroup) Done. Deleted. [14:03:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P37761 and previous config saved to /var/cache/conftool/dbconfig/20221102-140307-marostegui.json [14:03:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P37762 and previous config saved to /var/cache/conftool/dbconfig/20221102-140355-root.json [14:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P37763 and previous config saved to /var/cache/conftool/dbconfig/20221102-140749-ladsgroup.json [14:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37764 and previous config saved to /var/cache/conftool/dbconfig/20221102-140822-ladsgroup.json [14:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P37765 and previous config saved to /var/cache/conftool/dbconfig/20221102-140837-ladsgroup.json [14:09:58] (PS1) Vgutierrez: acme-chief: Add wikimediaenterprise.com to ncredir certs [puppet] - https://gerrit.wikimedia.org/r/852201 (https://phabricator.wikimedia.org/T321804) [14:10:00] (PS1) Vgutierrez: ncredir: Add wikimediaenterprise.com rewrite rule [puppet] - https://gerrit.wikimedia.org/r/852202 (https://phabricator.wikimedia.org/T321804) [14:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db1180', diff saved to https://phabricator.wikimedia.org/P37766 and previous config saved to /var/cache/conftool/dbconfig/20221102-141029-ladsgroup.json [14:14:28] SRE, Traffic, Patch-For-Review: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 (Vgutierrez) Open→Resolved a:Vgutierrez ` vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'apt-cache policy trafficserver' 95 hosts will be targeted: cp[2027-2042].codfw.wmnet,cp[6001-6... [14:14:34] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (Vgutierrez) [14:15:20] (CR) Vgutierrez: [C: +2] acme-chief: Add wikimediaenterprise.com to ncredir certs [puppet] - https://gerrit.wikimedia.org/r/852201 (https://phabricator.wikimedia.org/T321804) (owner: Vgutierrez) [14:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P37767 and previous config saved to /var/cache/conftool/dbconfig/20221102-141640-ladsgroup.json [14:18:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P37768 and previous config saved to /var/cache/conftool/dbconfig/20221102-141815-marostegui.json [14:18:18] (CR) Filippo Giunchedi: [V: +1 C: +2] dispatch: move frontend to its own module [puppet] - https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [14:19:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37769 and previous config saved to /var/cache/conftool/dbconfig/20221102-141922-ladsgroup.json [14:19:32] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:19:57] (PS1) Muehlenhoff: puppetdb:
On bookworm install from component/puppetdb7 [puppet] - https://gerrit.wikimedia.org/r/852205 (https://phabricator.wikimedia.org/T321783) [14:20:47] (PS2) Muehlenhoff: puppetdb: On bookworm install from component/puppetdb7 [puppet] - https://gerrit.wikimedia.org/r/852205 (https://phabricator.wikimedia.org/T321783) [14:21:23] (CR) CI reject: [V: -1] puppetdb: On bookworm install from component/puppetdb7 [puppet] - https://gerrit.wikimedia.org/r/852205 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [14:22:21] (PS1) Ssingh: team-traffic: drop VarnishTrafficDrop and HAProxyEdgeTrafficDrop [alerts] - https://gerrit.wikimedia.org/r/852206 [14:22:43] SRE, Traffic, Patch-For-Review: Enterprise redirect for wikimediaenterprise.com to enterprise.wikimedia.com - https://phabricator.wikimedia.org/T321804 (Vgutierrez) Open→Stalled p:Triage→Medium new certs have been issued for ncredir to handle wikimediaenterprise.com traffic, those wil... [14:22:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318605)', diff saved to https://phabricator.wikimedia.org/P37770 and previous config saved to /var/cache/conftool/dbconfig/20221102-142258-ladsgroup.json [14:23:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:23:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:23:16] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:23:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318955)', diff saved to https://phabricator.wikimedia.org/P37771 and previous config saved to /var/cache/conftool/dbconfig/20221102-142331-ladsgroup.json [14:23:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance
db2180 (T318950)', diff saved to https://phabricator.wikimedia.org/P37772 and previous config saved to /var/cache/conftool/dbconfig/20221102-142345-ladsgroup.json [14:23:52] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:25:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318950)', diff saved to https://phabricator.wikimedia.org/P37773 and previous config saved to /var/cache/conftool/dbconfig/20221102-142540-ladsgroup.json [14:25:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:25:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:26:00] (PS4) Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [14:26:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T318950)', diff saved to https://phabricator.wikimedia.org/P37774 and previous config saved to /var/cache/conftool/dbconfig/20221102-142605-ladsgroup.json [14:27:00] (CR) Vgutierrez: [C: -1] Release 9.1.3-1wm3 (1 comment) [debs/trafficserver] - https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [14:27:05] (PS1) JHathaway: aux-k8s: add dhcp config for workers [puppet] - https://gerrit.wikimedia.org/r/852207 (https://phabricator.wikimedia.org/T321137) [14:28:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318950)', diff saved to https://phabricator.wikimedia.org/P37775 and previous config saved to /var/cache/conftool/dbconfig/20221102-142818-ladsgroup.json [14:28:38] (CR) JHathaway: [C: +2] aux-k8s: add dhcp config for workers [puppet] - https://gerrit.wikimedia.org/r/852207
(https://phabricator.wikimedia.org/T321137) (owner: JHathaway) [14:28:59] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:31:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P37776 and previous config saved to /var/cache/conftool/dbconfig/20221102-143150-ladsgroup.json [14:32:17] (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/852205 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [14:32:32] (CR) Muehlenhoff: Release 9.1.3-1wm3 (1 comment) [debs/trafficserver] - https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [14:33:18] (CR) Vgutierrez: [C: +1] team-traffic: drop VarnishTrafficDrop and HAProxyEdgeTrafficDrop [alerts] - https://gerrit.wikimedia.org/r/852206 (owner: Ssingh) [14:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321123)', diff saved to https://phabricator.wikimedia.org/P37777 and previous config saved to /var/cache/conftool/dbconfig/20221102-143324-marostegui.json [14:33:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:33:31] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:33:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:33:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T321123)', diff saved to https://phabricator.wikimedia.org/P37778 and previous config saved to /var/cache/conftool/dbconfig/20221102-143350-marostegui.json [14:34:22] !log elukey@deploy1002 helmfile
[ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:34:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37779 and previous config saved to /var/cache/conftool/dbconfig/20221102-143430-ladsgroup.json [14:34:53] (PS3) Ssingh: Release 9.1.3-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) [14:35:15] (CR) Ssingh: "Updated debian/control and removed debian/compat." [debs/trafficserver] - https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [14:35:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321123)', diff saved to https://phabricator.wikimedia.org/P37780 and previous config saved to /var/cache/conftool/dbconfig/20221102-143522-marostegui.json [14:37:25] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [14:37:29] (PS2) Ssingh: team-traffic: drop VarnishTrafficDrop and HAProxyEdgeTrafficDrop [alerts] - https://gerrit.wikimedia.org/r/852206 (https://phabricator.wikimedia.org/T322220) [14:37:31] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/trafficserver] - https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [14:38:32] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:41:29] (CR) Vgutierrez: [C: +1] Release 9.1.3-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/849646 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [14:41:39] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [14:42:29] (PS2) Ssingh: Release 6.0.10-1wm2 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) [14:43:26] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P37781 and previous config saved to /var/cache/conftool/dbconfig/20221102-144325-ladsgroup.json [14:43:32] (PS1) Clément Goubert: apple-search: Remove DNS records [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [14:43:40] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [14:43:50] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [14:44:19] (CR) CI reject: [V: -1] apple-search: Remove DNS records [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert) [14:44:23] (PS1) JHathaway: aux-k8s: drop raid config for workers [puppet] - https://gerrit.wikimedia.org/r/852209 (https://phabricator.wikimedia.org/T321137) [14:45:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:36] !log installing ffmpeg security updates on bullseye [14:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318605)', diff saved to https://phabricator.wikimedia.org/P37782 and previous config saved to /var/cache/conftool/dbconfig/20221102-144657-ladsgroup.json [14:46:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:47:04] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:47:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:47:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37783 and previous config saved to /var/cache/conftool/dbconfig/20221102-144719-ladsgroup.json [14:48:03] (PS2) Clément Goubert: apple-search: Remove DNS records [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [14:48:45] (CR) Elukey: [C: +1] Pin cert-manager and cfssl-issuer chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/838134 (https://phabricator.wikimedia.org/T310486) (owner: JMeybohm) [14:49:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37784 and previous config saved to /var/cache/conftool/dbconfig/20221102-144937-ladsgroup.json [14:49:59] (CR) Muehlenhoff: [C: +1] "Looks good, two nits inline, but feel free to ignore."
[debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [14:50:12] (CR) JHathaway: [C: +2] aux-k8s: drop raid config for workers [puppet] - https://gerrit.wikimedia.org/r/852209 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway) [14:50:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P37785 and previous config saved to /var/cache/conftool/dbconfig/20221102-145030-marostegui.json [14:51:10] (PS1) Clément Goubert: apple-search: Switch lvs state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) [14:51:14] (CR) Ottomata: [C: +2] aqs: bump mediawiki history [puppet] - https://gerrit.wikimedia.org/r/852197 (owner: Milimetric) [14:51:16] (CR) Elukey: "The names of the new templates have the .wikimedia.org suffix, never seen it elsewhere, do we want to keep it or am I missing some convent" [deployment-charts] - https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) (owner: JMeybohm) [14:51:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:31] (PS5) Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [14:51:54] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Appledora - https://phabricator.wikimedia.org/T322222 (MGerlach) [14:52:01] (CR) Ssingh: Release 6.0.10-1wm2 (2 comments) [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [14:52:40] (PS4) Jbond: thumbor/mwmaint: add periodic job to pull fc-list file
[puppet] - https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: Dzahn) [14:52:50] (PS1) Vgutierrez: cache::haproxy: Update cp305[12] to HAProxy 2.6 [puppet] - https://gerrit.wikimedia.org/r/852211 (https://phabricator.wikimedia.org/T321775) [14:52:58] (CR) Jbond: "i have gone through and addressed all the open comments, i think this is ready for another review" [puppet] - https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: Dzahn) [14:53:01] (CR) CI reject: [V: -1] thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: Dzahn) [14:53:59] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [14:54:16] (CR) Elukey: cfssl-issuer: Bump CRD chart version for cfssl-issuer update (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/838136 (owner: JMeybohm) [14:54:30] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37918/console" [puppet] - https://gerrit.wikimedia.org/r/852211 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez) [14:54:42] (PS2) Muehlenhoff: base: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850472 (https://phabricator.wikimedia.org/T308013) [14:54:58] (PS2) Clément Goubert: apple-search: Switch lvs state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) [14:55:20] (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Update cp305[12] to HAProxy 2.6 [puppet] - https://gerrit.wikimedia.org/r/852211 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez) [14:57:20] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37916/console" [puppet] - https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [14:58:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P37786 and previous config saved to /var/cache/conftool/dbconfig/20221102-145833-ladsgroup.json [14:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:59:24] Puppet, Infrastructure-Foundations, WMF-General-or-Unknown, WMF-Legal, and 3 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (MoritzMuehlenhoff) Open→Declined I'm resolving this task in favour of https://phabricator.wikimedia.org/T308013. We're not...
[15:01:48] SRE, Traffic, Patch-For-Review: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (Vgutierrez) [15:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:04:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318955)', diff saved to https://phabricator.wikimedia.org/P37787 and previous config saved to /var/cache/conftool/dbconfig/20221102-150444-ladsgroup.json [15:04:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:04:52] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:04:57] (CR) CI reject: [V: -1] Release 6.0.10-1wm2 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:05:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:05:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T318955)', diff saved to https://phabricator.wikimedia.org/P37788 and previous config saved to /var/cache/conftool/dbconfig/20221102-150508-ladsgroup.json [15:05:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P37789 and previous config saved to /var/cache/conftool/dbconfig/20221102-150538-marostegui.json [15:05:40] (CR) Ssingh: "Varnish tests failing, which is expected."
[debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:06:30] (PS1) Ssingh: Release 0.6.3 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) [15:07:50] (CR) CI reject: [V: -1] Release 0.6.3 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:09:04] (PS1) JHathaway: aux-k8s: drop raid config for workers, attempt two [puppet] - https://gerrit.wikimedia.org/r/852213 (https://phabricator.wikimedia.org/T321137) [15:10:08] (CR) JHathaway: [C: +2] aux-k8s: drop raid config for workers, attempt two [puppet] - https://gerrit.wikimedia.org/r/852213 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway) [15:13:05] (CR) Ssingh: "recheck" [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:13:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318950)', diff saved to https://phabricator.wikimedia.org/P37790 and previous config saved to /var/cache/conftool/dbconfig/20221102-151341-ladsgroup.json [15:13:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [15:13:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [15:13:57] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:14:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T318950)', diff saved to https://phabricator.wikimedia.org/P37791 and previous config saved to /var/cache/conftool/dbconfig/20221102-151403-ladsgroup.json [15:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318950)', diff saved to https://phabricator.wikimedia.org/P37792 and previous config saved to /var/cache/conftool/dbconfig/20221102-151613-ladsgroup.json [15:16:41] (PS2) Ssingh: Release 0.6.3 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) [15:17:55] (PS3) Jbond: puppetdb: On bookworm install from component/puppetdb7 [puppet] - https://gerrit.wikimedia.org/r/852205 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [15:17:57] (PS1) Jbond: rake_modules: only run module specific rake if we have files to run [puppet] - https://gerrit.wikimedia.org/r/852215 [15:18:47] (PS1) Muehlenhoff: Unroll role::insetup [puppet] - https://gerrit.wikimedia.org/r/852216 [15:18:50] (CR) Hnowlan: [C: +2] thumbor: don't manage thumbor.key within Helm [deployment-charts] - https://gerrit.wikimedia.org/r/851609 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [15:19:55] (CR) Hashar: [C: -1] ci: move lists of contint and zuul hosts to hieradata/common.yaml (3 comments) [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn) [15:20:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318955)', diff saved to https://phabricator.wikimedia.org/P37793 and previous config saved to /var/cache/conftool/dbconfig/20221102-152012-ladsgroup.json [15:20:20] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:20:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321123)', diff saved to https://phabricator.wikimedia.org/P37794 and previous config saved to /var/cache/conftool/dbconfig/20221102-152045-marostegui.json [15:20:48]
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:20:51] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:21:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:21:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:21:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [15:21:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T321123)', diff saved to https://phabricator.wikimedia.org/P37795 and previous config saved to /var/cache/conftool/dbconfig/20221102-152113-marostegui.json [15:21:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:19] (CR) Jbond: [C: +2] rake_modules: only run module specific rake if we have files to run [puppet] - https://gerrit.wikimedia.org/r/852215 (owner: Jbond) [15:21:38] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/852205 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [15:22:06] (CR) Muehlenhoff: [C: +1] "One thing to keep in mind for the puppetisation: Bullseye by default has Varnish 6.5.1, so you need to deploy the packages using a higher " [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/849644 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:22:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321123)', diff saved to https://phabricator.wikimedia.org/P37796 and previous config saved to
/var/cache/conftool/dbconfig/20221102-152244-marostegui.json [15:22:51] (Merged) jenkins-bot: thumbor: don't manage thumbor.key within Helm [deployment-charts] - https://gerrit.wikimedia.org/r/851609 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [15:23:11] (CR) Muehlenhoff: [C: +2] puppetdb: On bookworm install from component/puppetdb7 [puppet] - https://gerrit.wikimedia.org/r/852205 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [15:24:33] (CR) Muehlenhoff: [C: +2] base: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850472 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [15:25:01] (PS1) Jbond: rake_modules: only run module specific rake if we have files to run [puppet] - https://gerrit.wikimedia.org/r/852219 [15:25:58] (PS2) Jbond: rake_modules: only run module specific rake if we have files to run [puppet] - https://gerrit.wikimedia.org/r/852219 [15:26:11] (CR) Jbond: [C: +2] rake_modules: only run module specific rake if we have files to run [puppet] - https://gerrit.wikimedia.org/r/852219 (owner: Jbond) [15:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P37797 and previous config saved to /var/cache/conftool/dbconfig/20221102-153121-ladsgroup.json [15:31:32] (PS2) Elukey: ml-services: update docker images [deployment-charts] - https://gerrit.wikimedia.org/r/851093 (https://phabricator.wikimedia.org/T320374) [15:32:16] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/852158 (owner: JMeybohm) [15:34:33] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/852216 (owner: Muehlenhoff) [15:35:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37798 and previous config saved to
/var/cache/conftool/dbconfig/20221102-153519-ladsgroup.json [15:35:51] !log otto@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [15:36:45] (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline." [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:37:53] (CR) Muehlenhoff: [C: +2] Unroll role::insetup [puppet] - https://gerrit.wikimedia.org/r/852216 (owner: Muehlenhoff) [15:37:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P37799 and previous config saved to /var/cache/conftool/dbconfig/20221102-153754-marostegui.json [15:38:40] (PS3) Ssingh: Release 0.6.3 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) [15:38:49] (CR) Ssingh: Release 0.6.3 (1 comment) [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:43:00] (PS1) Muehlenhoff: Retire generic insetup role [puppet] - https://gerrit.wikimedia.org/r/852223 [15:43:55] (CR) Dzahn: [C: +1] rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - https://gerrit.wikimedia.org/r/850173 (owner: Jbond) [15:44:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:04] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (bking) Thanks jbond, these are all legitimate points and must be addressed before we start to consider Ansible. 
Here's what I have so far: > lets not conflate debians abi... [15:45:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:30] (CR) Jbond: [C: +1] "lgtm some comments inline but nothing blocking, thanks" [puppet] - https://gerrit.wikimedia.org/r/852192 (owner: Volans) [15:46:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P37800 and previous config saved to /var/cache/conftool/dbconfig/20221102-154628-ladsgroup.json [15:47:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37801 and previous config saved to /var/cache/conftool/dbconfig/20221102-154736-ladsgroup.json [15:47:51] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:49:40] (CR) Elukey: [C: +2] ml-services: update docker images [deployment-charts] - https://gerrit.wikimedia.org/r/851093 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [15:50:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37802 and previous config saved to /var/cache/conftool/dbconfig/20221102-155026-ladsgroup.json [15:50:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:52:33] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:53:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P37803 and previous config saved to /var/cache/conftool/dbconfig/20221102-155302-marostegui.json [15:53:22] !log restarting blazegraph on wdqs1007 (BlazegraphFreeAllocatorsDecreasingRapidly) [15:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:55:45] (PS1) Ssingh: Release 2.0.0-3 [debs/file-read-backwards] (debian) - https://gerrit.wikimedia.org/r/852234 (https://phabricator.wikimedia.org/T321309) [15:57:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:59:26] (CR) Vgutierrez: [C: +1] Release 0.6.3 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [15:59:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:00:04] Daimona, HouseOfM, cmelo, and Amir1: Your horoscope predicts another unfortunate Create schema for the CampaignEvents extension deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T1600). [16:00:34] o/ [16:00:40] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/file-read-backwards] (debian) - https://gerrit.wikimedia.org/r/852234 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [16:00:47] o/ [16:00:50] I need a minute [16:01:03] (PS1) Urbanecm: SpecialManageMentors: Do not include explanatory text on transclusion [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/852172 (https://phabricator.wikimedia.org/T321773) [16:01:05] o/ [16:01:11] (CR) Muehlenhoff: [C: +1] Release 0.6.3 [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/852212 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [16:01:17] (PS1) Urbanecm: SpecialManageMentors: Do not include explanatory text on transclusion [extensions/GrowthExperiments] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/852173 (https://phabricator.wikimedia.org/T321773) [16:01:19] Sure [16:01:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318950)', diff saved to https://phabricator.wikimedia.org/P37804 and previous config saved to /var/cache/conftool/dbconfig/20221102-160136-ladsgroup.json [16:01:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:01:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:01:42] T318950: Fix renamed indexes of flaggedrevs table in 
production - https://phabricator.wikimedia.org/T318950 [16:01:44] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (LSobanski) @Cmjohnson what's the expected ETA for this host? Asking as contint1001 seems to be nearing the end of its life and we'd like to move ahead with the replacement as quick a... [16:02:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37805 and previous config saved to /var/cache/conftool/dbconfig/20221102-160243-ladsgroup.json [16:03:30] Daimona: (late?) fingers crossed on the CampaignsEvents project :) [16:03:53] Thank you :) Crossing my fingers as well. [16:04:03] (CR) Daniel Kinzler: api-gateway: expose restbase /api/ endpoint (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: Hnowlan) [16:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318955)', diff saved to https://phabricator.wikimedia.org/P37806 and previous config saved to /var/cache/conftool/dbconfig/20221102-160537-ladsgroup.json [16:05:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:05:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:05:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:06:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T318955)', diff saved to https://phabricator.wikimedia.org/P37807 and previous config saved to /var/cache/conftool/dbconfig/20221102-160600-ladsgroup.json [16:06:16] (CR) Dzahn: R:rsync::manifests::server::module: add type validation (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/850171 (owner: Jbond) [16:06:27] (CR) Dzahn: [C: +1] R:rsync::manifests::server::module: add type validation [puppet] - https://gerrit.wikimedia.org/r/850171 (owner: Jbond) [16:06:58] !log installing glibc security updates on buster [16:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:15] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 28787400 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:08:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321123)', diff saved to https://phabricator.wikimedia.org/P37808 and previous config saved to /var/cache/conftool/dbconfig/20221102-160809-marostegui.json [16:08:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [16:08:13] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:08:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [16:08:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37809 and previous config saved to /var/cache/conftool/dbconfig/20221102-160834-marostegui.json [16:08:46] (CR) Dzahn: [C: +1] rsync::server::module: drop auto_ferm_ipv6 parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850173 (owner: Jbond) [16:09:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 881168 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:09:26] (CR) Dzahn: [C: +1] "in today's gitlab IC meeting this has been discussed and there were no concerns to go ahead 
with it" [puppet] - https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: Jelto) [16:10:01] (CR) Dzahn: "thanks for doing all that:)" [puppet] - https://gerrit.wikimedia.org/r/829319 (owner: Zabe) [16:11:03] (CR) Dzahn: "Hi @Jbond boldly adding you here because you are currently working on rsync and I wonder what you think there." [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm) [16:11:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37810 and previous config saved to /var/cache/conftool/dbconfig/20221102-161104-marostegui.json [16:12:37] !log Creating schema for the CampaignEvents extension on testwiki, test2wiki and officewiki # T318595 [16:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:44] T318595: Create database schema for the CampaignEvents extension on testwiki, test2wiki, and officewiki - https://phabricator.wikimedia.org/T318595 [16:15:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P37811 and previous config saved to /var/cache/conftool/dbconfig/20221102-161753-ladsgroup.json [16:17:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [16:18:12] SRE-swift-storage, Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (taavi) [16:18:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[16:18:32] (PS2) JMeybohm: Rename ml_k8s staging roles to match naming scheme [puppet] - https://gerrit.wikimedia.org/r/852158 [16:18:34] (PS1) JMeybohm: Move kube-proxy config to file [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) [16:18:42] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [16:19:33] (CR) CI reject: [V: -1] Move kube-proxy config to file [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) (owner: JMeybohm) [16:19:48] (CR) Dzahn: [V: +1 C: +1] "thanks for deploying it:)" [puppet] - https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: Ahmon Dancy) [16:20:13] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [16:21:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:32] (PS2) JMeybohm: Move kube-proxy config to file [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) [16:21:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:21:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:22:11] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [16:22:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[16:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37812 and previous config saved to /var/cache/conftool/dbconfig/20221102-162320-ladsgroup.json [16:23:59] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:26:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:26:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P37813 and previous config saved to /var/cache/conftool/dbconfig/20221102-162614-marostegui.json [16:26:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:26:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T318605)', diff saved to https://phabricator.wikimedia.org/P37814 and previous config saved to /var/cache/conftool/dbconfig/20221102-162629-ladsgroup.json [16:26:40] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:27:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (3) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:28:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:29:50] (PS3) 
JMeybohm: Move kube-proxy config to file [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) [16:30:36] (PS1) Hnowlan: kask, thumbor: update invalid base requests [deployment-charts] - https://gerrit.wikimedia.org/r/852240 (https://phabricator.wikimedia.org/T233196) [16:31:13] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37920/console" [puppet] - https://gerrit.wikimedia.org/r/852237 (https://phabricator.wikimedia.org/T300499) (owner: JMeybohm) [16:31:51] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [16:32:13] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [16:33:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37815 and previous config saved to /var/cache/conftool/dbconfig/20221102-163300-ladsgroup.json [16:33:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:33:02] (CR) Hnowlan: api-gateway: expose restbase /api/ endpoint (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/852165 (https://phabricator.wikimedia.org/T322152) (owner: Hnowlan) [16:33:21] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[16:33:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:33:30] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:33:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T318605)', diff saved to https://phabricator.wikimedia.org/P37816 and previous config saved to /var/cache/conftool/dbconfig/20221102-163334-ladsgroup.json [16:33:47] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [16:34:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:35:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [16:35:25] the too many messages to kafka logging is me sorry :( [16:35:41] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:36:09] (CR) JMeybohm: [C: +1] kask, thumbor: update invalid base requests [deployment-charts] - https://gerrit.wikimedia.org/r/852240 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [16:36:21] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) Thanks for the response brian, in genral i think that ansible could be better and i think some of the points around puppet dying and the different strength of the c... 
[16:36:56] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:37:04] (CR) Hnowlan: [C: +2] kask, thumbor: update invalid base requests [deployment-charts] - https://gerrit.wikimedia.org/r/852240 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [16:37:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [16:38:10] (CR) Clément Goubert: [C: +1] kask, thumbor: update invalid base requests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/852240 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [16:38:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37817 and previous config saved to /var/cache/conftool/dbconfig/20221102-163829-ladsgroup.json [16:38:47] (CR) Dzahn: "and I see in Horizon it's already gone. thanks" [puppet] - https://gerrit.wikimedia.org/r/850541 (owner: Dzahn) [16:38:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [16:38:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[16:41:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P37818 and previous config saved to /var/cache/conftool/dbconfig/20221102-164123-marostegui.json [16:41:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [16:41:50] (Merged) jenkins-bot: kask, thumbor: update invalid base requests [deployment-charts] - https://gerrit.wikimedia.org/r/852240 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [16:42:26] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [16:43:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [16:43:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:44:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:44:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:45:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:46:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[16:46:25] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:46:50] completed the ml rollout :) [16:48:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:49:04] ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) [16:49:24] ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) p:Triage→Medium [16:50:29] (CR) Volans: "replies inline" [puppet] - https://gerrit.wikimedia.org/r/852192 (owner: Volans) [16:52:28] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [16:53:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T318955)', diff saved to https://phabricator.wikimedia.org/P37819 and previous config saved to /var/cache/conftool/dbconfig/20221102-165337-ladsgroup.json [16:53:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [16:53:42] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:53:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [16:53:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:54:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T318955)', diff saved to 
https://phabricator.wikimedia.org/P37820 and previous config saved to /var/cache/conftool/dbconfig/20221102-165400-ladsgroup.json [16:54:21] (CR) BCornwall: [C: +1] team-traffic: drop VarnishTrafficDrop and HAProxyEdgeTrafficDrop [alerts] - https://gerrit.wikimedia.org/r/852206 (https://phabricator.wikimedia.org/T322220) (owner: Ssingh) [16:54:37] ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) Switch infor ` asw2-23-ulsfo (WMF7220) interface: xe-2/0/11 cable id: cp4052d vlan-id; 1211 ` [16:55:08] SRE, ops-eqiad, decommission-hardware, serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (wiki_willy) a:Clement_Goubert→Jclark-ctr Confirmed with Alex that this one is ready for Dc-Ops now. Thanks, Willy [16:55:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T318605)', diff saved to https://phabricator.wikimedia.org/P37821 and previous config saved to /var/cache/conftool/dbconfig/20221102-165523-ladsgroup.json [16:55:31] ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) [16:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T318955)', diff saved to https://phabricator.wikimedia.org/P37822 and previous config saved to /var/cache/conftool/dbconfig/20221102-165614-ladsgroup.json [16:56:25] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:56:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37823 and previous config saved to /var/cache/conftool/dbconfig/20221102-165631-marostegui.json [16:56:34] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:56:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:56:41] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [16:56:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:56:54] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:56:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T321123)', diff saved to https://phabricator.wikimedia.org/P37824 and previous config saved to /var/cache/conftool/dbconfig/20221102-165656-marostegui.json [16:58:08] ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) [16:59:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321123)', diff saved to https://phabricator.wikimedia.org/P37825 and previous config saved to /var/cache/conftool/dbconfig/20221102-165927-marostegui.json [17:00:19] SRE, serviceops, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (Dzahn) [17:02:20] SRE, serviceops, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (Dzahn) @LSobanski comment about the incident with contint1001 is at T294276#8357385 [17:03:08] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Dzahn) [17:03:26] SRE, serviceops, serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (Dzahn) I think this is currently blocked on T313830. 
[17:04:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:04:25] !log hashar@deploy1002 Started deploy [integration/docroot@8d2f4a0]: Remove .zuul-change font-weight - T322168 [17:04:36] !log hashar@deploy1002 Finished deploy [integration/docroot@8d2f4a0]: Remove .zuul-change font-weight - T322168 (duration: 00m 10s) [17:04:41] T322168: Update Zuul status page to WMUI (remove last bit of Bootstrap) - https://phabricator.wikimedia.org/T322168 [17:06:01] (PS1) Dzahn: devtools/phabricator: add profile::phabricator::main::dumps_rsync_clients [puppet] - https://gerrit.wikimedia.org/r/852244 [17:06:15] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:06:43] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [17:10:19] (PS1) Vgutierrez: deployment-prep: Add ms-be0[78] as storagehosts [puppet] - https://gerrit.wikimedia.org/r/852245 (https://phabricator.wikimedia.org/T322231) [17:10:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P37826 and previous config saved to /var/cache/conftool/dbconfig/20221102-171032-ladsgroup.json [17:11:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37827 and previous config saved to /var/cache/conftool/dbconfig/20221102-171122-ladsgroup.json [17:14:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P37828 and previous config saved to /var/cache/conftool/dbconfig/20221102-171436-marostegui.json [17:15:40] (CR) Jbond: [C: +1] "ack thanks for the responses lgtm" [puppet] - https://gerrit.wikimedia.org/r/852192 
(owner: 10Volans) [17:16:52] (03CR) 10Dzahn: [C: 03+2] devtools/phabricator: add profile::phabricator::main::dumps_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/852244 (owner: 10Dzahn) [17:16:57] (03PS2) 10Dzahn: devtools/phabricator: add profile::phabricator::main::dumps_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/852244 [17:18:55] (03CR) 10Volans: [C: 03+2] "Thanks, merging." [puppet] - 10https://gerrit.wikimedia.org/r/852192 (owner: 10Volans) [17:20:20] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Sustainability (Incident Followup): Add basic alerting to the Beta Cluster - https://phabricator.wikimedia.org/T315695 (10TheresNoTime) a:05TheresNoTime→03None [17:22:57] (03CR) 10Dzahn: [C: 03+2] "removed in Horizon. I don't expect we ever want to set dumps hosts inside the cloud VPS project. But if needed it can always be reverted h" [puppet] - 10https://gerrit.wikimedia.org/r/852244 (owner: 10Dzahn) [17:24:01] (03Abandoned) 10Samtar: update_version.py: Resolve PendingDeprecationWarning [deployment-charts] - 10https://gerrit.wikimedia.org/r/803886 (https://phabricator.wikimedia.org/T310133) (owner: 10Samtar) [17:25:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P37829 and previous config saved to /var/cache/conftool/dbconfig/20221102-172540-ladsgroup.json [17:26:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37830 and previous config saved to /var/cache/conftool/dbconfig/20221102-172630-ladsgroup.json [17:26:31] (03CR) 10Samtar: [C: 03+1] "looks fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/852245 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [17:29:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P37831 and previous 
config saved to /var/cache/conftool/dbconfig/20221102-172944-marostegui.json [17:35:04] (03PS1) 10Dzahn: clouddumps/phabricator: rename rsync module to fix dumps sync [puppet] - 10https://gerrit.wikimedia.org/r/852252 (https://phabricator.wikimedia.org/T322221) [17:36:55] (03CR) 10Dzahn: "root@phab1001:/etc/rsync.d# ls" [puppet] - 10https://gerrit.wikimedia.org/r/852252 (https://phabricator.wikimedia.org/T322221) (owner: 10Dzahn) [17:37:34] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37921/clouddumps1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/852252 (https://phabricator.wikimedia.org/T322221) (owner: 10Dzahn) [17:38:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] clouddumps/phabricator: rename rsync module to fix dumps sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852252 (https://phabricator.wikimedia.org/T322221) (owner: 10Dzahn) [17:38:56] (03PS1) 10Clare Ming: Add config for Visual Editor Feature Use instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) [17:39:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp4052.ulsfo.wmnet [17:39:52] (03CR) 10CI reject: [V: 04-1] Add config for Visual Editor Feature Use instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [17:40:17] !log clouddumps1002 - /usr/local/bin/dump-fetch-phabdumps.sh T322221 [17:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:37] T322221: The service unit dumps-fetch-phabdumps.service is in failed status on host clouddumps1002 - https://phabricator.wikimedia.org/T322221 [17:40:41] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "clouddumps1002:" [puppet] - 10https://gerrit.wikimedia.org/r/852252 (https://phabricator.wikimedia.org/T322221) (owner: 10Dzahn) [17:40:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1175 (T318605)', diff saved to https://phabricator.wikimedia.org/P37832 and previous config saved to /var/cache/conftool/dbconfig/20221102-174048-ladsgroup.json [17:40:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [17:41:02] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:41:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [17:41:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T318605)', diff saved to https://phabricator.wikimedia.org/P37833 and previous config saved to /var/cache/conftool/dbconfig/20221102-174110-ladsgroup.json [17:41:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T318955)', diff saved to https://phabricator.wikimedia.org/P37834 and previous config saved to /var/cache/conftool/dbconfig/20221102-174138-ladsgroup.json [17:41:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:41:44] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:41:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:41:58] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [17:42:41] (03CR) 10Clare Ming: Add config for Visual Editor Feature Use instrument (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852254 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [17:42:51] (03PS1) 10Papaul: change cp4052 role in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/852255 (https://phabricator.wikimedia.org/T322238) [17:43:24] !log pt1979@cumin2002 
START - Cookbook sre.dns.netbox [17:44:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321123)', diff saved to https://phabricator.wikimedia.org/P37835 and previous config saved to /var/cache/conftool/dbconfig/20221102-174451-marostegui.json [17:44:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:44:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:45:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37836 and previous config saved to /var/cache/conftool/dbconfig/20221102-174504-marostegui.json [17:45:13] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [17:45:27] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:46:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4052.ulsfo.wmnet [17:46:37] SRE, ops-ulsfo, Patch-For-Review: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by pt1979@cumin2002 for hosts: `cp4052.ulsfo.wmnet` - cp4052.ulsfo.wmnet (**PASS**) - Downtimed host on Icinga/...
[17:47:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37837 and previous config saved to /var/cache/conftool/dbconfig/20221102-174737-marostegui.json [17:52:01] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [17:53:02] (03PS1) 10Dzahn: dumps: datasets/fetcher, add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852256 [17:58:54] (03PS1) 10Volans: json-webrequests-stats: fix docstring escape [puppet] - 10https://gerrit.wikimedia.org/r/852257 [18:00:04] jeena and jnuche: gettimeofday() says it's time for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T1800) [18:00:05] jeena and jnuche: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T1800). [18:01:50] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852258 (https://phabricator.wikimedia.org/T320513) [18:01:52] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852258 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [18:01:54] (03CR) 10Volans: [C: 03+2] json-webrequests-stats: fix docstring escape [puppet] - 10https://gerrit.wikimedia.org/r/852257 (owner: 10Volans) [18:02:05] (03PS1) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 [18:02:34] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852258 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [18:02:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P37838 and 
previous config saved to /var/cache/conftool/dbconfig/20221102-180245-marostegui.json [18:05:30] (03PS2) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [18:05:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:06:39] (03PS3) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [18:06:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) Note that {T316195} needs to happen. Perhaps both can be/will be accomplished during this time? [18:06:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:06:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:06:53] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.8 refs T320513 [18:07:13] T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [18:07:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:09:10] (03CR) 10CI reject: [V: 04-1] dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:10:37] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.8 refs T320513 (duration: 03m 43s) [18:10:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318605)', diff saved to https://phabricator.wikimedia.org/P37839 and previous config saved to /var/cache/conftool/dbconfig/20221102-181039-ladsgroup.json [18:10:56] T318605: Deploy new externallinks fields to production - 
https://phabricator.wikimedia.org/T318605 [18:12:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:13:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:13:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:14:00] (03PS1) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [18:14:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:36] (03CR) 10Papaul: [C: 03+2] change cp4052 role in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/852255 (https://phabricator.wikimedia.org/T322238) (owner: 10Papaul) [18:16:43] (03PS4) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [18:17:00] (03CR) 10Dzahn: "btw:" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:17:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P37840 and previous config saved to /var/cache/conftool/dbconfig/20221102-181753-marostegui.json [18:17:58] 10SRE, 10ops-ulsfo, 10Patch-For-Review: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (10Papaul) [18:19:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:38] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:24:01] 10SRE, 
10ops-ulsfo, 10Patch-For-Review: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (10Papaul) issue 1: the decommission cookbook remove the switch configuration from the switch but not from Netbox [18:24:58] (03CR) 10Dzahn: [V: 03+1 C: 03+2] clouddumps/phabricator: rename rsync module to fix dumps sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852252 (https://phabricator.wikimedia.org/T322221) (owner: 10Dzahn) [18:25:34] (03PS1) 10Dzahn: phabricator: stop phab2001 from being an rsync client [puppet] - 10https://gerrit.wikimedia.org/r/852261 (https://phabricator.wikimedia.org/T322250) [18:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P37841 and previous config saved to /var/cache/conftool/dbconfig/20221102-182548-ladsgroup.json [18:29:05] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37923/" [puppet] - 10https://gerrit.wikimedia.org/r/852261 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [18:29:32] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [18:30:26] (03CR) 10Dzahn: [V: 03+1] "as you can see in the compiler output above, this changes firewall rules and excludes phab2001 from rsync access on all other hosts. 
also " [puppet] - 10https://gerrit.wikimedia.org/r/852261 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [18:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318605)', diff saved to https://phabricator.wikimedia.org/P37842 and previous config saved to /var/cache/conftool/dbconfig/20221102-183031-ladsgroup.json [18:30:47] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321123)', diff saved to https://phabricator.wikimedia.org/P37843 and previous config saved to /var/cache/conftool/dbconfig/20221102-183305-marostegui.json [18:33:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [18:33:19] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [18:33:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2178.codfw.wmnet with reason: Maintenance [18:33:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T321123)', diff saved to https://phabricator.wikimedia.org/P37844 and previous config saved to /var/cache/conftool/dbconfig/20221102-183327-marostegui.json [18:34:00] PROBLEM - Host stat1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:34:12] PROBLEM - Host wcqs1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:34:12] PROBLEM - Host wdqs1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:35:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321123)', diff saved to https://phabricator.wikimedia.org/P37845 and previous config saved to /var/cache/conftool/dbconfig/20221102-183514-marostegui.json [18:35:17] (03PS1) 10Dzahn: phabricator: remove phab2001 from the list of phab servers [puppet] - 
10https://gerrit.wikimedia.org/r/852264 (https://phabricator.wikimedia.org/T322250) [18:39:58] RECOVERY - Host stat1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [18:40:10] RECOVERY - Host wcqs1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:40:10] RECOVERY - Host wdqs1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [18:40:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Jclark-ctr) arclamp1001 B1 U40 cableID 23000021 port40 [18:40:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Jclark-ctr) [18:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P37846 and previous config saved to /var/cache/conftool/dbconfig/20221102-184056-ladsgroup.json [18:41:20] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Jclark-ctr) puppetdb1003 B1 U39 cableID 23000044 port39 [18:41:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [18:41:55] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Jclark-ctr) [18:42:11] (03PS2) 10Dzahn: phabricator: remove phab2001 from the list of phab servers [puppet] - 10https://gerrit.wikimedia.org/r/852264 (https://phabricator.wikimedia.org/T322250) [18:42:21] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [18:45:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:39] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P37847 and previous config saved to /var/cache/conftool/dbconfig/20221102-184538-ladsgroup.json [18:45:47] (03PS4) 10Jbond: directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 [18:45:49] (03PS6) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [18:46:16] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:18] (03PS1) 10Dzahn: phabricator: switch phab2001 to phab2002 in commented line [dns] - 10https://gerrit.wikimedia.org/r/852266 (https://phabricator.wikimedia.org/T322250) [18:47:31] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [18:47:38] (03CR) 10CI reject: [V: 04-1] directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [18:47:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson dbprov1004 D7 U31 cableID 4901 port19 [18:47:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (10Jclark-ctr) [18:48:22] (03PS1) 10JHathaway: aux-k8s: drop lvm config for workers, attempt three [puppet] - 10https://gerrit.wikimedia.org/r/852268 (https://phabricator.wikimedia.org/T321137) [18:50:13] (03CR) 10JHathaway: [C: 03+2] aux-k8s: drop lvm config for workers, attempt three [puppet] - 10https://gerrit.wikimedia.org/r/852268 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) 
[18:50:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P37848 and previous config saved to /var/cache/conftool/dbconfig/20221102-185023-marostegui.json [18:50:25] (03PS1) 10Andrew Bogott: designate nova_fixed_multi: use _update_or_delete_recordset for record delete [puppet] - 10https://gerrit.wikimedia.org/r/852270 (https://phabricator.wikimedia.org/T305828) [18:51:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:12] (03CR) 10Herron: [C: 03+1] dispatch: refactor/simplify db profile [puppet] - 10https://gerrit.wikimedia.org/r/851693 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [18:51:20] (03CR) 10CI reject: [V: 04-1] designate nova_fixed_multi: use _update_or_delete_recordset for record delete [puppet] - 10https://gerrit.wikimedia.org/r/852270 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [18:51:41] (03PS1) 10Dzahn: rename varnish service alias for phab2001-aphlict? [dns] - 10https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) [18:52:22] (03CR) 10Dzahn: "also see: https://gerrit.wikimedia.org/r/c/operations/dns/+/852272" [dns] - 10https://gerrit.wikimedia.org/r/852266 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [18:54:12] (03CR) 10Dzahn: "you are welcome Raymond. 
Since I don't really have much to do with the actual content of this patch and it was just a technical comment I " [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:54:34] (03PS7) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [18:56:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318605)', diff saved to https://phabricator.wikimedia.org/P37849 and previous config saved to /var/cache/conftool/dbconfig/20221102-185604-ladsgroup.json [18:56:05] (03CR) 10Dzahn: "or maybe this 'varnish puppetization woes' comment is not current anymore and nothing would happen if this is removed, I am not sure yet" [dns] - 10https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [18:56:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [18:56:19] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:56:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [18:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T318605)', diff saved to https://phabricator.wikimedia.org/P37850 and previous config saved to /var/cache/conftool/dbconfig/20221102-185627-ladsgroup.json [18:57:03] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [18:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:59:33] (03PS8) 
10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [19:00:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P37851 and previous config saved to /var/cache/conftool/dbconfig/20221102-190048-ladsgroup.json [19:01:18] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [19:03:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:25] (03CR) 10Dzahn: "Sorry, I don't think I really remember this or have something to add here. 
So in an effort to clean up my Gerrit I will respectfully remov" [puppet] - 10https://gerrit.wikimedia.org/r/765629 (owner: 10Jbond) [19:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:04:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: stop phab2001 from being an rsync client [puppet] - 10https://gerrit.wikimedia.org/r/852261 (https://phabricator.wikimedia.org/T322250) (owner: 10Dzahn) [19:05:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P37852 and previous config saved to /var/cache/conftool/dbconfig/20221102-190531-marostegui.json [19:08:27] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10RobH) [19:08:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10RobH) [19:09:38] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10RobH) @Marostegui, Can you populate the racking info (partitioning, network details, any rack restrictions) and then assign this over to @Jclark-ctr? Thanks! 
[19:11:21] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=phab2001-vcs.codfw.wmnet [19:11:32] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [19:15:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318605)', diff saved to https://phabricator.wikimedia.org/P37853 and previous config saved to /var/cache/conftool/dbconfig/20221102-191557-ladsgroup.json [19:15:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [19:16:09] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:16:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [19:16:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [19:16:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [19:16:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T318605)', diff saved to https://phabricator.wikimedia.org/P37854 and previous config saved to /var/cache/conftool/dbconfig/20221102-191623-ladsgroup.json [19:17:53] Gerrit is awfully slow [19:19:15] Maybe back [19:19:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P37855 and previous config saved to /var/cache/conftool/dbconfig/20221102-192014-ladsgroup.json [19:20:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321123)', diff saved to https://phabricator.wikimedia.org/P37856 and previous config saved to /var/cache/conftool/dbconfig/20221102-192039-marostegui.json [19:20:50] 10SRE, 10ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (10Papaul) BIOS reset to factory `` [Job ID=JID_674345824899] Job Name=System_Erase Status=Completed Scheduled Start Time=[Now] Expiration Time=[Not Applicable] Actual Start Time=[Not Applicable] Act... [19:21:38] (03PS2) 10DDesouza: Remove Research Incentive survey from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851713 (https://phabricator.wikimedia.org/T318333) [19:21:45] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [19:22:45] (03PS1) 10JHathaway: aux-k8s: add wikimedia cluster for workers [puppet] - 10https://gerrit.wikimedia.org/r/852275 (https://phabricator.wikimedia.org/T321137) [19:23:32] (03CR) 10JHathaway: [C: 03+2] aux-k8s: add wikimedia cluster for workers [puppet] - 10https://gerrit.wikimedia.org/r/852275 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [19:25:37] (03CR) 10BBlack: [C: 03+1] varnish: Fix identation [puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [19:31:40] 10SRE, 10ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (10Papaul) [19:35:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P37857 and previous config saved to /var/cache/conftool/dbconfig/20221102-193522-ladsgroup.json [19:35:28] 10SRE, 10ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - 
https://phabricator.wikimedia.org/T322238 (10Papaul) [19:36:17] (03PS9) 10Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 [19:37:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:38:21] (03CR) 10CI reject: [V: 04-1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [19:38:34] PROBLEM - Check systemd state on aux-k8s-worker1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:39:59] 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10MusikAnimal) I originally read T320675#8330640 as giving the go-ahead, but with the shared unders... 
[19:44:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:44:46] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4052 [19:45:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4052 [19:47:40] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Add ms-be0[78] as storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/852245 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [19:48:15] 10SRE, 10ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (10Papaul) [19:50:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318605)', diff saved to https://phabricator.wikimedia.org/P37858 and previous config saved to /var/cache/conftool/dbconfig/20221102-195019-ladsgroup.json [19:50:27] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:50:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P37859 and previous config saved to /var/cache/conftool/dbconfig/20221102-195029-ladsgroup.json [19:52:41] (03PS2) 10Dzahn: rename varnish service alias for phab2001-aphlict? 
[dns] - https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) [19:53:58] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:54:06] (PS3) Dzahn: delete varnish service alias for phab2001-aphlict [dns] - https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) [19:54:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:55:39] (CR) BBlack: [C: +1] delete varnish service alias for phab2001-aphlict [dns] - https://gerrit.wikimedia.org/r/852272 (https://phabricator.wikimedia.org/T322250) (owner: Dzahn) [19:56:23] (PS1) Jbond: worker: store catalogs as gziped file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852280 [19:57:36] (CR) CI reject: [V: -1] worker: store catalogs as gziped file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852280 (owner: Jbond) [19:59:58] I can deploy! [20:00:04] (CR) BBlack: [C: -1] "Getting very very close now!" [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall) [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221102T2000). [20:00:05] Jdlrobson and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] (PS2) Jbond: worker: store catalogs as gziped file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852280 [20:00:12] i can deploy today [20:00:28] or you can :D [20:00:34] TheresNoTime: oh, you were quicker than jouncebot! [20:00:45] feel free to do it :) [20:00:51] okay ^^ [20:01:10] Jdlrobson: hi, going to start with your logos patch if you're ready? [20:01:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:01:26] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:01:46] (PS6) Samtar: Fix remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson) [20:02:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:02:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:02:21] (CR) CI reject: [V: -1] worker: store catalogs as gziped file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852280 (owner: Jbond) [20:02:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:02:56] or danisztls we can start with yours if you're (more) about? 
:) [20:03:12] TheresNoTime: ok [20:03:27] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851713 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:04:34] (Merged) jenkins-bot: Remove Research Incentive survey from enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851713 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:04:50] (PS1) Vgutierrez: swift: Add ms-be0[78] to deployment-prep cluster [puppet] - https://gerrit.wikimedia.org/r/852281 (https://phabricator.wikimedia.org/T322231) [20:05:03] !log samtar@deploy1002 Started scap: Backport for [[gerrit:851713|Remove Research Incentive survey from enwiki (T318333)]] [20:05:18] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:05:19] T318333: Deploy Research Incentive Survey targeting Sub-Saharan Africa and Latin America readers on English Wikipedia - https://phabricator.wikimedia.org/T318333 [20:05:26] !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:851713|Remove Research Incentive survey from enwiki (T318333)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:05:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P37860 and previous config saved to /var/cache/conftool/dbconfig/20221102-200528-ladsgroup.json [20:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318605)', diff saved to https://phabricator.wikimedia.org/P37861 and previous config saved to 
/var/cache/conftool/dbconfig/20221102-200537-ladsgroup.json [20:05:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [20:05:42] danisztls: live on mwdebug, can you test? :) [20:05:48] TheresNoTime: yes [20:06:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [20:06:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T318605)', diff saved to https://phabricator.wikimedia.org/P37862 and previous config saved to /var/cache/conftool/dbconfig/20221102-200610-ladsgroup.json [20:06:38] TheresNoTime: lgtm [20:06:47] cool, syncing [20:07:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:07:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:08:04] (CR) Samtar: swift: Add ms-be0[78] to deployment-prep cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852281 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [20:08:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:08:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:09:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:09:54] (CR) Vgutierrez: swift: Add ms-be0[78] to deployment-prep cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852281 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [20:10:08] hey sorry im late TheresNoTime [20:10:21] Jdlrobson: no worries, you're next :) [20:10:48] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:851713|Remove Research Incentive survey from enwiki (T318333)]] (duration: 05m 45s) [20:10:49] (CR) Samtar: [C: +1] swift: Add ms-be0[78] to deployment-prep cluster (1 
comment) [puppet] - https://gerrit.wikimedia.org/r/852281 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [20:10:55] T318333: Deploy Research Incentive Survey targeting Sub-Saharan Africa and Latin America readers on English Wikipedia - https://phabricator.wikimedia.org/T318333 [20:11:03] that's live danisztls :) [20:11:11] TheresNoTime: thanks! [20:11:17] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson) [20:12:08] (Merged) jenkins-bot: Fix remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) (owner: Jdlrobson) [20:12:29] (CR) Vgutierrez: [C: +2] swift: Add ms-be0[78] to deployment-prep cluster [puppet] - https://gerrit.wikimedia.org/r/852281 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [20:12:36] !log samtar@deploy1002 Started scap: Backport for [[gerrit:849175|Fix remaining Wikipedia logos (T319223)]] [20:12:44] T319223: [XL] Deploy new set of logos for all Wikipedias except Gothic Wikipedia - https://phabricator.wikimedia.org/T319223 [20:13:00] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:849175|Fix remaining Wikipedia logos (T319223)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:13:09] Jdlrobson: live on mwdebug, can you test a (few) sites? 
:) [20:13:16] I'm ready to purge the cache this time :D [20:13:34] TheresNoTime: looking [20:14:18] TheresNoTime LGTM [20:14:25] syncing :) [20:14:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:14:45] (PS1) Jbond: controler: add the hostname to the state to enabl debugging [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 [20:15:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:15:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:16:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:16:19] (PS2) Andrew Bogott: designate nova_fixed_multi: use _update_or_delete_recordset [puppet] - https://gerrit.wikimedia.org/r/852270 (https://phabricator.wikimedia.org/T305828) [20:16:42] (CR) CI reject: [V: -1] controler: add the hostname to the state to enabl debugging [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 (owner: Jbond) [20:17:14] (CR) Andrew Bogott: [C: +2] designate nova_fixed_multi: use _update_or_delete_recordset [puppet] - https://gerrit.wikimedia.org/r/852270 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:17:29] TheresNoTime: thanks! seeing it now without Xdebug! 
[20:17:35] great :D [20:17:46] (it's almost finished syncing out) [20:18:02] then I'll clear the caches one last time just to be safe [20:18:23] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:849175|Fix remaining Wikipedia logos (T319223)]] (duration: 05m 46s) [20:18:35] T319223: [XL] Deploy new set of logos for all Wikipedias except Gothic Wikipedia - https://phabricator.wikimedia.org/T319223 [20:18:50] Done and caches cleared :) [20:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P37863 and previous config saved to /var/cache/conftool/dbconfig/20221102-202037-ladsgroup.json [20:21:01] * TheresNoTime will be around for another 10 minutes or so if there's any more patches [20:21:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:22:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:22:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:23:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318605)', diff saved to https://phabricator.wikimedia.org/P37864 and previous config saved to /var/cache/conftool/dbconfig/20221102-202815-ladsgroup.json [20:28:45] (PS1) Vgutierrez: swift: Drain ms-be0[56]@deployment-prep cluster [puppet] - https://gerrit.wikimedia.org/r/852290 (https://phabricator.wikimedia.org/T322231) [20:29:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:30:04] (CR) Vgutierrez: [C: +2] swift: Drain ms-be0[56]@deployment-prep cluster [puppet] - https://gerrit.wikimedia.org/r/852290 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [20:30:20] RECOVERY - Check systemd state on an-launcher1002 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:59] (CR) Herron: slo_dashboards: move to one SLO/SLI per dashboard (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) (owner: Herron) [20:31:56] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:32:26] (CR) Andrew Bogott: [C: +2] Openstack: manifests for glance, nova, keystone, placement version Y [puppet] - https://gerrit.wikimedia.org/r/851168 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:32:28] (CR) Andrew Bogott: [C: +2] Openstack: Add manifests for Neutron version Yoga [puppet] - https://gerrit.wikimedia.org/r/851169 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:32:30] (CR) Andrew Bogott: [C: +2] Openstack: Add manifests for Trove version Yoga [puppet] - https://gerrit.wikimedia.org/r/851170 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:32:32] (CR) Andrew Bogott: [C: +2] Openstack: Add manifests for Heat and Magnum version Yoga [puppet] - https://gerrit.wikimedia.org/r/851171 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:32:34] (CR) Andrew Bogott: [C: +2] Openstack: Add manifests for Cinder version Yoga [puppet] - https://gerrit.wikimedia.org/r/851172 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:32:36] (CR) Andrew Bogott: [C: +2] Openstack: Add manifests for Barbican version Yoga [puppet] - 
https://gerrit.wikimedia.org/r/851173 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:32:38] (CR) Andrew Bogott: [C: +2] codfw1dev openstack -> version 'yoga' [puppet] - https://gerrit.wikimedia.org/r/851174 (https://phabricator.wikimedia.org/T305828) (owner: Andrew Bogott) [20:33:11] !log UTC late backport window [20:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:29] um, "closing" UTC late backport window, but w/e [20:35:18] (PS1) Vgutierrez: swift: Fix syntax error on deployment-prep config file [puppet] - https://gerrit.wikimedia.org/r/852293 (https://phabricator.wikimedia.org/T322231) [20:35:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318605)', diff saved to https://phabricator.wikimedia.org/P37865 and previous config saved to /var/cache/conftool/dbconfig/20221102-203547-ladsgroup.json [20:35:48] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:35:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [20:35:53] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:36:07] (CR) Vgutierrez: [C: +2] swift: Fix syntax error on deployment-prep config file [puppet] - https://gerrit.wikimedia.org/r/852293 (https://phabricator.wikimedia.org/T322231) (owner: Vgutierrez) [20:36:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [20:36:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37866 and previous config saved to /var/cache/conftool/dbconfig/20221102-203621-ladsgroup.json [20:38:23] (CR) BCornwall: [C: +2] varnish: Fix identation [puppet] - https://gerrit.wikimedia.org/r/829319 (owner: Zabe) [20:39:25] SRE-swift-storage, Community-Tech, MediaWiki-extensions-Phonos, MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (Eevans) >>! In T320675#8364656, @MusikAnimal wrote: > I originally read T320675#8330640 as giving... 
[20:41:18] SRE, Infrastructure-Foundations, puppet-compiler, User-jbond: puppet-catalog-compiler: compilation result randomly places servers in the wrong section - https://phabricator.wikimedia.org/T224977 (jbond) i have done some tests today and notice that the error is in the get_states function where we... [20:43:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P37867 and previous config saved to /var/cache/conftool/dbconfig/20221102-204325-ladsgroup.json [20:51:31] (PS12) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [20:54:15] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:55:19] (CR) BCornwall: "I've also rebased/edited to use spaces instead of tabs now that I3fb7c93ca32c4f2bbb9fdc3224d8a89bfe672cc7 has been merged in." 
[puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall) [20:56:52] (PS2) Jbond: controler: add the hostname to the state to enabl debugging [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 [20:58:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P37868 and previous config saved to /var/cache/conftool/dbconfig/20221102-205833-ladsgroup.json [21:00:46] (Abandoned) Jforrester: apple-search: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/782183 (owner: PipelineBot) [21:01:19] (CR) CI reject: [V: -1] controler: add the hostname to the state to enabl debugging [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 (owner: Jbond) [21:10:44] (CR) BBlack: [C: +1] "Success I think! (but let's merge tomorrow when there's more daylight to keep an eye on things though). Nice work!" [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall) [21:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318605)', diff saved to https://phabricator.wikimedia.org/P37869 and previous config saved to /var/cache/conftool/dbconfig/20221102-211342-ladsgroup.json [21:13:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:13:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:13:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:14:02] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Thanks @Volans. 
After you poke around a bit, let us know the best way of proceeding forward. If we can automate it via webhooks, that would be terrific. But i... [21:21:50] (PS3) Jbond: controler: add the hostname to the state to enabl debugging [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 [21:24:07] (CR) CI reject: [V: -1] controler: add the hostname to the state to enabl debugging [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 (owner: Jbond) [21:28:08] (PS4) Jbond: controller: fix get_states to avoid list reordering [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) [21:30:22] (CR) CI reject: [V: -1] controller: fix get_states to avoid list reordering [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) (owner: Jbond) [21:31:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [21:35:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [21:39:46] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (wiki_willy) a:Cmjohnson [21:40:07] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (wiki_willy) a:Cmjohnson [21:42:00] SRE, ops-eqiad, Infrastructure-Foundations, netops: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (wiki_willy) a:Jclark-ctr [21:50:40] (CR) Jbond: "See the following for an example, ill fix tests tomorrow" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/746947 (owner: Jbond) [21:53:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp4052'] [21:53:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=99) upgrade firmware for hosts ['cp4052'] [21:57:49] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp4052'] [21:58:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp4052'] [22:06:29] SRE, Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (LSobanski) [22:08:15] SRE, ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) [22:11:18] SRE, ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) @Volans @jbond the provision cookbook did run with no issues but the firmware cookbook did fail because it was missing "FileNotFoundError: [Errno 2] No such file or directory: '/srv/firmwar... [22:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37871 and previous config saved to /var/cache/conftool/dbconfig/20221102-224014-ladsgroup.json [22:40:19] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:51:54] (PS1) Andrew Bogott: Rename live_upgrade_ussuri_to_victoria.py to remove version-specific name [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852312 [22:51:56] (PS1) Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852313 [22:55:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P37872 and previous config saved to /var/cache/conftool/dbconfig/20221102-225523-ladsgroup.json [22:55:50] (CR) CI reject: [V: -1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852313 
(owner: Andrew Bogott) [22:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:00:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:05:14] (PS1) Arlolra: Media border option applies to the media element, not the wrapper [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) [23:06:07] (CR) Arlolra: "There's no train next week, so I'd like to backport this" [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) (owner: Arlolra) [23:06:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:10] PROBLEM - Disk space on conf1007 is CRITICAL: DISK CRITICAL - free space: / 2685 MB (3% inode=98%): /tmp 2685 MB (3% inode=98%): /var/tmp 2685 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1007&var-datasource=eqiad+prometheus/ops [23:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P37873 and previous config saved to /var/cache/conftool/dbconfig/20221102-231031-ladsgroup.json [23:14:25] (CR) Jforrester: "Why not just set 
wgVisualEditorEnableDiffPage directly in InitialiseSettings.php? No need to go through wmgVisualEditorEnableDiffPage re-d" [mediawiki-config] - https://gerrit.wikimedia.org/r/833831 (owner: Esanders) [23:14:52] (PS2) Andrew Bogott: Rename live_upgrade_ussuri_to_victoria.py to remove version-specific name [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852312 [23:14:54] (PS2) Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852313 [23:18:01] (CR) CI reject: [V: -1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852313 (owner: Andrew Bogott) [23:25:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37874 and previous config saved to /var/cache/conftool/dbconfig/20221102-232540-ladsgroup.json [23:25:44] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:35:06] PROBLEM - Disk space on conf1009 is CRITICAL: DISK CRITICAL - free space: / 2384 MB (3% inode=98%): /tmp 2384 MB (3% inode=98%): /var/tmp 2384 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=conf1009&var-datasource=eqiad+prometheus/ops [23:36:37] (CR) Jdlrobson: [C: +1] Media border option applies to the media element, not the wrapper [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) (owner: Arlolra) [23:44:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:44:27] (CR) Subramanya Sastry: [C: +1] Media border option applies to the media 
element, not the wrapper [skins/MinervaNeue] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/852314 (https://phabricator.wikimedia.org/T318300) (owner: Arlolra) [23:45:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:55:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown