[00:05:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:05:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:05:43] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037490 (owner: TrainBranchBot) [00:10:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:11:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:12:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:12:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:14:54] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:14:59] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:19:00] (CR) Dzahn: "a concern here would be that it seems like it would break in cloud for the test instance" [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway) [00:19:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:19:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:21:29] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:21:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:23:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:23:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:25:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:25:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:27:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:27:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:37:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:37:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:39:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:39:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:44:40] (CR) Dzahn: "This makes the class unusable in cloud VPS because we still have no IPv6 there.
Ticket is open since 2012 https://phabricator.wikimedia.or" [puppet] - https://gerrit.wikimedia.org/r/952457 (owner: Muehlenhoff) [00:46:19] (CR) Dzahn: "before we got away with setting it to "undefined" in Hiera but now not anymore since it expects an array and matching Stdlib::Fqdn" [puppet] - https://gerrit.wikimedia.org/r/952457 (owner: Muehlenhoff) [00:46:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:46:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 9.036% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:48:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:49:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:51:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:51:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:51:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 885.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:51:33] (CR) Dzahn: "setting "profile::gerrit::ipv6: '::1'" could work though (--> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036771)" [puppet] - https://gerrit.wikimedia.org/r/952457 (owner: Muehlenhoff) [00:55:15]
FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.4121462759875094s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:56:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 928.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:00:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.40000280262799653s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [01:01:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 827.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:04:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:04:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:06:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 896.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:06:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:06:21] 
!log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:12:45] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 935.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:33] (PS1) Papaul: Add lsw1-c1 to devices.yaml to test homer after ZTP [homer/public] - https://gerrit.wikimedia.org/r/1037651 (https://phabricator.wikimedia.org/T360789) [01:16:48] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:17:45] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 980.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:18:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:18:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:20:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:20:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:22:45] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 909.6ms -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:24:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:24:37] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:26:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:26:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:27:45] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 914.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:28:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:28:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:31:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:31:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:37:30] FIRING: ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:42:30] RESOLVED: ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:51:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web 
(k8s) 1.356s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:52:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.43229789425248466s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:56:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.084s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:56:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.705s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:57:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.41258821927211314s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [02:01:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 812.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:06:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.276s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:08:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.4224784006038208s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:16:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 972.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:17:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 983.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:18:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.4196329367059834s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:22:45] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 841.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:25:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:25:17] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 20.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:54] !log @deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [02:43:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:44:57] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:45:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:50:29] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:50:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:55:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:58:39] (PS1) MusikAnimal: [beta] Enable CodeMirrorRTL on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1037658 (https://phabricator.wikimedia.org/T170001) [03:03:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:09:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:15:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [03:25:36] RESOLVED: GatewayBackendErrorsHigh:
api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [03:28:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:28:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:37:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:37:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:38:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T352010)', diff saved to https://phabricator.wikimedia.org/P63736 and previous config saved to /var/cache/conftool/dbconfig/20240531-033826-ladsgroup.json [03:38:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:39:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:39:25] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:40:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T364299)', diff saved to https://phabricator.wikimedia.org/P63737 and previous config saved to /var/cache/conftool/dbconfig/20240531-034016-marostegui.json [03:40:22] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [03:41:22] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:41:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:43:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:43:40] !log 
@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:45:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:45:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:53:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P63738 and previous config saved to /var/cache/conftool/dbconfig/20240531-035334-ladsgroup.json [03:54:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:54:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:55:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P63739 and previous config saved to /var/cache/conftool/dbconfig/20240531-035524-marostegui.json [03:59:53] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:59:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:01:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:02:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:05:38] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:05:43] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:05:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:07:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:07:47] !log @deploy1002 
helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:08:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P63740 and previous config saved to /var/cache/conftool/dbconfig/20240531-040842-ladsgroup.json [04:09:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:09:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:10:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P63741 and previous config saved to /var/cache/conftool/dbconfig/20240531-041032-marostegui.json [04:11:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:11:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:13:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:13:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:23:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T352010)', diff saved to https://phabricator.wikimedia.org/P63742 and previous config saved to /var/cache/conftool/dbconfig/20240531-042350-ladsgroup.json [04:23:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [04:23:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [04:24:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [04:24:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T352010)', diff saved to https://phabricator.wikimedia.org/P63743 and previous config saved to 
/var/cache/conftool/dbconfig/20240531-042414-ladsgroup.json [04:25:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T364299)', diff saved to https://phabricator.wikimedia.org/P63744 and previous config saved to /var/cache/conftool/dbconfig/20240531-042540-marostegui.json [04:25:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance [04:25:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:25:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance [04:26:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T364299)', diff saved to https://phabricator.wikimedia.org/P63745 and previous config saved to /var/cache/conftool/dbconfig/20240531-042604-marostegui.json [04:47:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:47:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:49:54] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:50:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:51:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:52:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:53:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:53:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:55:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:55:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:02:52] (PS1) Marostegui: db1165: Remove package
declaration [puppet] - https://gerrit.wikimedia.org/r/1037663 [05:03:16] (CR) Marostegui: [C:+2] db1165: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1037663 (owner: Marostegui) [05:07:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:07:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:09:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:09:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:11:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:12:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:14:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:14:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:16:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:16:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:16:48] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:18:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:18:12] !log @deploy1002 helmfile [codfw]
DONE helmfile.d/services/cirrus-streaming-updater: apply [05:20:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:20:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:26:26] SRE, All-and-every-Wikisource, Product-Analytics, Bengali-Sites, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607#9849379 (Bodhisattwa) [[ https://searchengineland.com/google-search-document-leak-ranking-442617 | This article ]] might be relevant [05:26:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T364299)', diff saved to https://phabricator.wikimedia.org/P63746 and previous config saved to /var/cache/conftool/dbconfig/20240531-052631-marostegui.json [05:26:38] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:28:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:28:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:30:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:30:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:32:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:32:32] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:34:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:34:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:36:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:36:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:38:43] FIRING: [2x] SystemdUnitFailed:
docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:39:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:39:29] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:41:26] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:41:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:41:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P63747 and previous config saved to /var/cache/conftool/dbconfig/20240531-054139-marostegui.json [05:43:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:43:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:45:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:45:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:47:22] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:47:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:49:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:49:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:56:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P63748 and previous config saved to /var/cache/conftool/dbconfig/20240531-055647-marostegui.json [05:58:39] !log @deploy1002 helmfile 
[codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:58:45] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240531T0600) [06:00:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:00:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:11:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T364299)', diff saved to https://phabricator.wikimedia.org/P63749 and previous config saved to /var/cache/conftool/dbconfig/20240531-061156-marostegui.json [06:11:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [06:12:03] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:12:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [06:12:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T364299)', diff saved to https://phabricator.wikimedia.org/P63750 and previous config saved to /var/cache/conftool/dbconfig/20240531-061219-marostegui.json [06:20:03] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2043.codfw.wmnet with OS bookworm [06:20:05] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1043.eqiad.wmnet with OS bookworm [06:23:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:23:08] (03PS2) 10Stevemunene: Remove datahub service entry [puppet] - 10https://gerrit.wikimedia.org/r/1037479 (https://phabricator.wikimedia.org/T366137) [06:23:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:25:09] !log @deploy1002 helmfile [codfw] 
START helmfile.d/services/cirrus-streaming-updater: apply [06:25:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:33:16] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1043.eqiad.wmnet with reason: host reimage [06:33:45] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [06:33:53] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:33:59] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:36:41] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1043.eqiad.wmnet with reason: host reimage [06:38:28] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2043.codfw.wmnet with reason: host reimage [06:40:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:40:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:41:24] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2043.codfw.wmnet with reason: host reimage [06:42:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:42:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:44:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:44:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:46:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:46:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:52:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:52:12] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [06:52:34] RECOVERY - Disk space on backup1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1011&var-datasource=eqiad+prometheus/ops [06:52:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1043.eqiad.wmnet with OS bookworm [06:54:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:54:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:58:25] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2043.codfw.wmnet with OS bookworm [06:59:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [06:59:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240531T0700) [07:01:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:01:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:05:08] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:05:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:09:17] (03PS1) 10Muehlenhoff: Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037667 [07:10:15] (03CR) 10CI reject: [V:04-1] Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037667 (owner: 10Muehlenhoff) [07:10:31] (03CR) 10Brouberol: [C:03+1] "praise: great job with this migration. Let's clean up that thang." 
[puppet] - 10https://gerrit.wikimedia.org/r/1037479 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [07:12:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:12:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:13:07] (03CR) 10Arnaudb: [C:03+1] "happy friday!" [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [07:14:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:14:48] (03CR) 10MVernon: [C:03+2] New cephadm::rgw role [puppet] - 10https://gerrit.wikimedia.org/r/1037558 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [07:14:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:16:19] (03PS2) 10Muehlenhoff: Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037667 [07:16:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:16:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:17:07] (03CR) 10CI reject: [V:04-1] Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037667 (owner: 10Muehlenhoff) [07:17:14] (03CR) 10Stevemunene: [C:03+2] Remove datahub service entry [puppet] - 10https://gerrit.wikimedia.org/r/1037479 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene) [07:18:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:18:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:20:54] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:20:59] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:21:36] (03PS1) 10Muehlenhoff: Remove LDAP access for 
kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037668 [07:22:23] (03CR) 10CI reject: [V:04-1] Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037668 (owner: 10Muehlenhoff) [07:25:15] (03PS3) 10Muehlenhoff: Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037667 [07:28:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:28:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:29:30] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037667 (owner: 10Muehlenhoff) [07:30:45] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on moss-fe1002.eqiad.wmnet with reason: in development [07:30:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on moss-fe1002.eqiad.wmnet with reason: in development [07:31:09] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9849449 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=418e7ac9-ef32-4b52-9f19-645af27090e2) set by mvernon@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: in... 
[07:31:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:31:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:32:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13335 [07:33:36] (03CR) 10Ayounsi: [C:03+1] Policy for Novvacore at magru to not announce Anycast to Cogent [homer/public] - 10https://gerrit.wikimedia.org/r/1037588 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [07:34:09] (03CR) 10Muehlenhoff: [C:03+2] Remove skel files for former WMF staff members [puppet] - 10https://gerrit.wikimedia.org/r/1037064 (owner: 10Muehlenhoff) [07:37:25] (03CR) 10Ayounsi: [C:03+1] "Note that you might need to temporarily change its Netbox status for homer to accept to run on it." [homer/public] - 10https://gerrit.wikimedia.org/r/1037651 (https://phabricator.wikimedia.org/T360789) (owner: 10Papaul) [07:38:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:38:17] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:40:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:40:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:40:31] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1042.eqiad.wmnet with OS bookworm [07:40:34] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2042.codfw.wmnet with OS bookworm [07:42:11] (03Abandoned) 10Muehlenhoff: Remove LDAP access for kelhurd [puppet] - 10https://gerrit.wikimedia.org/r/1037668 (owner: 10Muehlenhoff) [07:48:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:48:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:52:40] (03CR) 10Brouberol: 
"The patch itself looks good and safe enough. I'd like to get a second pair of eyes from someone with more WDQS experience, cc @rkemper@wik" [puppet] - 10https://gerrit.wikimedia.org/r/1024884 (https://phabricator.wikimedia.org/T362570) (owner: 10Lucas Werkmeister) [07:52:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:53:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:54:08] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1042.eqiad.wmnet with reason: host reimage [07:54:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:54:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:56:21] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1042.eqiad.wmnet with reason: host reimage [07:56:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:56:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:57:52] (03CR) 10Driedmueller: "Probably my fault due to less experience with this process. I thought the latest branch is suitable, from where i also branched. 
I change " [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) (owner: 10Driedmueller) [07:58:51] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2042.codfw.wmnet with reason: host reimage [07:58:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:59:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:02:00] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2042.codfw.wmnet with reason: host reimage [08:03:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:03:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:04:04] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9849495 (10ayounsi) That's quite interesting seeing the variation of tradeoffs, and can be quite (an important) rabbithole. Is the goal to figure it out before anycasting ns1, or first anycast ns1 from anywhere then figure... 
[08:05:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:05:43] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:09:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:09:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:12:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:12:32] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:12:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1042.eqiad.wmnet with OS bookworm [08:14:29] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:14:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:16:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:16:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:20:02] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2042.codfw.wmnet with OS bookworm [08:25:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:25:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:27:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:27:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:29:00] !log dcausse@deploy1002 
Started deploy [airflow-dags/search@dabf423]: search: fix graph split snapshot column [08:29:20] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@dabf423]: search: fix graph split snapshot column (duration: 00m 20s) [08:36:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:36:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:38:59] (03Abandoned) 10Driedmueller: Dont recalculate winners from scratch each round [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) (owner: 10Driedmueller) [08:39:44] (03CR) 10Driedmueller: Dont recalculate winners from scratch each round (031 comment) [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) (owner: 10Driedmueller) [08:41:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:41:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:44:14] (03PS1) 10Muehlenhoff: Deprecate system::role for mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/1037729 [08:46:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:46:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:47:06] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1041.eqiad.wmnet with OS bookworm [08:47:17] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2041.codfw.wmnet with OS bookworm [08:48:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:48:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:50:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply 
[08:50:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:52:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:52:18] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:56:06] (03PS1) 10Muehlenhoff: Deprecate system::role for toolforge roles [puppet] - 10https://gerrit.wikimedia.org/r/1037730 [08:57:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037729 (owner: 10Muehlenhoff) [09:00:23] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1041.eqiad.wmnet with reason: host reimage [09:02:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:02:32] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:03:54] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1041.eqiad.wmnet with reason: host reimage [09:04:29] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:04:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:05:03] (03PS1) 10Ayounsi: magru: add BGP to HE [homer/public] - 10https://gerrit.wikimedia.org/r/1037736 (https://phabricator.wikimedia.org/T362421) [09:05:53] (03CR) 10Ayounsi: [C:03+2] magru: add BGP to HE [homer/public] - 10https://gerrit.wikimedia.org/r/1037736 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [09:06:06] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2041.codfw.wmnet with reason: host reimage [09:06:24] (03Merged) 10jenkins-bot: magru: add BGP to HE [homer/public] - 10https://gerrit.wikimedia.org/r/1037736 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [09:06:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply 
[09:06:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:08:23] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2041.codfw.wmnet with reason: host reimage [09:09:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:09:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:09:43] (03PS1) 10Brouberol: Leverage the internal datahub kafka registry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037738 (https://phabricator.wikimedia.org/T363461) [09:11:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:11:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:13:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:13:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:15:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:16:48] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:19:49] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1041.eqiad.wmnet with OS bookworm 
[09:20:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:21:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:21:33] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1037730 (owner: 10Muehlenhoff) [09:22:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:22:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:23:52] (03PS2) 10Brouberol: Leverage the internal datahub kafka registry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037738 (https://phabricator.wikimedia.org/T363461) [09:24:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:24:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:25:45] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2041.codfw.wmnet with OS bookworm [09:27:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:27:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:27:35] (03CR) 10Marostegui: [C:03+1] Deprecate system::role for mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/1037729 (owner: 10Muehlenhoff) [09:29:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:29:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:31:08] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:31:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:33:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:33:16] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [09:40:44] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:58] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1040.eqiad.wmnet with OS bookworm [09:42:01] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2040.codfw.wmnet with OS bookworm [09:46:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:46:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:46:39] (03CR) 10Brouberol: "Once we have that, we could manually run a `datahub ingest` command from a test an-worker host, to see what changes we have to make in htt" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037738 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:51:01] (03PS1) 10Effie Mouzeli: memcached: finish mediawiki migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1037741 (https://phabricator.wikimedia.org/T352891) [09:51:21] (03CR) 10CI reject: [V:04-1] memcached: finish mediawiki migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1037741 (https://phabricator.wikimedia.org/T352891) (owner: 10Effie Mouzeli) [09:51:53] (03PS2) 10Effie Mouzeli: memcached: finish mediawiki migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1037741 (https://phabricator.wikimedia.org/T352891) [09:51:59] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037741 (https://phabricator.wikimedia.org/T352891) (owner: 10Effie Mouzeli) [09:55:15] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1040.eqiad.wmnet with reason: host reimage [09:56:15] (03CR) 10Abijeet Patro: [V:03+2] 
Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037517 (owner: 10L10n-bot) [09:57:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:57:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:58:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1040.eqiad.wmnet with reason: host reimage [09:59:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:59:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:00:04] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2040.codfw.wmnet with reason: host reimage [10:01:50] (03CR) 10Cathal Mooney: "LGTM, one typo I think you accidentally pasted some text with the IP." [homer/public] - 10https://gerrit.wikimedia.org/r/1035376 (https://phabricator.wikimedia.org/T365697) (owner: 10Ayounsi) [10:02:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:02:18] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:02:19] (03CR) 10Cathal Mooney: [C:03+2] Policy for Novvacore at magru to not announce Anycast to Cogent [homer/public] - 10https://gerrit.wikimedia.org/r/1037588 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [10:02:47] (03Merged) 10jenkins-bot: Policy for Novvacore at magru to not announce Anycast to Cogent [homer/public] - 10https://gerrit.wikimedia.org/r/1037588 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [10:03:09] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2040.codfw.wmnet with reason: host reimage [10:03:18] (03PS3) 10Effie Mouzeli: memcached: finish mediawiki memcached migration to bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/1037741 (https://phabricator.wikimedia.org/T352891) [10:04:09] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, I think the Jinja templates may hit an error without the BGP data in sites.yaml added as in the full patch (Iaf402b9bb89370c9fdbea23" [homer/public] - 10https://gerrit.wikimedia.org/r/1037651 (https://phabricator.wikimedia.org/T360789) (owner: 10Papaul) [10:04:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:04:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:06:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:06:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:08:04] (03CR) 10Effie Mouzeli: [C:03+2] memcached: finish mediawiki memcached migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1037741 (https://phabricator.wikimedia.org/T352891) (owner: 10Effie Mouzeli) [10:08:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:08:37] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:10:13] (03CR) 10Ladsgroup: [C:03+1] "Overall looks good to me. I add Timo who might have notes to say." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [10:13:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:13:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:14:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1040.eqiad.wmnet with OS bookworm [10:15:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:15:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:17:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:18:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T364069)', diff saved to https://phabricator.wikimedia.org/P63751 and previous config saved to /var/cache/conftool/dbconfig/20240531-101800-marostegui.json [10:18:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:18:11] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:21:10] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2040.codfw.wmnet with OS bookworm [10:21:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:21:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:23:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:23:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:32:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T364299)', diff saved to https://phabricator.wikimedia.org/P63752 and previous config saved to /var/cache/conftool/dbconfig/20240531-103245-marostegui.json [10:32:51] T364299: Make rc_id a 
bigint - https://phabricator.wikimedia.org/T364299 [10:33:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P63753 and previous config saved to /var/cache/conftool/dbconfig/20240531-103308-marostegui.json [10:35:08] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:35:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:43:53] (03PS1) 10Muehlenhoff: mariadb::ferm_wmcs: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1037766 [10:44:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:44:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:46:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:46:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:47:23] (03PS2) 10Kosta Harlan: geoip: Use GeoLite2 instead of GeoIP2 Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [10:47:23] (03PS3) 10Kosta Harlan: geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) [10:47:23] (03PS1) 10Kosta Harlan: maxmind: Fix parameter order and document user_id/license_key defaults [puppet] - 10https://gerrit.wikimedia.org/r/1037767 [10:47:30] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1039.eqiad.wmnet with OS bookworm [10:47:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:47:46] FIRING: Primary inbound port utilisation over 80% #page: Alert 
for device cr2-eqsin.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:47:50] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2039.codfw.wmnet with OS bookworm [10:47:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P63754 and previous config saved to /var/cache/conftool/dbconfig/20240531-104753-marostegui.json [10:48:13] (CR) CI reject: [V:-1] maxmind: Fix parameter order and document user_id/license_key defaults [puppet] - https://gerrit.wikimedia.org/r/1037767 (owner: Kosta Harlan) [10:48:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P63755 and previous config saved to /var/cache/conftool/dbconfig/20240531-104816-marostegui.json [10:49:33] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9849768 (cmooney) So I tested pushing with Homer to the devices in row D and it was pretty much successful :) NOTE: As the devices need some additional... 
[10:50:04] (PS2) Kosta Harlan: maxmind: Fix parameter order and document user_id/license_key defaults [puppet] - https://gerrit.wikimedia.org/r/1037767 [10:50:04] (PS3) Kosta Harlan: geoip: Use GeoLite2 instead of GeoIP2 Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [10:50:04] (PS4) Kosta Harlan: geoip: Download GeoLite2 ASN file [puppet] - https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) [10:52:45] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:53:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:53:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:53:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: Maintenance [10:54:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: Maintenance [10:55:10] (PS5) Kosta Harlan: geoip: Download GeoLite2 ASN file [puppet] - https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) [10:55:10] (PS4) Kosta Harlan: geoip: Use GeoLite2 instead of GeoIP2 Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [10:55:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:55:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:55:28] (CR) Kosta Harlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) (owner: Kosta 
Harlan) [10:55:39] (CR) Kosta Harlan: geoip: Download GeoLite2 ASN file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037531 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [10:57:45] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240531T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240531T1100). [11:00:29] (CR) Stevemunene: [C:+1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/1037738 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol) [11:03:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P63756 and previous config saved to /var/cache/conftool/dbconfig/20240531-110301-marostegui.json [11:03:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T364069)', diff saved to https://phabricator.wikimedia.org/P63757 and previous config saved to /var/cache/conftool/dbconfig/20240531-110324-marostegui.json [11:03:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [11:03:30] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:03:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [11:03:48] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Depooling db2179 (T364069)', diff saved to https://phabricator.wikimedia.org/P63758 and previous config saved to /var/cache/conftool/dbconfig/20240531-110347-marostegui.json [11:04:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:04:37] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:06:20] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2039.codfw.wmnet with reason: host reimage [11:06:52] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [11:07:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T352010)', diff saved to https://phabricator.wikimedia.org/P63759 and previous config saved to /var/cache/conftool/dbconfig/20240531-110719-ladsgroup.json [11:07:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:07:50] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [11:08:24] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:08:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:09:32] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2039.codfw.wmnet with reason: host reimage [11:10:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037766 (owner: Muehlenhoff) [11:12:03] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:13:42] SRE, Infrastructure-Foundations, 
netops: Include vlans with defined IRB int in device vlans even if no port present - https://phabricator.wikimedia.org/T366348 (cmooney) NEW p:Triage→Low [11:13:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:13:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:14:53] FIRING: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [11:15:53] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:16:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:16:02] (PS1) Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) [11:16:07] (CR) CI reject: [V:-1] Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: Cathal Mooney) [11:16:59] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:17:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:17:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:17:59] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Include vlans with defined IRB int in device vlans even if no port present - https://phabricator.wikimedia.org/T366348#9849823 (cmooney) Diff with this patch applied on one of the new codfw switches: ` 
cmooney@wikilap:~$ homer lsw1-d2-cod... [11:18:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T364299)', diff saved to https://phabricator.wikimedia.org/P63760 and previous config saved to /var/cache/conftool/dbconfig/20240531-111809-marostegui.json [11:18:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [11:18:17] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:18:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [11:18:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T364299)', diff saved to https://phabricator.wikimedia.org/P63761 and previous config saved to /var/cache/conftool/dbconfig/20240531-111833-marostegui.json [11:18:43] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:53] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:20:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:21:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:22:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:22:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P63762 and previous config saved to /var/cache/conftool/dbconfig/20240531-112227-ladsgroup.json [11:23:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:24:05] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [11:25:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:25:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:26:59] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2039.codfw.wmnet with OS bookworm [11:27:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:28:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:33:05] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Include vlans with defined IRB int in device vlans even if no port present - https://phabricator.wikimedia.org/T366348#9849839 (cmooney) [11:34:59] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 51 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:37:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P63763 and previous config saved to /var/cache/conftool/dbconfig/20240531-113735-ladsgroup.json [11:37:38] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:37:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:39:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:39:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:41:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:41:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:42:05] PROBLEM - Recursive DNS 
on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [11:42:06] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2038.codfw.wmnet with OS bookworm [11:43:03] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [11:43:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:43:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:45:44] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:45:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:46:54] !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1039.eqiad.wmnet with OS bookworm [11:47:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:47:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:49:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:49:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:50:11] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [11:50:46] (PS1) Marostegui: db1209: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1037774 [11:51:09] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success 
https://wikitech.wikimedia.org/wiki/DNS [11:51:47] (CR) Marostegui: [C:+2] db1209: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1037774 (owner: Marostegui) [11:51:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:51:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1209.eqiad.wmnet with OS bookworm [11:52:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:52:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T352010)', diff saved to https://phabricator.wikimedia.org/P63764 and previous config saved to /var/cache/conftool/dbconfig/20240531-115244-ladsgroup.json [11:52:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:52:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:53:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:53:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:54:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:56:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:56:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:58:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:58:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:00:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:00:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:00:32] !log 
jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2038.codfw.wmnet with reason: host reimage [12:03:19] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2038.codfw.wmnet with reason: host reimage [12:04:56] (PS1) Marostegui: Revert "db1209: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1037748 [12:05:44] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage [12:06:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:06:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:07:03] PROBLEM - NTP peers on dns5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [12:07:53] RECOVERY - NTP peers on dns5003 is OK: NTP OK: Offset -0.000368151 secs https://wikitech.wikimedia.org/wiki/NTP [12:08:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:08:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:28:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037778 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [12:30:20] (CR) Arnaudb: [C:+1] varnish: mitigate a hotlink causing issues [puppet] - https://gerrit.wikimedia.org/r/1037779 (owner: Hnowlan) [12:30:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1209.eqiad.wmnet with OS bookworm [12:30:43] (CR) Alexandros Kosiaris: [C:+1] varnish: mitigate a hotlink causing 
issues [puppet] - https://gerrit.wikimedia.org/r/1037779 (owner: Hnowlan) [12:30:51] (CR) Hnowlan: [C:+2] varnish: mitigate a hotlink causing issues [puppet] - https://gerrit.wikimedia.org/r/1037779 (owner: Hnowlan) [12:31:09] (CR) Fabfur: [C:+1] varnish: mitigate a hotlink causing issues [puppet] - https://gerrit.wikimedia.org/r/1037779 (owner: Hnowlan) [12:31:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:31:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:33:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:33:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:33:29] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:33:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:36:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:36:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:37:45] FIRING: [4x] Primary outbound port utilisation over 80% #page: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:39:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2173', diff saved to https://phabricator.wikimedia.org/P63766 and previous config saved to /var/cache/conftool/dbconfig/20240531-123903-root.json [12:39:53] RESOLVED: 
DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [12:40:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:40:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:40:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63767 and previous config saved to /var/cache/conftool/dbconfig/20240531-124058-root.json [12:41:21] (PS5) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [12:42:45] RESOLVED: [3x] Primary outbound port utilisation over 80% #page: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:42:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [12:42:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:42:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:44:59] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 20 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:45:43] (CR) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [12:48:59] (PS1) DCausse: CirrusSearch: add wgCirrusSearchIndexFieldsToCleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/1037783 [12:49:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:49:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:52:18] (PS6) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [12:53:37] ebernhardson: are these deploys ^ related to your runs of UpdateSearchIndexConfig.php? [12:53:56] (CR) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [12:54:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:54:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:56:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63768 and previous config saved to /var/cache/conftool/dbconfig/20240531-125604-root.json [12:56:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:56:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:57:48] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153#9849984 (MatthewVernon) p:Triage→Medium [12:58:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [12:58:23] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [13:00:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:00:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:01:28] (PS3) Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) [13:02:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:02:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:04:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:04:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:05:43] (PS1) Ayounsi: [WIP] Prepare for netbox-dev [puppet] - https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) [13:07:44] (PS2) Ayounsi: [WIP] Prepare for netbox-dev [puppet] - https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) [13:07:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:07:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:11:05] (CR) Kosta Harlan: [C:-1] "Waiting for internal product/legal approval but in the meantime, review of approach is welcome" [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [13:11:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P63769 and previous config saved to /var/cache/conftool/dbconfig/20240531-131110-root.json [13:11:55] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037784 
(https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:14:17] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:16:10] (CR) Brouberol: [C:+2] Leverage the internal datahub kafka registry [deployment-charts] - https://gerrit.wikimedia.org/r/1037738 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol) [13:16:48] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:17:40] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [13:19:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:19:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:21:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:21:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:23:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:23:24] (PS1) MVernon: hiera: add role_contacts to controller & storage roles [puppet] - https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) [13:23:26] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [13:25:11] (PS1) MVernon: hiera: update authorized_keys_file on cephadm rgw nodes [puppet] - https://gerrit.wikimedia.org/r/1037792 (https://phabricator.wikimedia.org/T279621) [13:25:22] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:25:29] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:25:50] ops-eqiad, SRE, DC-Ops, serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9850026 (kamila) @VRiley-WMF I am in UTC+2, so US mornings are best for me. Would Tuesday work for you? Thank you! [13:26:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P63770 and previous config saved to /var/cache/conftool/dbconfig/20240531-132616-root.json [13:27:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:27:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:28:01] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [13:29:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:29:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:30:22] (PS1) Brouberol: datahub-next: flip env var to true to support internal registry [deployment-charts] - https://gerrit.wikimedia.org/r/1037793 (https://phabricator.wikimedia.org/T363461) [13:30:54] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037792 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [13:30:57] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) 
(owner: MVernon) [13:31:40] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2702/co" [puppet] - https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: Brennen Bearnes) [13:31:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:31:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:32:15] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [13:34:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:34:29] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:35:13] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9850037 (cmooney) @papaul @Jhancock.wm I noticed that the leaf in rack d8 is reporting one of it's power supplys down: ` cmooney@lsw1-d8-codfw> show sy... 
[13:36:33] RESOLVED: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:36:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:36:37] (03PS2) 10MVernon: hiera: add role_contacts to controller & storage roles [puppet] - 10https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) [13:36:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:37:21] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:37:54] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [13:38:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9850039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [13:38:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:38:45] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:40:23] (03PS3) 10MVernon: hiera: add role_contacts to controller & storage roles [puppet] - 10https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) [13:40:35] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:40:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:40:44] FIRING: [2x] 
SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:41:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P63771 and previous config saved to /var/cache/conftool/dbconfig/20240531-134122-root.json [13:41:41] (03CR) 10Arnaudb: [C:03+1] hiera: add role_contacts to controller & storage roles [puppet] - 10https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:41:55] (03CR) 10Arnaudb: [C:03+1] hiera: update authorized_keys_file on cephadm rgw nodes [puppet] - 10https://gerrit.wikimedia.org/r/1037792 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:42:37] (03CR) 10MVernon: [C:03+2] hiera: add role_contacts to controller & storage roles [puppet] - 10https://gerrit.wikimedia.org/r/1037791 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:42:39] (03CR) 10MVernon: [C:03+2] hiera: update authorized_keys_file on cephadm rgw nodes [puppet] - 10https://gerrit.wikimedia.org/r/1037792 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:42:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:42:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:44:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:44:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:46:47] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [13:46:49] 
!log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:46:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:47:46] (03PS2) 10Brouberol: datahub-next: flip env var to true to support internal registry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037793 (https://phabricator.wikimedia.org/T363461) [13:48:43] RESOLVED: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:08] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [13:50:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:50:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:51:40] (03CR) 10Jelto: [V:03+1] "looks mostly good. Some questions in-line. If you need a new secret in private puppet let me know. 
But I'm not 100% sure what is re-used f" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [13:52:16] (03CR) 10Stevemunene: [C:03+1] datahub-next: flip env var to true to support internal registry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037793 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [13:52:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:52:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:53:14] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Spicerack: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355 (10MoritzMuehlenhoff) 03NEW [13:53:32] !log dcausse@deploy1002 Started deploy [airflow-dags/search@b2f7795]: search: fix NTripleGenerator arguments [13:53:53] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@b2f7795]: search: fix NTripleGenerator arguments (duration: 00m 21s) [13:53:54] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: update horizon release [puppet] - 10https://gerrit.wikimedia.org/r/1037609 (https://phabricator.wikimedia.org/T365096) (owner: 10Andrew Bogott) [13:54:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:54:32] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:56:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:56:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:56:28] (03CR) 10Brouberol: [C:03+2] datahub-next: flip env var to true to support internal registry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037793 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [13:56:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 100%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P63772 and previous config saved to /var/cache/conftool/dbconfig/20240531-135629-root.json [13:59:43] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:51] (03CR) 10JHathaway: [C:03+1] puppetserver::git::private: Use wrapper from puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1037778 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:59:52] (03PS1) 10Andrew Bogott: Horizon: update eqiad1 release [puppet] - 10https://gerrit.wikimedia.org/r/1037797 (https://phabricator.wikimedia.org/T365096) [14:02:41] (03CR) 10Andrew Bogott: [C:03+2] Horizon: update eqiad1 release [puppet] - 10https://gerrit.wikimedia.org/r/1037797 (https://phabricator.wikimedia.org/T365096) (owner: 10Andrew Bogott) [14:04:33] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:04:37] (03PS4) 10Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) [14:08:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:08:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:10:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:10:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:14:39] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9850138 (10ssingh) >>! 
In T366193#9849495, @ayounsi wrote: > That's quite interesting seeing the variation of tradeoffs, and can be quite (an important) rabbithole. Is the goal to figure it out before anycasting ns1, or fir... [14:14:44] (03PS6) 10Elukey: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:15:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:15:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:18:23] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:19:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:19:33] RESOLVED: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:19:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:19:46] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9850142 (10cmooney) @ABran-WMF thanks for creating all the tasks! Really appreciated, I did not expect to come back and see that :) >>! In T348977#9837047, @MatthewVe... 
[14:21:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:21:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:24:16] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main1010.eqiad.wmnet with OS bullseye [14:24:43] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9850163 (10cmooney) [14:25:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:25:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:25:56] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9850164 (10cmooney) [14:27:24] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9850172 (10VRiley-WMF) After some more troubleshooting, we reset the iDRAC to factory settings and now we can log into the machine via iDRAC. [14:27:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9850169 (10cmooney) p:05Triage→03Medium a:05MatthewVernon→03cmooney [14:27:49] (03CR) 10Klausman: [C:03+2] revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:27:53] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9850174 (10cmooney) @ABran-WMF thanks for creating all the tasks! 
Really appreciated, I did not expect to come back and see that :) >>! In T348977#9837047, @MatthewVe... [14:28:06] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9850165 (10cmooney) p:05Triage→03Medium a:05MatthewVernon→03cmooney [14:28:42] (03Merged) 10jenkins-bot: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [14:29:24] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:29:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:30:19] (03PS5) 10Elukey: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) [14:32:01] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:32:57] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9850188 (10Clement_Goubert) Thank you @VRiley-WMF ! Tell us when we can bring it back in the cluster. 
[14:33:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:33:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:35:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:35:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:37:10] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [14:37:25] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:38:41] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:39:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:39:52] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Spicerack: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9850216 (10elukey) [14:40:05] (03PS1) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [14:40:20] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [14:41:14] (03CR) 10Papaul: [C:03+2] Add lsw1-c1 to devices.yaml to test homer after ZTP [homer/public] - 10https://gerrit.wikimedia.org/r/1037651 
(https://phabricator.wikimedia.org/T360789) (owner: 10Papaul) [14:41:54] (03Merged) 10jenkins-bot: Add lsw1-c1 to devices.yaml to test homer after ZTP [homer/public] - 10https://gerrit.wikimedia.org/r/1037651 (https://phabricator.wikimedia.org/T360789) (owner: 10Papaul) [14:42:55] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:43:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:43:48] (03PS5) 10Hnowlan: kubernetes: rename and repurpose 5 api appservers as k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) [14:45:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:45:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:46:03] (03CR) 10JHathaway: "sorry, I didn't realize there was a cloud test instance, does inbound mail work in the cloud test instance? 
What is the best way to test t" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [14:47:09] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [14:47:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [14:47:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:47:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:47:33] FIRING: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:49:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:49:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:52:22] (03PS2) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [14:52:33] RESOLVED: KubernetesCalicoDown: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:54:21] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9850229 (10VRiley-WMF) @Clement_Goubert I was just able to provision the server and I believe you should be able to add it. Let us know if there is any other issues with it! 
[14:54:25] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9850230 (10elukey) 05Open→03Resolved [14:55:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:41] PROBLEM - WDQS SPARQL on wdqs1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:56:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:56:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:57:30] FIRING: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:41] RECOVERY - WDQS SPARQL on wdqs1021 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:59:48] (03CR) 10Clément Goubert: [C:03+1] kubernetes: rename and repurpose 5 api appservers as k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028840 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [15:02:30] RESOLVED: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 18.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:05:37] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:05:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:05:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:07:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:08:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:08:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:10:38] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:10:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:17:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:17:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:24:08] (03PS1) 10Andrew Bogott: cloud-vps: turn off 
report storage on cloud-vps puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/1037812 (https://phabricator.wikimedia.org/T366357) [15:25:37] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9850325 (10akosiaris) [15:26:17] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9850328 (10akosiaris) [15:30:13] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360 (10ssingh) 03NEW [15:30:16] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9850347 (10ssingh) p:05Triage→03Medium [15:30:51] (03PS1) 10Clément Goubert: mw-api-int: Add 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037814 [15:30:54] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:31:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:31:23] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade ssw1-e1-eqiad to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 (10cmooney) 03NEW p:05Triage→03Medium [15:32:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T364299)', diff saved to https://phabricator.wikimedia.org/P63773 and previous config saved to /var/cache/conftool/dbconfig/20240531-153220-marostegui.json [15:32:26] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:36:06] !log homer 'cr*eqiad*' commit 'T363086' [15:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:12] T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086 
[15:36:30] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9850415 (10ssingh) To clarify, there is //no change// to the configuration of the DNS hosts themselves and the peer list there. This is only for the consumers of `P:systemd::tim... [15:37:07] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9850433 (10Clement_Goubert) Thanks. For information tracking, I put the server back to `Active` in netbox, ran the `sre.dns.netbox` cookbook, running `homer 'cr*eqiad*' commit` right now to resto... [15:38:01] (03CR) 10Clément Goubert: [C:03+2] mw-api-int: Add 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037814 (owner: 10Clément Goubert) [15:38:52] (03Merged) 10jenkins-bot: mw-api-int: Add 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037814 (owner: 10Clément Goubert) [15:39:06] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:39:17] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:39:27] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:39:35] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:43:17] !log pooling and uncordoning parse1002 - T363086 [15:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:22] T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086 [15:44:15] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=parse1002.eqiad.wmnet,cluster=kubernetes,service=kubesvc [15:44:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:44:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:46:20] !log @deploy1002 helmfile [codfw] 
START helmfile.d/services/cirrus-streaming-updater: apply [15:46:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:46:38] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9850474 (10Clement_Goubert) 05Open→03Resolved Server back in the cluster, resolving. Thanks again :) [15:47:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P63774 and previous config saved to /var/cache/conftool/dbconfig/20240531-154728-marostegui.json [15:50:44] FIRING: [17x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:48] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:53] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers mw1380.eqiad.wmnet, mw1367.eqiad.wmnet, mw1434.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1463.eqiad.wmnet, mw1424.eqiad.wmnet, kubernetes1012.eqiad.wmnet, mw1465.eqiad.wmnet, mw1466.eqiad.wmnet, mw1369.eqiad.wmnet, mw1419.eqiad.wmnet, mw1469.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1048.eqiad.wmnet, parse1012.eqiad.wm [15:50:53] 68.eqiad.wmnet, mw1431.eqiad.wmnet, kubernetes1008.eqiad.wmnet, parse1021.eqiad.wmnet, kubernetes1056.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw1355.eqiad.wmnet, mw1451.eqiad.wmnet, kubernetes1032.eqiad.wmnet, kubernetes1026.eqiad.wmnet, mw1409.eqiad.wmnet, mw1452.eqiad.wmnet, mw1491.eqiad.wmnet, mw1470.eqiad.wmnet, parse1014.eqiad.wmnet, 
parse1007.eqiad.wmnet, mw1384.eqiad.wmnet, mw1390.eqiad.wmnet, mw1476.eqiad.wmnet, mw14 [15:50:53] wmnet, kubernetes1016.eqiad.wmnet, mw1481.eqiad.wmnet, mw1477.eqiad.wmnet, mw1467.eqiad.wmnet, mw1385.eqiad.wmnet, kubernetes1051.eqiad.wmnet, mw1361.eqiad.wmnet, mw1375.eqiad.wmnet, pa https://wikitech.wikimedia.org/wiki/PyBal [15:50:56] ouch [15:51:01] oi oi oi [15:51:05] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1010.eqiad.wmnet, mw1492.eqiad.wmnet, mw1442.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1405.eqiad.wmnet, kubernetes1050.eqiad.wmnet, mw1435.eqiad.wmnet, mw1393.eqiad.wmnet, parse1005.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1425.eqiad.wmnet, mw1465.eqiad.wmnet, kubernetes1018. [15:51:05] et, kubernetes1059.eqiad.wmnet, mw1469.eqiad.wmnet, mw1356.eqiad.wmnet, kubernetes1056.eqiad.wmnet, kubernetes1035.eqiad.wmnet, kubernetes1036.eqiad.wmnet, mw1368.eqiad.wmnet, parse1014.eqiad.wmnet, mw1457.eqiad.wmnet, mw1476.eqiad.wmnet, mw1495.eqiad.wmnet, mw1449.eqiad.wmnet, mw1385.eqiad.wmnet, mw1382.eqiad.wmnet, mw1452.eqiad.wmnet, mw1494.eqiad.wmnet, mw1471.eqiad.wmnet, mw1473.eqiad.wmnet, mw1361.eqiad.wmnet, mw1485.eqiad.wmnet, mw [15:51:05] d.wmnet, kubernetes1038.eqiad.wmnet, kubernetes1029.eqiad.wmnet, parse1016.eqiad.wmnet, kubernetes1052.eqiad.wmnet, kubernetes1011.eqiad.wmnet, mw1460.eqiad.wmnet, mw1353.eqiad.wmnet, m https://wikitech.wikimedia.org/wiki/PyBal [15:51:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:51:53] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:52:05] RECOVERY - PyBal backends 
health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:53:43] RESOLVED: [17x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:38] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9850519 (10ssingh) [15:54:50] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9850520 (10cmooney) I suspect Brandon may be more versed in the ways of NTP than myself, and could advise if there are any pitfalls on the protocol side. But from my own unders... [15:56:35] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade ssw1-e1-eqiad to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9850549 (10cmooney) [15:56:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:57:19] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9850550 (10BBlack) [15:57:47] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:57:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:58:51] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade ssw1-e1-eqiad to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9850564 (10cmooney) [15:58:55] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - 
https://phabricator.wikimedia.org/T348977#9850565 (10cmooney) [15:59:50] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9850567 (10cmooney) [16:00:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:00:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:01:35] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9850574 (10BBlack) Yeah, I've looked at this from the deep-ntp-details POV and it's all pretty sane. We're in alignment with the recommendations in https://www.rfc-editor.org/r... [16:02:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P63775 and previous config saved to /var/cache/conftool/dbconfig/20240531-160236-marostegui.json [16:06:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:06:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:09:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:09:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:11:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:11:18] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:13:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:13:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:15:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:15:12] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [16:17:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:17:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:17:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T364299)', diff saved to https://phabricator.wikimedia.org/P63777 and previous config saved to /var/cache/conftool/dbconfig/20240531-161744-marostegui.json [16:17:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [16:17:50] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:18:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [16:18:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T364299)', diff saved to https://phabricator.wikimedia.org/P63778 and previous config saved to /var/cache/conftool/dbconfig/20240531-161807-marostegui.json [16:23:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:23:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:24:42] (03PS20) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [16:25:35] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9850699 (10cmooney) [16:25:57] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9850702 (10cmooney) [16:26:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:26:08] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [16:26:30] FIRING: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:34] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9850704 (10cmooney) [16:27:53] (03PS1) 10Cwhite: profile: drop all logs from datahub-mae-consumer-main [puppet] - 10https://gerrit.wikimedia.org/r/1037495 (https://phabricator.wikimedia.org/T363856) [16:29:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:29:12] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:30:02] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9850726 (10cmooney) [16:31:10] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9850733 (10cmooney) [16:31:30] FIRING: [3x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:32:13] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:32:17] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:34:48] (03PS1) 10Cwhite: logstash: drop datahub-mae-consumer-main logs [puppet] - 10https://gerrit.wikimedia.org/r/1037496 (https://phabricator.wikimedia.org/T363856) [16:34:55] 10ops-eqiad, 06SRE, 06DC-Ops, 
10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9850746 (10VRiley-WMF) cloudcephosd1035 Rack: C8 U 28 CableID: 5335 Port: 20 cloudcephosd1036 Rack: D5 U 18 CableID: 5337 Port: 18 cloudcephosd1037 Rack... [16:35:16] (03CR) 10Cwhite: [C:03+2] profile: drop all logs from datahub-mae-consumer-main [puppet] - 10https://gerrit.wikimedia.org/r/1037495 (https://phabricator.wikimedia.org/T363856) (owner: 10Cwhite) [16:36:30] RESOLVED: [3x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:30] (03CR) 10Cwhite: [C:03+2] logstash: drop datahub-mae-consumer-main logs [puppet] - 10https://gerrit.wikimedia.org/r/1037496 (https://phabricator.wikimedia.org/T363856) (owner: 10Cwhite) [16:41:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:41:25] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:06:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:06:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:09:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:09:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:11:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:11:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:17:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:17:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:19:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:19:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:19:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:19:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:21:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:21:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:23:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:23:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:25:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:25:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:31:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63780 and previous config saved to /var/cache/conftool/dbconfig/20240531-173101-root.json [17:32:31] (03CR) 10BCornwall: [C:03+1] service: update labweb/cloudweb conftool pool name [puppet] - 10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [17:33:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:33:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[17:35:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:35:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:37:24] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:37:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:39:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:39:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:40:44] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:42:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:44:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:44:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:45:16] cdanis: scap sync-prod-k8s report for future use: https://logstash.wikimedia.org/goto/1b0589dab2f22612e9044ffa9d62ed19 [17:45:44] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63781 and previous config saved to /var/cache/conftool/dbconfig/20240531-174607-root.json [17:46:45] !log @deploy1002 helmfile 
[codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:46:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:47:27] dancy: thanks! I figured out a pretty similar query yesterday [17:48:31] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9851043 (10WDoranWMF) Approved. [17:48:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:48:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:50:44] RESOLVED: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:50:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:56:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:56:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:01:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63782 and previous config saved to /var/cache/conftool/dbconfig/20240531-180113-root.json [18:01:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:01:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:03:07] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9851085 (10cmooney) >>! 
In T366193#9849495, @ayounsi wrote: > That's quite interesting seeing the variation of tradeoffs, and can be quite (an important) rabbithole. Is the goal to figure it out before anycasting ns1, or fi... [18:03:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:03:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:07:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:07:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:09:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:09:37] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:11:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:11:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:13:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:13:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:16:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P63785 and previous config saved to /var/cache/conftool/dbconfig/20240531-181619-root.json [18:17:38] (03Abandoned) 10JHathaway: vinyl rake task [puppet] - 10https://gerrit.wikimedia.org/r/754116 (owner: 10JHathaway) [18:18:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:18:09] (03Abandoned) 10JHathaway: Rename system::role to base::add_motd_role [puppet] - 10https://gerrit.wikimedia.org/r/764884 (owner: 10JHathaway) [18:18:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:18:31] (03PS2) 10JHathaway: wikipedia.org 
spf: indicate mail is not sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403) [18:22:11] PROBLEM - Disk space on karapace1002 is CRITICAL: DISK CRITICAL - free space: / 154 MB (0% inode=93%): /tmp 154 MB (0% inode=93%): /var/tmp 154 MB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=karapace1002&var-datasource=eqiad+prometheus/ops [18:22:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:22:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:22:55] (03CR) 10Pppery: [C:04-1] "I have done all of these changes on the translatewiki.net side, and also added some more info about where each message is used. This will " [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper) [18:24:40] (03CR) 10JHathaway: [C:03+1] "Dallas, would love a review, and any insight on who else to include." [dns] - 10https://gerrit.wikimedia.org/r/1037154 (https://phabricator.wikimedia.org/T211403) (owner: 10JHathaway) [18:24:47] (03CR) 10JHathaway: "Dallas, would love a review, and any insight on who else to include." 
[dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403) (owner: 10JHathaway) [18:24:54] (03PS2) 10JHathaway: wikipedia.org dmarc: change to quarantine [dns] - 10https://gerrit.wikimedia.org/r/1037154 (https://phabricator.wikimedia.org/T211403) [18:26:26] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:26:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:29:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:30:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:31:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P63787 and previous config saved to /var/cache/conftool/dbconfig/20240531-183125-root.json [18:31:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:31:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:33:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:33:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:35:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:35:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:40:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:40:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:40:51] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:41:05] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:43:09] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9851262 (10MMiller_WMF) I am Sonja's manager and I approve. [18:46:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P63788 and previous config saved to /var/cache/conftool/dbconfig/20240531-184632-root.json [18:46:39] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9851280 (10jhathaway) [18:49:57] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:50:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:50:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:50:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:52:59] (03CR) 10Dzahn: [C:03+2] admin: add Joely Rooke (WMDE) to ldap_only (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/1037603 (https://phabricator.wikimedia.org/T366145) (owner: 10Dzahn) [18:53:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:53:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:54:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:54:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:55:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:55:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:55:15] !log LDAP - 
added uid joelyrookewmde to groups wmde and nda (T366145) [18:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:26] T366145: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145 [18:56:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P63789 and previous config saved to /var/cache/conftool/dbconfig/20240531-185613-ladsgroup.json [18:56:38] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9851313 (Dzahn) Open→Resolved a:Dzahn This is done. You have been added to the LDAP groups "wmde" and "nda" like other WMDE employees. Feel free to... [18:56:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:56:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:57:32] !log Phabricator - added 'JoelyRooke-WMDE (Jo)' to group WMF-NDA (https://phabricator.wikimedia.org/project/profile/61/) (T366145) [18:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:58:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:59:42] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9851344 (Dzahn) You have also been added to the "WMF-NDA" group here on Phabricator which lets you see private tickets. (https://phabricator.wikimedia.org/proje... 
[19:00:13] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:00:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:00:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:01:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P63790 and previous config saved to /var/cache/conftool/dbconfig/20240531-190138-root.json [19:02:19] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:02:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:02:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:03:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:03:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:06:41] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:06:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:09:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:09:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:10:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:10:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:11:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: Maint over', diff saved to 
https://phabricator.wikimedia.org/P63791 and previous config saved to /var/cache/conftool/dbconfig/20240531-191119-ladsgroup.json [19:12:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:12:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:14:43] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9851428 (10Dzahn) 05Resolved→03Open Sorry, I made a mistake thinking the NDA was already done while it's still WIP. Removed from the groups until it's done. [19:15:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:15:45] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:16:56] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9851431 (10BBlack) Yes, from a resiliency POV, in some senses keeping unicasts in the mix is an answer (and it's the answer we currently rely on). In a world with only very smart and capable resolvers, the simplest answer... 
[19:17:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:17:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:19:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:19:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:20:27] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:20:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:21:19] (03CR) 10Dzahn: [C:03+2] "LDAP group change reverted for now since NDA wasn't actually ready yet. But soon it will be, so leaving the puppet part." 
[puppet] - 10https://gerrit.wikimedia.org/r/1037603 (https://phabricator.wikimedia.org/T366145) (owner: 10Dzahn) [19:21:57] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:22:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:22:21] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:23:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:24:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:26:17] (03PS1) 10Btullis: Temporarily disable XML dumps on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) [19:26:22] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:26:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P63792 and previous config saved to /var/cache/conftool/dbconfig/20240531-192625-ladsgroup.json [19:26:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:28:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:28:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:30:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:30:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:30:53] (03PS2) 10Btullis: Temporarily disable XML dumps on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) [19:32:22] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2704/co" [puppet] - https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) (owner: Btullis) [19:33:22] (CR) Dr0ptp4kt: [C:+1] Temporarily disable XML dumps on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) (owner: Btullis) [19:37:08] (PS3) Btullis: Temporarily disable XML dumps on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) [19:40:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [19:40:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2204.codfw.wmnet with reason: Maintenance [19:40:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T352010)', diff saved to https://phabricator.wikimedia.org/P63793 and previous config saved to /var/cache/conftool/dbconfig/20240531-194037-ladsgroup.json [19:40:40] (CR) Dzahn: "The FQDN of the test instance is (currently): phabricator-bullseye.devtools.eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway) [19:40:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:40:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:40:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:41:31] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2705/co" [puppet] - https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) (owner: Btullis) [19:41:32] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P63794 and previous config saved to /var/cache/conftool/dbconfig/20240531-194131-ladsgroup.json [19:46:03] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [19:47:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:47:18] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:48:10] (03PS1) 10RLazarus: Release v0.0.5 [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037846 [19:48:32] (03CR) 10Xcollazo: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis) [19:48:39] (03CR) 10Btullis: [V:03+1 C:03+2] Temporarily disable XML dumps on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis) [19:48:55] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 3.413 second response time https://wikitech.wikimedia.org/wiki/Karapace [19:49:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:49:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:51:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:51:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:51:35] (03CR) 10Dzahn: "can we just add a class parameter that is true by default and if false skips the whole thing.. 
then we set it to false in cloud.yaml and i" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [19:52:16] (03CR) 10RLazarus: [C:03+2] Release v0.0.5 [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037846 (owner: 10RLazarus) [19:52:52] (03PS3) 10JHathaway: phab: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) [19:53:00] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [19:53:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:53:25] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:54:51] (03Merged) 10jenkins-bot: Release v0.0.5 [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037846 (owner: 10RLazarus) [19:56:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:56:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:00:21] !log sudo -i reprepro -C main include buster-wikimedia /home/rzl/httpbb/buster/httpbb_0.0.5-1_amd64.changes [20:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:35] !log sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/buster/httpbb_0.0.5-1+deb11u1_amd64.changes [20:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9851493 (10colewhite) [20:05:17] (03PS2) 10Cwhite: admin: add sperry-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036595 (https://phabricator.wikimedia.org/T365766) [20:05:29] (03CR) 10CI reject: 
[V:04-1] admin: add sperry-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036595 (https://phabricator.wikimedia.org/T365766) (owner: 10Cwhite) [20:06:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:06:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:08:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:08:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:10:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:10:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:11:51] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:11:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:12:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:12:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:12:54] (03PS6) 10Cwhite: admin: add sperry-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036595 (https://phabricator.wikimedia.org/T365766) [20:14:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:14:37] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:15:17] (03PS1) 10Bking: blazegraph: Add alert for maxlag [alerts] - 10https://gerrit.wikimedia.org/r/1037850 [20:16:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:16:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:18:43] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:19:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:20:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:20:31] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:20:43] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:21:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:22:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:24:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:24:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:26:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:26:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:26:46] (03PS7) 10Dzahn: gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) [20:26:46] (03CR) 10Dzahn: [C:03+1] "no change in prod and fixes puppet failure in cloud with this version: https://puppet-compiler.wmflabs.org/output/1036771/2700/" [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [20:28:27] (03CR) 10Dzahn: "here is a similar problem and the approach I mean: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036771" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 
(https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [20:28:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:28:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:29:07] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [20:29:09] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:29:21] (03CR) 10Dzahn: [C:03+2] "going ahead since it's noop in prod - so fix in testing only" [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [20:30:01] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 2.772 second response time https://wikitech.wikimedia.org/wiki/Karapace [20:30:03] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:30:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:30:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:32:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:32:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:34:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:34:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:35:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T364299)', diff saved to https://phabricator.wikimedia.org/P63795 and previous config saved to /var/cache/conftool/dbconfig/20240531-203514-marostegui.json 
[20:35:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [20:36:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:37:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:38:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:38:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:40:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:40:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:46:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:46:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:47:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:47:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:48:05] (03CR) 10Dzahn: [C:03+2] "noop in prod confirmed - and in test it's on to the next issue but not this one anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [20:49:17] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:49:23] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:50:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P63796 and previous config saved to /var/cache/conftool/dbconfig/20240531-205022-marostegui.json [20:52:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:52:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: 
apply [20:54:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:54:18] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:56:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:56:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:57:57] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:58:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:02:52] (03PS2) 10Andrew Bogott: Keystone: remove hack ensuring that project_id == project_name [puppet] - 10https://gerrit.wikimedia.org/r/988052 (https://phabricator.wikimedia.org/T343158) [21:03:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:03:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:03:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:03:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:04:10] (03CR) 10Andrew Bogott: [C:03+2] Keystone: remove hack ensuring that project_id == project_name [puppet] - 10https://gerrit.wikimedia.org/r/988052 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [21:05:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:05:03] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:05:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:05:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:05:13] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Karapace [21:05:15] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:05:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P63797 and previous config saved to /var/cache/conftool/dbconfig/20240531-210530-marostegui.json [21:07:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:07:51] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:08:26] (03CR) 10Cwhite: [C:03+2] admin: add sperry-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036595 (https://phabricator.wikimedia.org/T365766) (owner: 10Cwhite) [21:09:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:09:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:12:07] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Karapace [21:12:07] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:04] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9851714 (10colewhite) 05Open→03Resolved a:03colewhite The group membership and ldap change has been deployed. Please feel free to reopen if... 
[21:16:34] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Mvolz - https://phabricator.wikimedia.org/T366088#9851723 (10colewhite) 05Open→03Stalled [21:17:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:17:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:17:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9851727 (10colewhite) [21:17:50] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9851728 (10colewhite) a:05colewhite→03None [21:18:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9851734 (10colewhite) a:05colewhite→03None [21:19:03] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9851735 (10colewhite) 05Open→03Stalled [21:19:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:19:17] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:19:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9851724 (10colewhite) 05Open→03Stalled a:05colewhite→03None [21:20:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T364299)', diff saved to https://phabricator.wikimedia.org/P63798 and previous config saved to /var/cache/conftool/dbconfig/20240531-212038-marostegui.json [21:20:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
6:00:00 on db1228.eqiad.wmnet with reason: Maintenance [21:20:44] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:20:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1228.eqiad.wmnet with reason: Maintenance [21:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T364299)', diff saved to https://phabricator.wikimedia.org/P63799 and previous config saved to /var/cache/conftool/dbconfig/20240531-212101-marostegui.json [21:21:53] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:21:59] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:23:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:23:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T364069)', diff saved to https://phabricator.wikimedia.org/P63800 and previous config saved to /var/cache/conftool/dbconfig/20240531-212356-marostegui.json [21:24:01] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:24:05] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:24:09] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9851752 (10Dzahn) Added Sonja to the WMF-NDA group here in Phabricator for access to private tickets (https://phabricator.wikimedia.org/project/me... 
[21:25:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:26:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:28:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:28:07] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:30:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:30:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:31:36] (03PS1) 10Dzahn: gerrit: add an AWS IP to list of misbehaving bots [puppet] - 10https://gerrit.wikimedia.org/r/1037856 (https://phabricator.wikimedia.org/T362401) [21:32:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:32:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:33:45] (03PS2) 10Dzahn: gerrit: add Tencent IP to list of misbehaving bots [puppet] - 10https://gerrit.wikimedia.org/r/1037856 (https://phabricator.wikimedia.org/T362401) [21:34:17] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:34:17] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [21:34:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:34:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:34:39] (03PS3) 10Dzahn: gerrit: add Tencent IP to list of misbehaving bots [puppet] - 10https://gerrit.wikimedia.org/r/1037856 [21:36:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:36:17] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [21:36:19] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 9.482 second response time https://wikitech.wikimedia.org/wiki/Karapace [21:36:19] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:37:00] (03PS4) 10Dzahn: gerrit: add Tencent IP to list of misbehaving bots [puppet] - 10https://gerrit.wikimedia.org/r/1037856 [21:38:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:38:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:39:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P63801 and previous config saved to /var/cache/conftool/dbconfig/20240531-213904-marostegui.json [21:39:49] (03PS1) 10BCornwall: Move ncmonitor credentials to its own profile [labs/private] - 10https://gerrit.wikimedia.org/r/1037857 [21:40:44] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:41:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:44:23] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [21:44:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:44:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:46:14] (03CR) 10Dzahn: "Don't know 
why the "Unable to find fact file" in the compiler, since it works with other instances in the same project. But it probably me" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [21:46:15] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Karapace [21:46:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:46:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:48:46] (03CR) 10Dzahn: "lgtm! just also needed a Phab group to go with it because of https://phabricator.wikimedia.org/T290605" [puppet] - 10https://gerrit.wikimedia.org/r/1036595 (https://phabricator.wikimedia.org/T365766) (owner: 10Cwhite) [21:48:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:48:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:49:19] (03CR) 10Dzahn: [C:03+2] gerrit: add Tencent IP to list of misbehaving bots [puppet] - 10https://gerrit.wikimedia.org/r/1037856 (owner: 10Dzahn) [21:50:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9851813 (10colewhite) [21:50:50] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:50:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:51:35] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9851817 (10colewhite) Pinging @JayCano for manager approval. Pinging one of @odimitrijevic, @Milimetric, @WDoranWMF, @Ahoelzl for Analytics team approval. 
[21:52:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:52:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:52:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:52:57] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:53:04] (03PS1) 10Cwhite: admin: add tchanders to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1037497 (https://phabricator.wikimedia.org/T366351) [21:54:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:54:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:54:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P63802 and previous config saved to /var/cache/conftool/dbconfig/20240531-215412-marostegui.json [21:54:54] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:55:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:55:50] (03Abandoned) 10Aklapper: Remove FIXME comment for waxing and waning moon phases [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper) [21:57:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:57:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:57:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:57:34] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:59:24] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply 
[21:59:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:01:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:01:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:01:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:01:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:03:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:03:07] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:03:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:03:37] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:04:33] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [22:05:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:05:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:05:25] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Karapace [22:05:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:05:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:07:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:07:54] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:08:39] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Active - 
Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:09:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T364069)', diff saved to https://phabricator.wikimedia.org/P63803 and previous config saved to /var/cache/conftool/dbconfig/20240531-220920-marostegui.json [22:09:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance [22:09:26] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:09:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance [22:10:54] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:11:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:12:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:13:02] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:14:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:15:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:16:58] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9851878 (10Dzahn) 05Open→03In progress [22:17:08] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9851879 (10Dzahn) p:05Triage→03High [22:18:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:19:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:20:04] (03Abandoned) 10Dzahn: gerrit/test: set lfs sync dest host to itself [puppet] - 
10https://gerrit.wikimedia.org/r/1037574 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [22:20:45] (03CR) 10Dzahn: [gerrit] Add rsync job for lfs sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [22:21:33] (03PS1) 10Scott French: eventstreams: add securityContext to all production containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037861 (https://phabricator.wikimedia.org/T362978) [22:23:41] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [22:24:37] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 6.042 second response time https://wikitech.wikimedia.org/wiki/Karapace [22:26:21] (03PS2) 10Scott French: similar-users: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) [22:27:27] !log nshahquinn-wmf@deploy1002 Started deploy [airflow-dags/analytics_product@f0284c6]: (no justification provided) [22:27:34] !log nshahquinn-wmf@deploy1002 Finished deploy [airflow-dags/analytics_product@f0284c6]: (no justification provided) (duration: 00m 07s) [22:28:00] (03PS2) 10Scott French: chromium-render: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037196 (https://phabricator.wikimedia.org/T362978) [22:28:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:29:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: 
apply [22:29:43] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:29:43] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [22:30:12] (03PS3) 10Scott French: thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) [22:30:36] !log nshahquinn-wmf@deploy1002 Started deploy [airflow-dags/analytics_product@f0284c6]: (no justification provided) [22:30:40] !log nshahquinn-wmf@deploy1002 Finished deploy [airflow-dags/analytics_product@f0284c6]: (no justification provided) (duration: 00m 03s) [22:30:43] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:31:35] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Karapace [22:35:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:35:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:36:47] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [22:37:45] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:37:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:39:49] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:39:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:39:54] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [22:40:47] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:44:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:44:17] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:45:53] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:46:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:46:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:48:15] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:48:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:50:18] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:50:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:50:47] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 1.909 second response time https://wikitech.wikimedia.org/wiki/Karapace [22:50:47] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:52:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:52:26] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:54:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:54:29] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:56:26] !log @deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [22:56:31] SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9851913 (odimitrijevic) Approved. [22:56:32] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:59:55] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [23:00:51] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 4.792 second response time https://wikitech.wikimedia.org/wiki/Karapace [23:00:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:01:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:05:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:05:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:10:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:10:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:12:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:12:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:13:03] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:13:35] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:13:38] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:13:57] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:14:03] PROBLEM - karapace http server on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Karapace [23:14:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:14:46] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:16:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 911.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:16:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:16:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:17:03] PROBLEM - SSH on karapace1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:17:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:17:45] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:18:34] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:18:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:19:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:19:32] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:20:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:20:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[23:20:55] RECOVERY - SSH on karapace1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:20:55] RECOVERY - karapace http server on karapace1002 is OK: HTTP OK: HTTP/1.0 200 OK - 379 bytes in 0.911 second response time https://wikitech.wikimedia.org/wiki/Karapace [23:21:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 911.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:21:25] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:21:29] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:22:29] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:22:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:23:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:23:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:24:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:24:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:26:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:26:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:26:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:26:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:28:27] !log 
@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:28:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:30:31] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:30:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:32:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:32:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:34:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:34:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:36:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:36:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:38:40] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037498 [23:38:40] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037498 (owner: TrainBranchBot) [23:38:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:39:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:41:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:41:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:43:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:43:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:45:07] !log @deploy1002 
helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:45:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:47:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:47:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:49:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:49:48] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:52:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:52:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:54:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:54:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:55:37] (PS1) RLazarus: deployment_server: Add a mwscript-k8s cleanup script [puppet] - https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553)