[00:07:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1062473 (owner: 10TrainBranchBot) [00:09:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:27:47] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:56:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:26] FIRING: [3x] SystemdUnitFailed: 
docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:29:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:25] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:22:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:07:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:16:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync 
fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [04:27:48] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:06:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:40:05] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062487 [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:22:53] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:37:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:53:49] (03PS1) 10Marostegui: installserver: Do not reimage db2222 [puppet] - 10https://gerrit.wikimedia.org/r/1062593 [06:56:14] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1062593 (owner: 10Marostegui) [06:56:39] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage db2222 [puppet] - 10https://gerrit.wikimedia.org/r/1062593 (owner: 10Marostegui) [07:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:04:23] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: SLIMetricMissing - https://phabricator.wikimedia.org/T372454 (10LSobanski) 03NEW [07:05:08] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: SLIMetricMissing - https://phabricator.wikimedia.org/T372454#10063420 (10LSobanski) There are also (most likely) related "Linting problems found for SLIMetricMissing" alerts. [07:09:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208#10063422 (10ABran-WMF) Sure, let me know when you're up today [07:27:19] (03PS3) 10Klausman: api-gw/liftwing: add missing trailing `/` to path trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) [07:28:43] (03PS4) 10Klausman: api-gw/liftwing: fix prefix/trim for rec-api isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) [07:30:02] (03PS1) 10Ayounsi: Network report, remove clusters from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062665 [07:30:35] (03PS5) 10Klausman: api-gw/liftwing: fix prefix/trim for rec-api isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) [07:38:13] (03PS1) 10Ayounsi: Remove stray file [puppet] - 10https://gerrit.wikimedia.org/r/1062666 [07:38:56] (03CR) 10Filippo Giunchedi: [C:03+1] Remove stray file [puppet] - 10https://gerrit.wikimedia.org/r/1062666 (owner: 10Ayounsi) [07:41:28] (03CR) 10Klausman: [C:03+2] hiera/manifest/partman: Add DSE node with GPU [puppet] - 10https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [07:45:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2189.codfw.wmnet with reason: index corruption [07:45:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2189.codfw.wmnet with reason: index corruption [07:46:14] (03PS1) 10Klausman: site.pp: Add ml-labs machine entries for setup [puppet] - 10https://gerrit.wikimedia.org/r/1062667 (https://phabricator.wikimedia.org/T368978) [07:46:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:00:05] jeena and jnuche: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T0800). nyaa~ [08:04:35] (03CR) 10Ayounsi: [C:03+2] Remove stray file [puppet] - 10https://gerrit.wikimedia.org/r/1062666 (owner: 10Ayounsi) [08:14:53] (03PS1) 10Gmodena: gobblin: remove webrequest_frontend ingestion job. [puppet] - 10https://gerrit.wikimedia.org/r/1062671 (https://phabricator.wikimedia.org/T372456) [08:27:48] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:36:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:40:09] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1062671 (https://phabricator.wikimedia.org/T372456) (owner: 10Gmodena) [08:40:22] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10063565 (10Gehel) p:05Triage→03Medium [08:43:12] 06SRE, 10Observability-Metrics, 05Goal, 13Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#10063577 (10fgiunchedi) [08:43:25] FIRING: SystemdUnitFailed: user@499.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:56] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [08:47:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10063581 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye [08:48:25] RESOLVED: SystemdUnitFailed: user@499.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:35] (03PS1) 10Filippo Giunchedi: thanos: temp disable compact [puppet] - 10https://gerrit.wikimedia.org/r/1062678 (https://phabricator.wikimedia.org/T351927) [08:52:09] (03PS3) 10Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - 10https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) [08:52:27] (03PS4) 10Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - 10https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) [08:52:40] (03PS1) 10Gmodena: EventStreamConfig: 
remove webrequest_frontend. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062679 (https://phabricator.wikimedia.org/T372456) [08:52:54] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3632/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062678 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [08:53:19] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2010.codfw.wmnet with OS bullseye [08:53:29] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10063597 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with error... [08:54:30] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [08:54:55] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:57:58] (03CR) 10Filippo Giunchedi: Create corto deployment/configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [09:04:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:42] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:02] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: SLIMetricMissing - https://phabricator.wikimedia.org/T372454#10063616 (10fgiunchedi) Thank you @LSobanski! cfr https://phabricator.wikimedia.org/T352756#9984495 for the related activity [09:06:04] (03CR) 10Vgutierrez: [C:03+1] varnish: Set Cache-Control: no-transform header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917954 (https://phabricator.wikimedia.org/T218618) (owner: 10BCornwall) [09:09:25] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:09:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2189.codfw.wmnet with reason: replication still catching up [09:11:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2189.codfw.wmnet with reason: replication still catching up [09:12:31] (03PS4) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) [09:13:15] (03CR) 10Filippo Giunchedi: prometheus: add oauth2-proxy for OIDC authentication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062393 
(https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [09:13:22] (03CR) 10Arnaudb: [C:03+2] mariadb: exclude translate_message_group_subscriptions from replication [puppet] - 10https://gerrit.wikimedia.org/r/1062436 (https://phabricator.wikimedia.org/T372287) (owner: 10Arnaudb) [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:46] (03CR) 10Kamila Součková: [C:03+1] api-gw/liftwing: fix prefix/trim for rec-api isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) (owner: 10Klausman) [09:14:14] (03CR) 10Klausman: [C:03+2] api-gw/liftwing: fix prefix/trim for rec-api isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) (owner: 10Klausman) [09:14:55] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:56] (03Merged) 10jenkins-bot: api-gw/liftwing: fix prefix/trim for rec-api isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062368 (https://phabricator.wikimedia.org/T371465) (owner: 10Klausman) [09:16:55] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [09:17:09] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [09:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[09:19:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:19:55] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:02] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[09:23:40] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[09:26:08] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[09:26:35] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[09:34:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 12.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:40:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:41:29] here
[09:41:30] looks like commons (s4) is having issues again
[09:41:36] wooot
[09:41:37] checking
[09:41:45] urgh
[09:42:08] I am the severely undercaffeinated IC
[09:42:26] It is the s4 master indeed
[09:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:43:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:43:18] marostegui: this also happened yesterday night, people got a process list and strace etc, the incident doc doesn't seem up to date though
[09:43:21] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 10.08s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:43:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[09:44:08] marostegui: I believe c.danis and s.wfrench-wmf left stuff in their home dirs on cumin2002
[09:44:34] * kamila_ updating statuspage
[09:44:42] kamila_: let's focus on this incident and not the past one
[09:44:53] ok, sure
[09:45:42] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:45:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:48:15] RESOLVED: [9x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:48:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:48:20] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 21.81s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:49:28] looks like its back
[09:50:42] RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:50:51] FIRING: [7x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:50:57] RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:50:57] !incidents
[09:50:58] 5010 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad)
[09:50:58] 5011 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule)
[09:50:58] 5019 (RESOLVED) db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[09:50:58] 5021 (RESOLVED) db1243 (paged)/MariaDB Replica Lag: s4 (paged)
[09:50:59] 5018 (RESOLVED) db1242 (paged)/MariaDB Replica Lag: s4 (paged)
[09:50:59] 5020 (RESOLVED) db1241 (paged)/MariaDB Replica Lag: s4 (paged)
[09:50:59] 5016 (RESOLVED) db1249 (paged)/MariaDB Replica Lag: s4 (paged)
[09:50:59] 5015 (RESOLVED) db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[09:51:00] 5012 (RESOLVED) db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[09:51:00] 5014 (RESOLVED) db1238 (paged)/MariaDB Replica Lag: s4 (paged)
[09:51:01] 5017 (RESOLVED) db1221 (paged)/MariaDB Replica Lag: s4 (paged)
[09:51:01] 5013 (RESOLVED) db1199 (paged)/MariaDB Replica Lag: s4 (paged)
[09:51:02] 5022 (RESOLVED) db1190 (paged)/MariaDB Replica Lag: s4 (paged)
[09:51:02] 5009 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet)
[09:52:54] (PS1) Klausman: site.pp: Move new ML GPU hosts from insetup to production [puppet] - https://gerrit.wikimedia.org/r/1062686 (https://phabricator.wikimedia.org/T368978)
[09:52:54] (CR) Klausman: "Not really your neighborhood, so feel free to redirect the review, thx!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1062686 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [09:53:15] RESOLVED: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:53:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.404s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:53:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:55:13] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3633/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062686 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [09:55:51] RESOLVED: [7x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:56:37] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [09:58:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.398s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1000) [10:00:41] (03PS1) 10Klausman: manifest/hiera/conftool: Add new ML GPU hosts in eqiad [puppet] - 
10https://gerrit.wikimedia.org/r/1062688 (https://phabricator.wikimedia.org/T368978) [10:02:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10063720 (10klausman) - Labs machine insetup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1062667 (open) -... [10:07:54] (03PS2) 10Klausman: manifest/hiera/conftool: Add new ML GPU hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1062688 (https://phabricator.wikimedia.org/T368978) [10:08:14] (03PS2) 10Klausman: site.pp: Add ml-labs machine entries for setup [puppet] - 10https://gerrit.wikimedia.org/r/1062667 (https://phabricator.wikimedia.org/T368978) [10:08:20] (03CR) 10CI reject: [V:04-1] manifest/hiera/conftool: Add new ML GPU hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1062688 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [10:08:40] (03CR) 10CI reject: [V:04-1] site.pp: Add ml-labs machine entries for setup [puppet] - 10https://gerrit.wikimedia.org/r/1062667 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [10:09:33] (03PS3) 10Klausman: site.pp: Add ml-labs machine entries for setup [puppet] - 10https://gerrit.wikimedia.org/r/1062667 (https://phabricator.wikimedia.org/T368978) [10:09:51] (03PS3) 10Klausman: manifest/hiera/conftool: Add new ML GPU hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1062688 (https://phabricator.wikimedia.org/T368978) [10:29:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:32] (03PS1) 10Ayounsi: Rename netbox extra datasource to netbox extraS [cookbooks] - 10https://gerrit.wikimedia.org/r/1062692 [10:53:48] FIRING: PuppetFailure: Puppet has 
failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:56:17] (03PS1) 10Stevemunene: Update airflow version wo fix missing dependency [puppet] - 10https://gerrit.wikimedia.org/r/1062693 (https://phabricator.wikimedia.org/T365449) [10:57:03] (03PS1) 10Ayounsi: Provision script: remove additional IPs allocation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062694 (https://phabricator.wikimedia.org/T372461) [10:57:52] (03PS2) 10Stevemunene: Update airflow version to fix missing dependency [puppet] - 10https://gerrit.wikimedia.org/r/1062693 (https://phabricator.wikimedia.org/T365449) [11:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1100). [11:04:05] (03CR) 10Gmodena: [C:03+1] data-engineering: fix MediawikiPageContentChangeEnrichAvailability matching [alerts] - 10https://gerrit.wikimedia.org/r/1060061 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [11:04:46] (03PS2) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060161 [11:06:16] jouncebot: now [11:06:16] For the next 0 hour(s) and 53 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1100) [11:06:32] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060161 (owner: 10PipelineBot) [11:06:41] (03CR) 10Gmodena: [C:03+1] data-engineering: fix MediawikiPageContentChangeEnrichAvailability matching (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1060061 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [11:06:53] (03CR) 10Gmodena: [C:03+1] data-engineering: fix 
MediawikiPageContentChangeEnrichAvailability matching (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1060061 (https://phabricator.wikimedia.org/T354255) (owner: Filippo Giunchedi) [11:07:30] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1060161 (owner: PipelineBot) [11:11:10] sre-alert-triage, Patch-For-Review, SRE Observability (FY2024/2025-Q1): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#10063807 (gmodena) Ack. Thanks for the heads up and CRs @fgiunchedi [11:18:08] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:19:03] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:19:56] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:20:29] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:22:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 1%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67287 and previous config saved to /var/cache/conftool/dbconfig/20240814-112212-arnaudb.json [11:23:01] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:23:29] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:24:35] (CR) Btullis: [C:+1] "Looks good to me."
[puppet] - https://gerrit.wikimedia.org/r/1062686 (https://phabricator.wikimedia.org/T368978) (owner: Klausman) [11:29:06] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1053687 (owner: PipelineBot) [11:29:13] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1055899 (owner: PipelineBot) [11:37:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 2%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67288 and previous config saved to /var/cache/conftool/dbconfig/20240814-113718-arnaudb.json [11:38:26] (CR) Filippo Giunchedi: [C:+2] data-engineering: fix MediawikiPageContentChangeEnrichAvailability matching [alerts] - https://gerrit.wikimedia.org/r/1060061 (https://phabricator.wikimedia.org/T354255) (owner: Filippo Giunchedi) [11:38:35] (CR) Filippo Giunchedi: [C:+2] "Sure no problem!" [alerts] - https://gerrit.wikimedia.org/r/1060061 (https://phabricator.wikimedia.org/T354255) (owner: Filippo Giunchedi) [11:44:59] (PS1) KartikMistry: Use the updated recommendation API from liftwing [extensions/ContentTranslation] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062696 (https://phabricator.wikimedia.org/T371465) [11:45:19] (PS1) KartikMistry: Use the updated recommendation API from liftwing [extensions/ContentTranslation] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1062697 (https://phabricator.wikimedia.org/T371465) [11:45:47] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1062697 (https://phabricator.wikimedia.org/T371465) (owner: KartikMistry) [11:46:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 14 UTC
afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062696 (https://phabricator.wikimedia.org/T371465) (owner: KartikMistry) [11:50:38] (CR) Btullis: [C:+1] Update airflow version to fix missing dependency [puppet] - https://gerrit.wikimedia.org/r/1062693 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene) [11:52:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 4%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67289 and previous config saved to /var/cache/conftool/dbconfig/20240814-115223-arnaudb.json [12:04:04] (CR) Ayounsi: Expose Netbox tunnel data to config templates (8 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: Cathal Mooney) [12:04:24] (CR) Ayounsi: [C:+1] Use Netbox data to build tunnel configuration on CRs [homer/public] - https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) (owner: Cathal Mooney) [12:07:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 8%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67290 and previous config saved to /var/cache/conftool/dbconfig/20240814-120729-arnaudb.json [12:11:57] (CR) Stevemunene: [C:+2] Update airflow version to fix missing dependency [puppet] - https://gerrit.wikimedia.org/r/1062693 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene) [12:17:36] (CR) Elukey: [C:+1] Rename netbox extra datasource to netbox extraS [cookbooks] - https://gerrit.wikimedia.org/r/1062692 (owner: Ayounsi) [12:17:48] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#10063948 (elukey) @brouberol after T355550 do we have any
plans to start testing the upgrade on kafka-test or similar? I can help if needed :) [12:19:23] (CR) Klausman: [V:+1 C:+2] site.pp: Move new ML GPU hosts from insetup to production [puppet] - https://gerrit.wikimedia.org/r/1062686 (https://phabricator.wikimedia.org/T368978) (owner: Klausman) [12:19:42] (PS1) Elukey: blubber: update buildkit version [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 [12:22:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 16%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67291 and previous config saved to /var/cache/conftool/dbconfig/20240814-122234-arnaudb.json [12:22:35] (CR) Ayounsi: [C:+2] Rename netbox extra datasource to netbox extraS [cookbooks] - https://gerrit.wikimedia.org/r/1062692 (owner: Ayounsi) [12:27:48] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:28] (CR) Elukey: dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: Elukey) [12:28:33] (CR) Elukey: [C:+2] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: Elukey) [12:29:54] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10063977 (phaultfinder) [12:31:34] (CR) Elukey: "Image built locally, tested that the target package was updated (see https://phabricator.wikimedia.org/T372466)" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 (owner: Elukey)
[12:35:42] (Merged) jenkins-bot: Rename netbox extra datasource to netbox extraS [cookbooks] - https://gerrit.wikimedia.org/r/1062692 (owner: Ayounsi) [12:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:37:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 25%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67292 and previous config saved to /var/cache/conftool/dbconfig/20240814-123739-arnaudb.json [12:39:04] (CR) Hnowlan: [C:+1] blubber: update buildkit version [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 (owner: Elukey) [12:39:25] (CR) Hnowlan: [C:+1] blubber: update buildkit version (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 (owner: Elukey) [12:41:19] (CR) Elukey: blubber: update buildkit version (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 (owner: Elukey) [12:41:49] jouncebot: next [12:41:49] In 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1300) [12:42:22] I'm doing +2 for my wmf branch patches, it will take 20-25 minutes.
[12:42:57] (CR) KartikMistry: [C:+2] Use the updated recommendation API from liftwing [extensions/ContentTranslation] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1062697 (https://phabricator.wikimedia.org/T371465) (owner: KartikMistry) [12:43:02] (CR) KartikMistry: [C:+2] Use the updated recommendation API from liftwing [extensions/ContentTranslation] (wmf/1.43.0-wmf.18) - https://gerrit.wikimedia.org/r/1062696 (https://phabricator.wikimedia.org/T371465) (owner: KartikMistry) [12:44:47] (CR) Hnowlan: [C:+1] blubber: update buildkit version (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 (owner: Elukey) [12:45:59] (CR) Elukey: [C:+2] blubber: update buildkit version [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 (owner: Elukey) [12:48:56] (CR) CI reject: [V:-1] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - https://gerrit.wikimedia.org/r/1060854 (https://phabricator.wikimedia.org/T365372) (owner: Elukey) [12:49:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on 9 hosts with reason: replication table exclusion deployment [12:49:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 9 hosts with reason: replication table exclusion deployment [12:52:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67293 and previous config saved to /var/cache/conftool/dbconfig/20240814-125245-arnaudb.json [12:56:50] (Merged) jenkins-bot: blubber: update buildkit version [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1062702 (owner: Elukey) [12:57:27] (CR) Elukey: "The interesting bit is that if I try the same iptables trick with the current config, I get errors too:" [software/spicerack] - https://gerrit.wikimedia.org/r/1060855
(https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1300). [13:00:05] kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:19] o/ [13:00:25] I can deploy but wouldn’t mind if someone else does ^^ [13:00:40] kart_: want to self-serve? [13:00:50] Lucas_WMDE: yeah, will do deploy. [13:00:54] okay! [13:07:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67295 and previous config saved to /var/cache/conftool/dbconfig/20240814-130750-arnaudb.json [13:11:10] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:11:15] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:11:48] (03Merged) 10jenkins-bot: Use the updated recommendation API from liftwing [extensions/ContentTranslation] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1062697 (https://phabricator.wikimedia.org/T371465) (owner: 10KartikMistry) [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:03] (03CR) 10Ayounsi: Add function to wmf-netbox plugin to provide QoS config data (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [13:14:26] !log ebernhardson@deploy1003 helmfile 
[staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:14:38] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:15:07] (03Merged) 10jenkins-bot: Use the updated recommendation API from liftwing [extensions/ContentTranslation] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062696 (https://phabricator.wikimedia.org/T371465) (owner: 10KartikMistry) [13:16:32] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1062697|Use the updated recommendation API from liftwing (T371465)]] [13:16:34] T371465: Deploy Modernized Recommendation API to LiftWing - https://phabricator.wikimedia.org/T371465 [13:17:05] hmm. Not sure why scap is pulling wmf.18 change also.. [13:18:47] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:18:53] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:18:58] !log kartik@deploy1003 kartik: Backport for [[gerrit:1062697|Use the updated recommendation API from liftwing (T371465)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:31] !log kartik@deploy1003 kartik: Continuing with sync [13:22:50] !log jayme@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-main2010.codfw.wmnet'] [13:22:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: corrupted index fixed', diff saved to https://phabricator.wikimedia.org/P67296 and previous config saved to /var/cache/conftool/dbconfig/20240814-132256-arnaudb.json [13:25:09] !log kartik@deploy1003 Finished scap sync-world: Backport 
for [[gerrit:1062697|Use the updated recommendation API from liftwing (T371465)]] (duration: 08m 37s) [13:25:12] T371465: Deploy Modernized Recommendation API to LiftWing - https://phabricator.wikimedia.org/T371465 [13:25:52] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1062696|Use the updated recommendation API from liftwing (T371465)]] [13:28:08] !log kartik@deploy1003 kartik: Backport for [[gerrit:1062696|Use the updated recommendation API from liftwing (T371465)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:51] (03PS1) 10Elukey: thumbor: update Docker image for thumbor-plugins [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062706 (https://phabricator.wikimedia.org/T372466) [13:29:23] !log kartik@deploy1003 kartik: Continuing with sync [13:32:44] !log jayme@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-main2010.codfw.wmnet'] [13:32:54] (03PS1) 10Ottomata: gobblin: remove webrequest_frontend_rc0 [puppet] - 10https://gerrit.wikimedia.org/r/1062707 (https://phabricator.wikimedia.org/T372456) [13:33:19] (03CR) 10CI reject: [V:04-1] gobblin: remove webrequest_frontend_rc0 [puppet] - 10https://gerrit.wikimedia.org/r/1062707 (https://phabricator.wikimedia.org/T372456) (owner: 10Ottomata) [13:33:44] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062696|Use the updated recommendation API from liftwing (T371465)]] (duration: 07m 51s) [13:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:33:52] T371465: Deploy Modernized Recommendation API to LiftWing - https://phabricator.wikimedia.org/T371465 [13:34:51] (03PS2) 10Ottomata: gobblin: remove webrequest_frontend_rc0 [puppet] - 10https://gerrit.wikimedia.org/r/1062707 
(https://phabricator.wikimedia.org/T372456) [13:35:34] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [13:36:20] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1062678 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [13:39:53] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10064129 (10phaultfinder) [13:43:09] (03CR) 10Hnowlan: [C:03+1] thumbor: update Docker image for thumbor-plugins [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062706 (https://phabricator.wikimedia.org/T372466) (owner: 10Elukey) [13:50:36] (03PS1) 10Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) [13:50:38] (03CR) 10Elukey: [C:03+2] thumbor: update Docker image for thumbor-plugins [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062706 (https://phabricator.wikimedia.org/T372466) (owner: 10Elukey) [13:50:41] (03PS1) 10Ilias Sarantopoulos: admin_ng/LiftWing: add article-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062709 (https://phabricator.wikimedia.org/T360455) [13:50:49] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [13:52:12] (03CR) 10CI reject: [V:04-1] opensearch: unreach port and shards alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) (owner: 10Tiziano Fogli) [13:52:49] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [13:55:03] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync [13:55:06] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync [13:56:49] (03PS2) 10Tiziano Fogli: opensearch: unreach port and shards 
alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) [13:59:58] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10064161 (10phaultfinder) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1400) [14:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 9.007% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:04:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:13:10] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10064193 (10elukey) Filed https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/24 to fix a conftool issue: ` eluk... 
[14:13:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 4.79s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:15:15] Here. [14:15:19] !incidents [14:15:19] 5023 (ACKED) db1190 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:19] 5024 (UNACKED) db1221 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:20] 5025 (UNACKED) db1238 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:20] 5026 (UNACKED) db1248 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:20] 5027 (UNACKED) db1242 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:20] 5028 (UNACKED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:20] 5029 (UNACKED) db1243 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:21] 5030 (UNACKED) db1249 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:21] 5031 (UNACKED) db1241 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:22] 5011 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:15:22] 5010 (RESOLVED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [14:15:23] 5019 (RESOLVED) db1244 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:23] 5021 (RESOLVED) db1243 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:24] 5018 (RESOLVED) db1242 (paged)/MariaDB Replica Lag: s4 (paged) [14:15:31] !ack 5024 [14:15:34] !ack 5025 [14:15:38] !ack 5026 [14:15:41] !ack 5027 [14:15:47] !ack 5028 [14:15:52] Same issue as yesterday. [14:15:53] !ack 5029 [14:16:04] denisse: and earlier today [14:16:18] !ack 5030 [14:16:20] !ack 5031 [14:16:23] see on security [14:16:42] there is debugging going on on s4 [14:17:08] arnaudb: ACK, thanks! 
[14:17:24] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [14:17:27] sorry for the pager rage [14:18:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.14% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:21:42] !log ebernhardson@deploy1003 Synchronized private/PrivateSettings.php: Update NetworkSession users list for T341332 (duration: 12m 33s) [14:21:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1 es1029 depooling for hdd hotswap', diff saved to https://phabricator.wikimedia.org/P67299 and previous config saved to /var/cache/conftool/dbconfig/20240814-142147-arnaudb.json [14:21:51] T341332: [EPIC] The CirrusSearch streaming updater should support private wikis - https://phabricator.wikimedia.org/T341332 [14:22:26] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [14:23:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 835.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:24:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:24:30] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:25:50] (03Abandoned) 10Ottomata: 
gobblin: remove webrequest_frontend_rc0 [puppet] - 10https://gerrit.wikimedia.org/r/1062707 (https://phabricator.wikimedia.org/T372456) (owner: 10Ottomata) [14:27:33] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [14:28:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 1%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67304 and previous config saved to /var/cache/conftool/dbconfig/20240814-142808-arnaudb.json [14:28:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.27% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:29:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 (owner: 10Isabelle Hurbain-Palatin) [14:29:15] RESOLVED: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:29:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:11] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [14:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:15] (03PS1) 10AOkoth: vrts: build & install packages [cookbooks] - 
10https://gerrit.wikimedia.org/r/1062715 (https://phabricator.wikimedia.org/T366078) [14:41:38] (03PS2) 10AOkoth: vrts: build & install packages [cookbooks] - 10https://gerrit.wikimedia.org/r/1062715 (https://phabricator.wikimedia.org/T366078) [14:43:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 2%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67305 and previous config saved to /var/cache/conftool/dbconfig/20240814-144314-arnaudb.json [14:43:19] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bookworm [14:48:45] (03PS88) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [14:49:05] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main2010.codfw.wmnet with OS bookworm [14:53:10] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2010.codfw.wmnet [14:54:52] (03CR) 10CI reject: [V:04-1] vrts: build & install packages [cookbooks] - 10https://gerrit.wikimedia.org/r/1062715 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [14:58:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 4%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67307 and previous config saved to /var/cache/conftool/dbconfig/20240814-145819-arnaudb.json [14:59:06] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2010.codfw.wmnet [14:59:25] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[15:07:44] Deployment mw-api-ext.eqiad.canary in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-ext&var-deployment=mw-api-ext.eqiad.canary - ... [15:07:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:13:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 8%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67312 and previous config saved to /var/cache/conftool/dbconfig/20240814-151328-arnaudb.json [15:15:39] (03PS1) 10Ilias Sarantopoulos: ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062721 [15:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:26:40] (03PS1) 10JMeybohm: preseed: Add standard profile for initial imaging [puppet] - 10https://gerrit.wikimedia.org/r/1062722 (https://phabricator.wikimedia.org/T371423) [15:26:41] (03CR) 10AOkoth: [C:03+2] prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [15:27:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[15:27:44] Deployment mw-api-ext.eqiad.canary in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-ext&var-deployment=mw-api-ext.eqiad.canary - ... [15:27:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:28:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 16%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67314 and previous config saved to /var/cache/conftool/dbconfig/20240814-152833-arnaudb.json [15:29:35] (03CR) 10JMeybohm: [C:03+2] preseed: Add standard profile for initial imaging [puppet] - 10https://gerrit.wikimedia.org/r/1062722 (https://phabricator.wikimedia.org/T371423) (owner: 10JMeybohm) [15:32:37] (03PS1) 10Ssingh: sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 [15:34:21] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [15:34:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:50] (03PS2) 10Ssingh: sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 [15:39:05] (03PS1) 10AOkoth: vrts: fix sql export param [puppet] - 10https://gerrit.wikimedia.org/r/1062725 [15:39:16] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [15:39:31] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:39:32] !log dani@deploy1003 helmfile 
[eqiad] START helmfile.d/services/miscweb: apply [15:39:57] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:39:58] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:40:14] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:43:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67315 and previous config saved to /var/cache/conftool/dbconfig/20240814-154338-arnaudb.json [15:43:39] (03CR) 10AOkoth: [C:03+2] vrts: fix sql export param [puppet] - 10https://gerrit.wikimedia.org/r/1062725 (owner: 10AOkoth) [15:44:37] (03PS1) 10Peter Fischer: Search update pipeline: consume consolidated page-weighted-tags-change-stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062726 (https://phabricator.wikimedia.org/T366253) [15:44:57] (03PS3) 10Ssingh: sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 [15:46:15] (03PS4) 10Ssingh: sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 [15:47:45] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:47:52] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:54:56] (03PS1) 10Klausman: adming_ng/bgp: Add rows C and D in codfw to IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062728 [15:55:32] (03PS1) 10AOkoth: vrts: remove port from connection string [puppet] - 10https://gerrit.wikimedia.org/r/1062729 (https://phabricator.wikimedia.org/T310822) [15:57:42] (03CR) 10Ayounsi: [C:03+1] adming_ng/bgp: Add rows C and D in codfw to IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062728 (owner: 10Klausman) [15:58:45] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'es1029 (re)pooling @ 50%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67316 and previous config saved to /var/cache/conftool/dbconfig/20240814-155844-arnaudb.json [15:59:13] (03CR) 10Klausman: [C:03+2] adming_ng/bgp: Add rows C and D in codfw to IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062728 (owner: 10Klausman) [15:59:29] (03CR) 10AOkoth: [C:03+2] vrts: remove port from connection string [puppet] - 10https://gerrit.wikimedia.org/r/1062729 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [16:01:34] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [16:01:54] (03PS5) 10Ssingh: sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 [16:02:26] (03Merged) 10jenkins-bot: adming_ng/bgp: Add rows C and D in codfw to IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062728 (owner: 10Klausman) [16:03:44] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:04:11] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:04:47] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[16:08:25] (03CR) 10BBlack: [C:03+1] sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 (owner: 10Ssingh) [16:13:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67317 and previous config saved to /var/cache/conftool/dbconfig/20240814-161350-arnaudb.json [16:24:26] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye [16:26:41] (03PS1) 10AOkoth: prometheus: add scrape config for vrts sql exporter [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) [16:27:48] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:02] (03PS2) 10AOkoth: prometheus: add scrape config for vrts sql exporter [puppet] - 10https://gerrit.wikimedia.org/r/1062734 (https://phabricator.wikimedia.org/T310822) [16:28:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: broken disk replaced, slow repooling', diff saved to https://phabricator.wikimedia.org/P67318 and previous config saved to /var/cache/conftool/dbconfig/20240814-162854-arnaudb.json [16:29:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:30:26] (03PS1) 10Ladsgroup: Avoid primary DB query for non-talk page edits [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 
10https://gerrit.wikimedia.org/r/1062736 (https://phabricator.wikimedia.org/T370304) [16:30:52] jouncebot: nowandnext [16:30:52] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [16:30:53] In 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1700) [16:31:05] (03CR) 10Ladsgroup: [C:03+2] Avoid primary DB query for non-talk page edits [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062736 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [16:31:39] (03PS1) 10Ladsgroup: Avoid primary DB query for non-talk page edits [extensions/DiscussionTools] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1062737 (https://phabricator.wikimedia.org/T370304) [16:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:37:43] (03CR) 10Kevin Bazira: [C:03+1] ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062721 (owner: 10Ilias Sarantopoulos) [16:38:23] (03PS6) 10Ssingh: sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 [16:41:02] (03CR) 10Ladsgroup: [C:03+2] Avoid primary DB query for non-talk page edits [extensions/DiscussionTools] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1062737 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [16:41:06] (03Merged) 10jenkins-bot: Avoid primary DB query for non-talk page edits 
[extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062736 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [16:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 15.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:41:34] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1062736|Avoid primary DB query for non-talk page edits (T370304)]] [16:42:39] !log otto@deploy1003 Started deploy [analytics/refinery@f033576] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f0335766] [16:43:28] FIRING: SystemdUnitCrashLoop: logstash.service crashloop on elastic1101:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:43:45] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1062736|Avoid primary DB query for non-talk page edits (T370304)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:45:16] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [16:45:45] !log otto@deploy1003 Finished deploy [analytics/refinery@f033576] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f0335766] (duration: 03m 06s) [16:47:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 14.62s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:47:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes 
(http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:48:24] !log otto@deploy1003 Started deploy [analytics/refinery@f033576] (thin): Regular analytics weekly train THIN [analytics/refinery@f0335766] [16:48:28] FIRING: [4x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1070:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:48:42] !log reran editors_daily_monthly airflow dag with run_id scheduled__2024-06-01T00:00:00+00:00 as part of downstream tasks after rerunning mediawiki_history_denormalize dag [16:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:51:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 19.11% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:51:45] (03Merged) 10jenkins-bot: Avoid primary DB query for non-talk page edits [extensions/DiscussionTools] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1062737 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [16:52:22] (03PS2) 
10Ilias Sarantopoulos: ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062721 [16:52:37] !log otto@deploy1003 Finished deploy [analytics/refinery@f033576] (thin): Regular analytics weekly train THIN [analytics/refinery@f0335766] (duration: 04m 13s) [16:52:44] !log otto@deploy1003 Started deploy [analytics/refinery@f033576]: Regular analytics weekly train [analytics/refinery@f0335766] [16:52:46] !log reran edit_hourly airflow dag with run_id scheduled__2024-06-01T00:00:00+00:00 as part of down stream tasks after rerunning mediawiki_history_denormalize for 2024-06 snapshot. [16:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:57] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:53:28] FIRING: [8x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1070:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:55:36] !log reran unique_editors_by_country_monthly airflow dag with run_id scheduled__2024-06-01T00:00:00+00:00 as part of down stream tasks after rerunning mediawiki_history_denormalize for 2024-06 snapshot. 
[16:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:56:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0.4167% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:58:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:58:28] FIRING: [17x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1059:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:58:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10064707 (10JMeybohm) They all failed because the installer tried to bring up some old mdadm arrays and failed doing so, Maybe they where on the new disks, or it is bec... 
[16:58:45] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:59:33] !log otto@deploy1003 Finished deploy [analytics/refinery@f033576]: Regular analytics weekly train [analytics/refinery@f0335766] (duration: 06m 48s) [17:00:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10064712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye [17:00:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:01:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 19.94% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:02:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.992s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:03:28] RESOLVED: [16x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1059:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:03:30] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:05:27] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1062736|Avoid primary DB query for non-talk page edits (T370304)]], [[gerrit:1062737|Avoid primary DB query for non-talk page edits (T370304)]] [17:05:51] RESOLVED: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:07:38] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1062736|Avoid primary DB query for non-talk page edits (T370304)]], [[gerrit:1062737|Avoid primary DB query for non-talk page edits (T370304)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:08:43] FIRING: [14x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1059:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:08:55] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:08:58] FIRING: [15x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1059:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:09:30] !log reran geoeditors_edits_monthly airflow dag with run_id scheduled__2024-06-01T00:00:00+00:00 as part of down stream tasks after rerunning mediawiki_history_denormalize for 2024-06 snapshot. [17:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:27] !log reran geoeditors_monthly airflow dag with run_id scheduled__2024-06-01T00:00:00+00:00 as part of down stream tasks after rerunning mediawiki_history_denormalize for 2024-06 snapshot. 
[17:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:22] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062736|Avoid primary DB query for non-talk page edits (T370304)]], [[gerrit:1062737|Avoid primary DB query for non-talk page edits (T370304)]] (duration: 07m 54s) [17:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:43] FIRING: [15x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1054:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:13:46] (03PS7) 10Ssingh: sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 [17:13:58] FIRING: [15x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1054:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:16:41] !log reran geoeditors_public_monthly airflow dag with run_id scheduled__2024-06-01T00:00:00+00:00 as part of down stream tasks after rerunning mediawiki_history_denormalize for 2024-06 snapshot. 
[17:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:02] (03PS3) 10Ilias Sarantopoulos: ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062721 [17:17:34] !log otto@deploy1003 Started deploy [airflow-dags/analytics_product@6d50458]: (no justification provided) [17:17:42] !log otto@deploy1003 Finished deploy [airflow-dags/analytics_product@6d50458]: (no justification provided) (duration: 00m 08s) [17:18:43] RESOLVED: [8x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1054:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:22:43] (03CR) 10BCornwall: [V:03+1] Create corto deployment/configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [17:22:59] (03CR) 10BBlack: [C:03+1] sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 (owner: 10Ssingh) [17:24:34] (03PS13) 10BCornwall: Create corto deployment/configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) [17:25:52] (03PS14) 10BCornwall: Create corto deployment/configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) [17:26:11] (03CR) 10Ilias Sarantopoulos: "marking this as WIP as the diff shows nothing atm." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1062721 (owner: 10Ilias Sarantopoulos) [17:26:29] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [17:27:33] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: update cookbook (see detailed commit message) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062724 (owner: 10Ssingh) [17:28:48] (03PS3) 10BCornwall: varnish: Set Cache-Control: no-transform header [puppet] - 10https://gerrit.wikimedia.org/r/917954 (https://phabricator.wikimedia.org/T218618) [17:28:55] (03CR) 10BCornwall: varnish: Set Cache-Control: no-transform header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917954 (https://phabricator.wikimedia.org/T218618) (owner: 10BCornwall) [17:29:41] some dummy cookbook runs coming along, no actual change to DNS depooling yet [17:29:44] just as an FYI [17:30:01] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified] [17:30:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified] [17:30:11] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: no reason specified, no task ID specified] [17:30:23] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: pool site eqiad [reason: no reason specified, no task ID specified] [17:30:44] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: testing cookbook, T369366] [17:30:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: testing cookbook, T369366] [17:30:52] T369366: Migrate DNS depooling of sites from 
operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366 [17:31:03] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified] [17:31:03] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified] [17:31:19] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site magru for service: text-addrs [reason: no reason specified, no task ID specified] [17:31:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru for service: text-addrs [reason: no reason specified, no task ID specified] [17:31:34] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified] [17:31:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified] [17:32:26] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site esams for service: text-addrs|text-next [reason: no reason specified, no task ID specified] [17:32:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams for service: text-addrs|text-next [reason: no reason specified, no task ID specified] [17:35:31] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: no reason specified, no task ID specified] [17:35:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: no reason specified, no task ID specified] [17:35:44] sorry for the spam folks [17:36:44] (03PS5) 10Ilias Sarantopoulos: ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062721 [17:37:52] (03CR) 10Ilias Sarantopoulos: "nevermind it was just a PEBCAK: there 
was no diff because I set exactly the same image version in the new config." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062721 (owner: 10Ilias Sarantopoulos) [17:45:04] (03PS1) 10Ssingh: sre.dns.admin: show changes before and after a state is applied [cookbooks] - 10https://gerrit.wikimedia.org/r/1062744 [17:50:43] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2010.codfw.wmnet with OS bullseye [17:50:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10064969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with error... [17:59:27] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: show changes before and after a state is applied [cookbooks] - 10https://gerrit.wikimedia.org/r/1062744 (owner: 10Ssingh) [18:00:05] jeena and jnuche: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T1800). 
[18:01:45] (03PS1) 10Ssingh: sre.dns.admin: fix typo in examples [cookbooks] - 10https://gerrit.wikimedia.org/r/1062745 [18:03:58] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062746 (https://phabricator.wikimedia.org/T366963) [18:04:00] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062746 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot) [18:04:43] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062746 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot) [18:14:29] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.18 refs T366963 [18:14:33] T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963 [18:15:36] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: fix typo in examples [cookbooks] - 10https://gerrit.wikimedia.org/r/1062745 (owner: 10Ssingh) [18:38:35] (03CR) 10Dzahn: [C:03+1] vrts: fix sql export param [puppet] - 10https://gerrit.wikimedia.org/r/1062725 (owner: 10AOkoth) [18:59:57] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10065129 (10phaultfinder) [19:00:51] (03CR) 10Dzahn: [V:03+1 C:03+2] "going to apply this only on the inactive host to compare with the active host" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:04:58] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10065143 (10phaultfinder) [19:07:03] (03CR) 10Dzahn: [V:03+1 C:03+2] "syntax simplified so much with nftables! 
just "srange => [$deployment_server]" replaces "srange => "(@resolve((${deployment_server})) @res" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:20:56] (03CR) 10Dzahn: [V:03+1 C:03+2] "results in /etc/nftables/input/10_miscweb-http-deployment.nft / 10_miscweb-http-envoy.nft all looking good and like in the active system." [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:24:38] (03CR) 10Dzahn: "@hashar all the @resolve as well as repeating things for IPv6 are not needed anymore in firewall syntax with this. both things just happen" [puppet] - 10https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:26:25] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@6d50458]: Test Refine through Airflow [19:26:37] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@6d50458]: Test Refine through Airflow (duration: 00m 12s) [19:57:55] (03PS6) 10Dzahn: zuul: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) [19:57:55] (03CR) 10Dzahn: [V:04-1] "One blocking issue with that, it doesn't seem possible yet to use the placeholders like CACHES in an srange. I'll remove you again as revi" [puppet] - 10https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T2000). [20:00:05] arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:03:10] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511 (10RobH) 03NEW [20:06:28] Is anyone around to deploy? [20:07:29] yeah I can [20:07:47] Thank you [20:08:12] arlolra: looks like there is a merge conflict. Can you update your patch? [20:08:55] (03PS2) 10Isabelle Hurbain-Palatin: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 [20:08:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10065323 (10Marostegui) I will leave this host depooled and with mysql down in the EU morning so you could proceed during your Thursday. I will ping you here anyway when it is... [20:09:26] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10065320 (10RobH) a:03colewhite @colewhite, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receivin... [20:09:52] thanks! I'll start the backport now [20:09:58] jeena: Isn't there always a merge conflict in that repo? 
It should be ok now [20:10:25] 🤷‍♀️ [20:10:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 (owner: 10Isabelle Hurbain-Palatin) [20:11:37] (03Merged) 10jenkins-bot: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 (owner: 10Isabelle Hurbain-Palatin) [20:11:53] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1062037|Activates the "compact" Parsoid indicator on all wikivoyage wikis]] [20:14:03] !log jhuneidi@deploy1003 ihurbain, jhuneidi: Backport for [[gerrit:1062037|Activates the "compact" Parsoid indicator on all wikivoyage wikis]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:14:23] arlolra: ready for any tests you need to do on mwdebug [20:14:30] ok, one sec [20:14:44] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10065333 (10RobH) [20:15:53] 10ops-codfw, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512 (10RobH) 03NEW [20:16:14] 10ops-codfw, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10065355 (10RobH) [20:17:15] 10ops-codfw, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[4-5] - https://phabricator.wikimedia.org/T372512#10065361 (10RobH) a:03colewhite Cole, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the... 
[20:18:17] jeena: sorry, bear with me [20:18:26] np [20:20:38] jeena: Sorry, I'm not happy with how this is looking, can we roll it back [20:20:50] yup [20:21:04] I won't continue to sync, but we'll need to do a revert as well [20:21:13] i can do it with scap backport if you want [20:21:19] !log jhuneidi@deploy1003 Sync cancelled. [20:21:40] please do [20:22:19] (03PS1) 10Scott French: Add component/php81 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1062753 (https://phabricator.wikimedia.org/T372507) [20:22:21] (03PS1) 10Scott French: Add pbuilder hook for component/php81 [puppet] - 10https://gerrit.wikimedia.org/r/1062754 (https://phabricator.wikimedia.org/T372507) [20:22:23] (03PS1) 10TrainBranchBot: Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062755 [20:22:23] (03CR) 10TrainBranchBot: "jhuneidi@deploy1003 created a revert of this change as I1351748eb28056b7f62b7081102aa9580b0e1842" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062037 (owner: 10Isabelle Hurbain-Palatin) [20:23:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062755 (owner: 10TrainBranchBot) [20:24:08] (03Merged) 10jenkins-bot: Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062755 (owner: 10TrainBranchBot) [20:24:24] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1062753 (https://phabricator.wikimedia.org/T372507) (owner: 10Scott French) [20:24:26] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1062755|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] [20:25:42] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514 (10RobH) 03NEW [20:26:11] 
ops-eqiad, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10065404 (RobH) [20:26:33] !log jhuneidi@deploy1003 trainbranchbot, jhuneidi: Backport for [[gerrit:1062755|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:26:37] !log jhuneidi@deploy1003 trainbranchbot, jhuneidi: Continuing with sync [20:26:48] ops-eqiad, Data-Persistence, DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10065406 (RobH) a:Eevans Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new se... [20:27:48] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:29:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:07] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062755|Revert "Activates the "compact" Parsoid indicator on all wikivoyage wikis"]] (duration: 06m 40s) [20:31:21] jeena: really sorry for the trouble [20:31:22] arlolra: all done [20:31:32] no problem!
[20:31:37] thanks for the help [20:31:43] you're welcome :) [20:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:49:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 6.679% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:55:42] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:57:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:57:15] Same as before, I've ACK'd the alerts. 
[20:57:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 18.46s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:57:19] !alerts [20:57:36] !incidents [20:57:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:57:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:58:51] FIRING: [4x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:59:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240814T2100) [21:00:36] (PS1) Cwhite: site: add insetup configs for new logging-hd hosts [puppet] - https://gerrit.wikimedia.org/r/1062758 (https://phabricator.wikimedia.org/T372511) [21:00:42] FIRING: [3x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:00:57] RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:01:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:02:15] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 15.31s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:03:51] FIRING: [10x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:04:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:04:25] RESOLVED: [3x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:12] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:07:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:07:15] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 2.345s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:07:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [21:07:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [21:08:51] RESOLVED: [9x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:09:15] RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 6.684% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:12:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:17:15] RESOLVED: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 820ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:40:48] (CR) Mdaniels5757: "very late, but Gerrit is mad at me, so: ack :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: Mdaniels5757) [22:05:50] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:07:19] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:15:29] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:17:38] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:28:06] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:28:13] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:31:46] (CR) Ebernhardson: [C:+2] Search update pipeline: consume consolidated page-weighted-tags-change-stream [deployment-charts] - https://gerrit.wikimedia.org/r/1062726 (https://phabricator.wikimedia.org/T366253) (owner: Peter Fischer) [22:32:35] (CR) Ebernhardson: [C:-2] "needs container version update" [deployment-charts] - https://gerrit.wikimedia.org/r/1062726 (https://phabricator.wikimedia.org/T366253) (owner: Peter Fischer) [22:35:26] (PS2)
Ebernhardson: Search update pipeline: consume consolidated page-weighted-tags-change-stream [deployment-charts] - https://gerrit.wikimedia.org/r/1062726 (https://phabricator.wikimedia.org/T366253) (owner: Peter Fischer) [22:38:06] (CR) Ebernhardson: [C:+2] Search update pipeline: consume consolidated page-weighted-tags-change-stream [deployment-charts] - https://gerrit.wikimedia.org/r/1062726 (https://phabricator.wikimedia.org/T366253) (owner: Peter Fischer) [22:39:08] (Merged) jenkins-bot: Search update pipeline: consume consolidated page-weighted-tags-change-stream [deployment-charts] - https://gerrit.wikimedia.org/r/1062726 (https://phabricator.wikimedia.org/T366253) (owner: Peter Fischer) [22:45:29] (PS1) Ebernhardson: cirrus: Start reading from streams for private wikis [deployment-charts] - https://gerrit.wikimedia.org/r/1062762 [22:48:07] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:48:12] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:48:29] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [22:50:46] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:50:49] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:50:54] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:51:55] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:52:02] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:52:25] (CR) Ebernhardson: [C:+2] cirrus: Start reading from streams for private wikis [deployment-charts] - https://gerrit.wikimedia.org/r/1062762 (owner: Ebernhardson) [22:53:22] (Merged) jenkins-bot: cirrus: Start reading from streams for private wikis
[deployment-charts] - https://gerrit.wikimedia.org/r/1062762 (owner: Ebernhardson) [22:56:31] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:56:37] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:01:10] i ran into something unexpected when running the sre.dns.netbox cookbook. i had updated the name on frdb2004 and the description on the mgmt interface to correct a typo from when it was entered. but running the cookbook yields no changes for the mgmt DNS as i would expect. [23:03:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [23:05:17] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [23:05:58] trying again in case there was a timing issue (which i don't suspect will show any change) [23:07:33] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:09:19] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [23:09:35] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:11:25] yeah. no difference.
[23:17:30] (PS1) RLazarus: admin: Add jfk to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1062763 [23:18:14] (PS2) RLazarus: admin: Add jfk to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1062763 [23:18:20] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10065649 (Dwisehaupt) @Papaul @Jhancock.wm I updated the host name in netbox and updated the mgmt interface description to have the correct name and ran the sre.dns.netbox c... [23:25:38] (PS1) Ebernhardson: Revert "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - https://gerrit.wikimedia.org/r/1062764 [23:26:18] (PS2) Ebernhardson: Revert "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - https://gerrit.wikimedia.org/r/1062764 [23:26:49] (PS3) Ebernhardson: Revert "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - https://gerrit.wikimedia.org/r/1062764 [23:26:52] (CR) Jasmine_: [C:+1] admin: Add jfk to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1062763 (owner: RLazarus) [23:26:59] (CR) Ebernhardson: [C:+2] Revert "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - https://gerrit.wikimedia.org/r/1062764 (owner: Ebernhardson) [23:28:01] (Merged) jenkins-bot: Revert "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - https://gerrit.wikimedia.org/r/1062764 (owner: Ebernhardson) [23:30:07] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [23:30:13] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:33:54] !log ebernhardson@deploy1003 helmfile [codfw] START
helmfile.d/services/cirrus-streaming-updater: apply [23:34:04] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:38:49] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062767 [23:38:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062767 (owner: TrainBranchBot) [23:43:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [23:50:01] (CR) RLazarus: [C:+2] admin: Add jfk to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1062763 (owner: RLazarus)