[00:00:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) 05Open→03Resolved @akosiaris this is complete [00:01:54] !log cwhite@deploy1002 Started deploy [releng/phatality@b1a2a70]: T314098 [00:01:58] T314098: Update Phatality to reference ECS fields - https://phabricator.wikimedia.org/T314098 [00:01:58] (03PS2) 10Krinkle: Profiler: Refactor to make suitable for CLI again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874419 (https://phabricator.wikimedia.org/T253547) [00:02:09] !log cwhite@deploy1002 Finished deploy [releng/phatality@b1a2a70]: T314098 (duration: 00m 14s) [00:02:53] * Krinkle testing on mwdebug1002 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874419 [00:02:59] PROBLEM - Disk space on thanos-be2002 is CRITICAL: DISK CRITICAL - free space: / 1093 MB (2% inode=98%): /tmp 1093 MB (2% inode=98%): /var/tmp 1093 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [00:03:51] !log krinkle@deploy1002 Locking from deployment [ALL REPOSITORIES]: staging config patch --krinkle (planned duration: 60m 00s) [00:04:19] * Krinkle trying out scap-lock instead of `touch` global-lock directly [00:04:32] ok, it logs and is interactive. Interesting. 
[00:07:07] (03CR) 10Krinkle: [C: 03+2] Profiler: Refactor to make suitable for CLI again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874419 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [00:07:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P44657 and previous config saved to /var/cache/conftool/dbconfig/20230215-000717-ladsgroup.json [00:07:46] (03Merged) 10jenkins-bot: Profiler: Refactor to make suitable for CLI again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874419 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [00:08:13] !log krinkle@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: staging config patch --krinkle (duration: 04m 22s) [00:15:34] !log krinkle@deploy1002 Synchronized src/Profiler.php: Ife7bded7480946c (duration: 07m 05s) [00:19:59] RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops [00:22:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44658 and previous config saved to /var/cache/conftool/dbconfig/20230215-002224-ladsgroup.json [00:22:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [00:22:29] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [00:22:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [00:22:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T328255)', diff saved to https://phabricator.wikimedia.org/P44659 and previous config saved to 
/var/cache/conftool/dbconfig/20230215-002245-ladsgroup.json [00:23:33] (03PS4) 10Krinkle: multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) [00:23:47] RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [00:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T328255)', diff saved to https://phabricator.wikimedia.org/P44660 and previous config saved to /var/cache/conftool/dbconfig/20230215-002552-ladsgroup.json [00:26:00] (03CR) 10Krinkle: [C: 03+2] multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [00:26:19] (03PS3) 10Krinkle: speed-tests: Add captureSpeedtest.php script and publish 2023 snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888105 [00:26:39] (03Merged) 10jenkins-bot: multiversion: Remove getCachableMWConfig in favour of getConfigGlobals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885066 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [00:27:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:35:22] !log krinkle@deploy1002 Synchronized multiversion/: I3144a56e17ecb (duration: 06m 51s) [00:40:46] (03CR) 10Krinkle: [C: 03+2] speed-tests: Add captureSpeedtest.php script and publish 2023 snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888105 (owner: 10Krinkle) [00:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2178', diff saved to https://phabricator.wikimedia.org/P44661 and previous config saved to /var/cache/conftool/dbconfig/20230215-004058-ladsgroup.json [00:41:25] (03Merged) 10jenkins-bot: speed-tests: Add captureSpeedtest.php script and publish 2023 snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888105 (owner: 10Krinkle) [00:46:25] (03PS4) 10Zabe: Stop trying to read from rev_comment_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889266 (https://phabricator.wikimedia.org/T299954) [00:47:12] !log krinkle@deploy1002 Synchronized wmf-config/: I3144a56e17ecb (duration: 06m 33s) [00:49:04] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route check 2 services: maintenance [00:49:04] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check 2 services: maintenance [00:56:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P44662 and previous config saved to /var/cache/conftool/dbconfig/20230215-005604-ladsgroup.json [01:07:29] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:11:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T328255)', diff saved to https://phabricator.wikimedia.org/P44663 and previous config saved to /var/cache/conftool/dbconfig/20230215-011110-ladsgroup.json [01:11:16] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [01:11:38] (03CR) 10Zabe: [C: 03+2] Stop trying to read from rev_comment_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889266 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [01:12:15] (03Merged) 10jenkins-bot: Stop trying to read from 
rev_comment_id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889266 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [01:15:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:35] zabe: are you deploying? [01:20:12] Krinkle, yes [01:21:58] (03PS1) 10Ssingh: Revert "hiera: temporarily remove references to dns4004" [puppet] - 10https://gerrit.wikimedia.org/r/889269 [01:22:04] !log zabe@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T299954 (duration: 06m 50s) [01:22:07] k, I'm not around much longer but if you feel supported by SRE or DBA at this time, go ahead :) [01:22:08] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [01:22:18] it's pretty late everywhere [01:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:22:35] I'm done [01:23:00] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily remove references to dns4004" [puppet] - 10https://gerrit.wikimedia.org/r/889269 (owner: 10Ssingh) [01:23:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS buster [01:23:39] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:23:48] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster [01:24:09] PROBLEM - Check unit status of 
httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:27:41] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:27:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:28:19] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:28:29] (03PS5) 10Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) [01:28:35] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:28:38] ^ expected [01:29:36] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39618/console" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [01:34:46] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [01:34:48] (03PS1) 10Ssingh: P:dns::recursor: set webserver port to 9199 [puppet] - 10https://gerrit.wikimedia.org/r/889288 (https://phabricator.wikimedia.org/T321309) [01:34:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:36:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Papaul) @Jhancock.wm the racking proposal in the description of this task says: ` wdqs201[3-5] will replace wdqs200[4-6]. wdqs20[16-22] will be net-new hosts.... [01:41:28] (03CR) 10Ssingh: [V: 03+1] "There is more to be done here before we can expose /metrics." [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [01:41:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [01:44:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [01:52:44] PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [01:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:58:28] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:03:31] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1001.eqiad.wmnet with OS bullseye [02:04:14] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:29] (JobUnavailable) firing: (13) Reduced availability for job calico-felix in aux-k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:14] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICALError response or zero answers: https://wikitech.wikimedia.org/wiki/DNS [02:10:43] ^ expected [02:12:32] (03PS1) 10Ssingh: bird: remove validate_cmd (possibly temporary; debugging reimaging) [puppet] - 10https://gerrit.wikimedia.org/r/889291 [02:12:53] (03CR) 10CI reject: [V: 04-1] bird: remove validate_cmd (possibly temporary; debugging reimaging) [puppet] - 10https://gerrit.wikimedia.org/r/889291 (owner: 10Ssingh) [02:13:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39619/console" [puppet] - 10https://gerrit.wikimedia.org/r/889291 (owner: 10Ssingh) [02:14:48] (03PS2) 10Ssingh: bird: remove validate_cmd (possibly temporary; debugging reimaging) [puppet] - 10https://gerrit.wikimedia.org/r/889291 [02:15:44] (03CR) 10Ssingh: [C: 03+2] bird: remove validate_cmd (possibly temporary; debugging reimaging) [puppet] - 10https://gerrit.wikimedia.org/r/889291 (owner: 10Ssingh) [02:19:12] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:14] (JobUnavailable) firing: (13) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:29] (JobUnavailable) firing: (13) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:48] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:54:40] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.004e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [03:08:58] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (10Papaul) [03:18:40] (03PS1) 10Ssingh: Revert "Revert "hiera: temporarily remove references to dns4004"" [puppet] - 10https://gerrit.wikimedia.org/r/889270 [03:19:09] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "hiera: temporarily remove references to dns4004"" [puppet] - 10https://gerrit.wikimedia.org/r/889270 (owner: 10Ssingh) [03:22:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dns4004.wikimedia.org with reason: Puppet failure during reimaging [03:22:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns4004.wikimedia.org with reason: Puppet failure during reimaging [03:31:58] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:42:48] (03PS1) 10Andrea Denisse: quickdatacopy: Add support to open files with O_NOATIME [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) [03:46:43] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39620/console" [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) (owner: 10Andrea Denisse) [04:01:01] (03CR) 10Andrea Denisse: "PCC results: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39620/console" [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) (owner: 10Andrea Denisse) [04:27:05] !log fab@deploy1002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided) [04:27:17] !log fab@deploy1002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 11s) [04:27:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:02:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [05:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:34:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:12:20] (03PS36) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [06:24:14] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T0700) [07:18:12] (03CR) 10Muehlenhoff: [C: 03+2] Remove role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/888722 (https://phabricator.wikimedia.org/T324475) (owner: 10Muehlenhoff) [07:19:15] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Retire role::spare::system - https://phabricator.wikimedia.org/T324475 (10MoritzMuehlenhoff) 05Open→03Resolved This has been removed. 
[07:31:58] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:50:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:54:46] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:56:43] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [07:58:13] (03PS10) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [07:59:09] !log rolling upgrade to HAProxy 2.6.8-2 in cp nodes [07:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:58] PROBLEM - puppet last run on an-presto1006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:06:06] RECOVERY - puppet last run on an-presto1006 is OK: OK: Puppet is currently disabled (Create presto cluster for perf testing - T329525 - nfraison), not alerting. Last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:06:26] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-etcd2001.codfw.wmnet with reason: host reimage [08:06:56] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [08:06:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:07:00] (03PS1) 10Vgutierrez: cache::haproxy: Update to 2.6.8-2 globally [puppet] - 10https://gerrit.wikimedia.org/r/889475 (https://phabricator.wikimedia.org/T321775) [08:07:29] (JobUnavailable) firing: (12) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:09:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-etcd2001.codfw.wmnet with reason: host reimage [08:10:40] (03PS1) 10Muehlenhoff: Switch urldownloader in codfw to 2001 [dns] - 10https://gerrit.wikimedia.org/r/889477 (https://phabricator.wikimedia.org/T327991) [08:12:09] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39622/console" [puppet] - 10https://gerrit.wikimedia.org/r/889475 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [08:17:29] (JobUnavailable) firing: (12) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:18:11] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Update to 2.6.8-2 globally [puppet] - 10https://gerrit.wikimedia.org/r/889475 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [08:25:58] !log cdanis@cumin1001 START - Cookbook sre.ganeti.reimage for host aux-k8s-worker1001.eqiad.wmnet with OS bullseye [08:26:32] (03PS2) 10Filippo Giunchedi: admin: add jon-amar-wmde [puppet] - 10https://gerrit.wikimedia.org/r/888167 (https://phabricator.wikimedia.org/T329324) [08:27:21] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add jon-amar-wmde [puppet] - 10https://gerrit.wikimedia.org/r/888167 (https://phabricator.wikimedia.org/T329324) (owner: 10Filippo Giunchedi) [08:27:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:30:25] 10SRE, 10LDAP-Access-Requests: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (10fgiunchedi) 05Open→03Resolved Thank you @KFrancis ! @jon_amar-WMDE I've added you to `nda` and `wmde` ldap groups, I'm resolving the task though feel free to reopen as needed! 
[08:30:32] 10SRE, 10LDAP-Access-Requests: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (10fgiunchedi) [08:30:55] (03PS1) 10Elukey: role::etcd::v3::ml_etcd::staging: remove bootstrap flag [puppet] - 10https://gerrit.wikimedia.org/r/889479 (https://phabricator.wikimedia.org/T327767) [08:31:58] (KubernetesCalicoDown) firing: aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:32:56] (03CR) 10Klausman: [C: 03+1] role::etcd::v3::ml_etcd::staging: remove bootstrap flag [puppet] - 10https://gerrit.wikimedia.org/r/889479 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:33:16] (03CR) 10Elukey: [C: 03+2] role::etcd::v3::ml_etcd::staging: remove bootstrap flag [puppet] - 10https://gerrit.wikimedia.org/r/889479 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:36:25] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1001.eqiad.wmnet with reason: host reimage [08:36:47] this is me, I am progressing Chris' cookbook --^ [08:39:26] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1001.eqiad.wmnet with reason: host reimage [08:40:24] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: Remove broken HeaderName/ReadmeName for arclamp file listing [puppet] - 10https://gerrit.wikimedia.org/r/889161 (owner: 10Krinkle) [08:45:27] (03CR) 10Filippo Giunchedi: [C: 03+1] quickdatacopy: Add option to show progress during transfer [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T329683) (owner: 10Andrea Denisse) [08:45:36] (03CR) 10Filippo Giunchedi: [C: 03+1] centrallog: Show transfer progress when using quickdatacopy 
[puppet] - 10https://gerrit.wikimedia.org/r/889239 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [08:48:33] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Vgutierrez) It looks more like a generic response to delegate a whole domain to shopify than what we are currently doing with store.wikimedia.org. As @Dzahn mentioned, we got store.wi... [08:52:19] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade [08:52:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) >>! In T329272#8614270, @jbond wrote: >> alarms: true we can set based on the device model (false by default as we... [08:53:39] !log cdanis@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aux-k8s-worker1001.eqiad.wmnet with OS bullseye [08:54:51] !log cdanis@cumin1001 START - Cookbook sre.ganeti.reimage for host aux-k8s-worker1002.eqiad.wmnet with OS bullseye [08:55:17] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [08:55:50] !log truncate -s 2GB /srv/log/swift/server.log.1 on thanos-be2001 to free space in / [08:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:56] (03CR) 10Elukey: [C: 03+1] admin_ng: update aux's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889194 (owner: 10CDanis) [09:00:04] hashar and ^demon: Your horoscope predicts another unfortunate MediaWiki train - Utc-0+Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T0900). 
[09:01:58] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:05:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [09:05:56] (03PS1) 10KartikMistry: Update cxserver to 2023-02-15-085109-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889483 (https://phabricator.wikimedia.org/T328310) [09:06:24] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: host reimage [09:06:58] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2002.codfw.wmnet with OS bullseye [09:09:31] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: host reimage [09:11:18] 10SRE, 10Patch-For-Review: Rsync quickdatacopy doesn't show progress during transfer - https://phabricator.wikimedia.org/T329683 (10Aklapper) [09:11:23] 10SRE, 10Patch-For-Review: Rsync quickdatacopy copies files with atime creating a huge number of iops and a slow sync - https://phabricator.wikimedia.org/T329695 (10Aklapper) [09:16:33] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-etcd2002.codfw.wmnet with reason: host reimage [09:17:42] (03PS2) 10Muehlenhoff: Fail over to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/889153 [09:19:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-etcd2002.codfw.wmnet with reason: host reimage [09:19:45] 10SRE, 10Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10Vgutierrez) 05In progress→03Resolved We are now running 2.6.8-2~bpo10+1 globally [09:19:58] (03CR) 10Muehlenhoff: [C: 03+2] Fail over to 
idp1002 [dns] - 10https://gerrit.wikimedia.org/r/889153 (owner: 10Muehlenhoff) [09:22:08] (03CR) 10Filippo Giunchedi: "Idea LGTM, though as far as I can see --open-noatime is not a thing on Buster (e.g. on centrallog1001) so it wouldn't work fleet wide just" [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) (owner: 10Andrea Denisse) [09:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:23:28] (03PS5) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 [09:23:42] !log cdanis@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aux-k8s-worker1002.eqiad.wmnet with OS bullseye [09:23:45] !log cdanis@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: upgrade to v1.23 [09:26:11] (03CR) 10David Caro: puppet: improvements to replica_cnf_api functional tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [09:29:34] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Install software version upgrade [09:30:11] (03CR) 10Filippo Giunchedi: quickdatacopy: Add support to open files with O_NOATIME (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) (owner: 10Andrea Denisse) [09:32:04] (03PS2) 10DCausse: rdf-streaming-updater: Increase memory limit from 2 to 4GiB [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [09:32:47] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) 
on GitLab host gitlab2002.wikimedia.org with reason: Install software version upgrade [09:33:15] (03PS2) 10Muehlenhoff: Switch urldownloader in codfw to 2001 [dns] - 10https://gerrit.wikimedia.org/r/889477 (https://phabricator.wikimedia.org/T327991) [09:33:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:33:51] (03PS1) 10JMeybohm: Kubernets masters: include profile::kubernetes::client [puppet] - 10https://gerrit.wikimedia.org/r/889486 (https://phabricator.wikimedia.org/T307943) [09:34:01] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [09:34:46] (03CR) 10Filippo Giunchedi: quickdatacopy: Add support to open files with O_NOATIME (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) (owner: 10Andrea Denisse) [09:35:01] (03PS2) 10JMeybohm: Kubernetes masters: include profile::kubernetes::client [puppet] - 10https://gerrit.wikimedia.org/r/889486 (https://phabricator.wikimedia.org/T307943) [09:37:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39623/console" [puppet] - 10https://gerrit.wikimedia.org/r/889486 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:39:08] (03CR) 10Elukey: [C: 03+1] Kubernetes masters: include profile::kubernetes::client [puppet] - 10https://gerrit.wikimedia.org/r/889486 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:40:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-staging-etcd2002.codfw.wmnet with OS bullseye [09:41:25] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2003.codfw.wmnet with OS bullseye [09:41:44] 
(03CR) 10Muehlenhoff: [C: 03+2] Switch urldownloader in codfw to 2001 [dns] - 10https://gerrit.wikimedia.org/r/889477 (https://phabricator.wikimedia.org/T327991) (owner: 10Muehlenhoff) [09:43:10] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10MoritzMuehlenhoff) [09:45:49] (03CR) 10Clément Goubert: "This change is ready for review." [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [09:47:04] (03CR) 10CI reject: [V: 04-1] mysql_legacy: remove x2 handling logic [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [09:49:56] (03PS3) 10Clément Goubert: mysql_legacy: remove x2 handling logic [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) [09:51:37] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-etcd2003.codfw.wmnet with reason: host reimage [09:52:01] (03PS1) 10JMeybohm: Make aux-k8s infrastructure_user tokens uniqe [labs/private] - 10https://gerrit.wikimedia.org/r/889492 [09:52:31] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Make aux-k8s infrastructure_user tokens uniqe [labs/private] - 10https://gerrit.wikimedia.org/r/889492 (owner: 10JMeybohm) [09:53:03] (03CR) 10Aqu: "I'm not sure about the condition." 
[puppet] - 10https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [09:53:14] (03CR) 10CI reject: [V: 04-1] mysql_legacy: remove x2 handling logic [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [09:54:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-etcd2003.codfw.wmnet with reason: host reimage [09:54:21] !log installing openjdk-11 security updates [09:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:45] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39625/console" [puppet] - 10https://gerrit.wikimedia.org/r/889486 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:55:03] (03PS1) 10MVernon: Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 [09:55:25] (03CR) 10CI reject: [V: 04-1] Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 (owner: 10MVernon) [09:55:46] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Kubernetes masters: include profile::kubernetes::client [puppet] - 10https://gerrit.wikimedia.org/r/889486 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:56:28] (03PS2) 10MVernon: Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 (https://phabricator.wikimedia.org/T279621) [09:56:59] (03CR) 10CI reject: [V: 04-1] Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [09:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:57:57] (03PS3) 10MVernon: Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 (https://phabricator.wikimedia.org/T279621) [09:58:58] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:05:33] (03PS1) 10Filippo Giunchedi: hieradata: set logs-api in 'production' [puppet] - 10https://gerrit.wikimedia.org/r/889494 (https://phabricator.wikimedia.org/T320702) [10:07:12] (03PS14) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [10:08:12] (03PS2) 10Filippo Giunchedi: hieradata: set logs-api in 'production' [puppet] - 10https://gerrit.wikimedia.org/r/889494 (https://phabricator.wikimedia.org/T320702) [10:09:51] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-staging-etcd2003.codfw.wmnet with OS bullseye [10:10:34] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-ctrl2001.codfw.wmnet with OS bullseye [10:10:50] 10SRE-swift-storage: Thanos root filesystem filling with logs - https://phabricator.wikimedia.org/T329712 (10MatthewVernon) [10:11:28] (03PS4) 10MVernon: Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 (https://phabricator.wikimedia.org/T279621) [10:11:56] (03PS1) 10MVernon: swift: nodelaycompress for swift logs [puppet] - 10https://gerrit.wikimedia.org/r/889497 (https://phabricator.wikimedia.org/T329712) [10:13:10] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Install software version upgrade [10:13:23] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[10:13:26] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:13:31] 10SRE-swift-storage, 10Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10MatthewVernon) 05Open→03Stalled Marking this as stalled, as we've had to restore thanos-be[1,2]004 to thanos use, meaning we can't progress this task until next fiscal year when... [10:14:12] (03CR) 10MVernon: [C: 03+2] Revert "thanos: drain thanos-be[1,2]004" [puppet] - 10https://gerrit.wikimedia.org/r/889277 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:14:26] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [10:14:28] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:14:55] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [10:14:57] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:15:28] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [10:15:32] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:15:40] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[10:17:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:17:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [10:18:16] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: nodelaycompress for swift logs [puppet] - 10https://gerrit.wikimedia.org/r/889497 (https://phabricator.wikimedia.org/T329712) (owner: 10MVernon) [10:18:26] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: don't deploy benthos instances with prod config [puppet] - 10https://gerrit.wikimedia.org/r/889198 (owner: 10Herron) [10:19:08] (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) [10:19:14] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:20:36] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [10:21:00] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [10:21:10] 10SRE-swift-storage, 10Patch-For-Review: Thanos root filesystem filling with logs - https://phabricator.wikimedia.org/T329712 (10ops-monitoring-bot) Host rebooted by mvernon@cumin1001 with reason: clear up damage from full / [10:21:41] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: host reimage [10:21:58] (KubernetesCalicoDown) resolved: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:22:18] (03CR) 10Elukey: [C: 03+2] admin_ng: update aux's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889194 (owner: 10CDanis) [10:23:58] (03PS4) 10Clément Goubert: mysql_legacy: remove x2 handling logic [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) [10:24:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: host reimage [10:25:45] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:26:58] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:29:02] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [10:29:05] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:31:14] (03PS5) 10Clément Goubert: mysql_legacy: remove x2 handling logic [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) [10:31:58] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:31:58] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:33:33] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[10:36:26] (03CR) 10Jbond: "lgtm just a minor nit to make sure we trigger ci" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [10:36:58] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:38:24] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:05] !log discard /var/spool/rsyslog on thanos-be2001 T329712 [10:39:05] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [10:39:07] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:10] T329712: Thanos root filesystem filling with logs - https://phabricator.wikimedia.org/T329712 [10:39:40] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:39:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-staging-ctrl2001.codfw.wmnet with OS bullseye [10:40:12] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-ctrl2002.codfw.wmnet with OS bullseye [10:41:58] (KubernetesCalicoDown) resolved: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:41:58] (KubernetesCalicoDown) resolved: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:42:22] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[10:45:20] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2001.codfw.wmnet [10:47:33] (KubernetesRsyslogDown) firing: (3) rsyslog on ml-staging-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:47:53] (03CR) 10Btullis: [V: 03+1] Do not install spark2 on bullseye or later (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:49:14] (KubernetesCalicoDown) firing: (3) ml-staging-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:50:14] (03PS2) 10Btullis: Do not install spark2 on bullseye or later [puppet] - 10https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) [10:51:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: host reimage [10:52:25] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[10:52:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:54:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: host reimage [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1100) [11:03:02] 10SRE-tools, 10Infrastructure-Foundations: improvments to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10jbond) p:05Triage→03Medium [11:08:32] (03CR) 10Hnowlan: [C: 03+2] Bump Thumbor minor version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) (owner: 10Hnowlan) [11:09:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ml-staging-ctrl2002.codfw.wmnet with OS bullseye [11:09:48] 10SRE-tools, 10Infrastructure-Foundations: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10Aklapper) [11:10:31] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bullseye [11:13:18] PROBLEM - Host ml-staging2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:15:58] (KubernetesCalicoDown) firing: aux-k8s-worker1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-worker1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:16:25] (03Merged) 10jenkins-bot: Bump Thumbor minor version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) (owner: 10Hnowlan) [11:17:29] (KubernetesCalicoDown) 
firing: (3) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:17:29] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:17:38] (03CR) 10Jbond: "hmm i tested this on my lapto and the purge function hangs, see paste https://phabricator.wikimedia.org/P44664" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [11:18:32] RECOVERY - Host ml-staging2001 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [11:20:55] (03CR) 10MVernon: [C: 03+2] swift: nodelaycompress for swift logs [puppet] - 10https://gerrit.wikimedia.org/r/889497 (https://phabricator.wikimedia.org/T329712) (owner: 10MVernon) [11:20:58] (KubernetesCalicoDown) resolved: aux-k8s-worker1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-worker1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:21:55] !log thanos-be2001 rm /srv/swift-storage/sda3/tmp/c10e5844-1b19-4c8d-b474-801ad3dd6849 [11:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:10] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [11:22:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:22:42] !log thanos-be2001 rm /srv/swift-storage/sda3/tmp/b0e33b98-f8be-409b-a9d2-246ad5812db0 [11:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:02] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-airflow1005.eqiad.wmnet with OS bullseye [11:24:44] (03CR) 10Muehlenhoff: Purge unused kernels on boot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [11:25:04] PROBLEM - Host ml-staging2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:26] (03CR) 10Jbond: Purge unused kernels on boot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [11:27:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [11:27:22] RECOVERY - Host ml-staging2001 is UP: PING OK - Packet loss = 0%, RTA = 31.83 ms [11:32:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [11:35:37] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) [11:38:36] (03CR) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:45:50] (03CR) 10Vgutierrez: [C: 03+1] service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [11:45:55] (03CR) 10Muehlenhoff: [C: 03+2] Remove further files related to removed pybal health checks [puppet] - 
10https://gerrit.wikimedia.org/r/889127 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [11:49:19] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [11:49:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2001.codfw.wmnet with OS bullseye [11:49:39] (03PS1) 10Samtar: InitialiseSettings: install PageAssessments on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889528 (https://phabricator.wikimedia.org/T328224) [11:51:03] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [11:52:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:04:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10ayounsi) > i was also curious what is the is_pool property used for? It's best effort when created prefixes and not used for anything, it can be... 
[12:09:19] (03PS1) 10Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) [12:12:07] (03PS1) 10EoghanGaffney: Add puppet role to new aphlict VM [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) [12:15:05] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39627/console" [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:15:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10Ladsgroup) FWIW, mw should not send this many cross-dc connections to databases but I assume it's a different aspect of this problem. [12:16:07] (03CR) 10Ladsgroup: "Manuel is out, I review this." [software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [12:17:19] (03CR) 10Ladsgroup: [C: 03+1] "the db part looks good to me." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/889490 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [12:17:29] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:21:53] (03CR) 10Jelto: [C: 03+2] gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [12:24:01] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4004.wikimedia.org with OS buster [12:24:10] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster executed with errors: - dns4004 (**FAIL**) - Downtimed on... 
[12:27:29] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:28:38] (03PS9) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [12:33:09] (03CR) 10CI reject: [V: 04-1] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [12:36:55] (03PS1) 10Jcrespo: mediabackups: Test backup update on testwiki for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/889534 (https://phabricator.wikimedia.org/T327157) [12:43:06] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/889507 (https://phabricator.wikimedia.org/T329730) [12:47:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T329730 [12:47:08] T329730: Switchover s3 master (db2127 -> db2105) - https://phabricator.wikimedia.org/T329730 [12:47:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T329730 [12:47:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2105 with weight 0 T329730', diff saved to https://phabricator.wikimedia.org/P44666 and previous config saved to /var/cache/conftool/dbconfig/20230215-124729-ladsgroup.json [12:48:37] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q3), 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10fgiunchedi) This is a first try/iteration I attempted today, and came up with this (with newlines for clarit... 
[12:48:39] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [12:48:54] (03PS2) 10Jelto: jenkins: fix directory in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) [12:49:22] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [12:50:02] (03PS1) 10Cathal Mooney: Add fgoodwin@wikimedia.org to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/889535 (https://phabricator.wikimedia.org/T329404) [12:51:24] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:51:25] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Test backup update on testwiki for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/889534 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [12:51:47] (03PS11) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [12:52:28] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:52:44] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:52:54] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:53:07] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39629/console" [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:54:14] (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:55:24] (03CR) 10Jelto: jenkins: fix directory in sudo rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [12:56:01] jouncebot: nowandnext [12:56:02] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [12:56:02] In 1 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1400) [12:57:29] (KubernetesRsyslogDown) firing: (3) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:05:23] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/889507 
(https://phabricator.wikimedia.org/T329730) (owner: 10Gerrit maintenance bot) [13:05:32] (03PS2) 10Ladsgroup: mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/889507 (https://phabricator.wikimedia.org/T329730) (owner: 10Gerrit maintenance bot) [13:05:34] (03CR) 10Ladsgroup: [V: 03+2] mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/889507 (https://phabricator.wikimedia.org/T329730) (owner: 10Gerrit maintenance bot) [13:06:16] !log Starting s3 codfw failover from db2127 to db2105 - T329730 [13:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:20] T329730: Switchover s3 master (db2127 -> db2105) - https://phabricator.wikimedia.org/T329730 [13:06:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2105 to s3 primary T329730', diff saved to https://phabricator.wikimedia.org/P44667 and previous config saved to /var/cache/conftool/dbconfig/20230215-130653-ladsgroup.json [13:07:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/889535 (https://phabricator.wikimedia.org/T329404) (owner: 10Cathal Mooney) [13:07:43] (03CR) 10Cathal Mooney: [C: 03+2] Add fgoodwin@wikimedia.org to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/889535 (https://phabricator.wikimedia.org/T329404) (owner: 10Cathal Mooney) [13:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2127 T329730', diff saved to https://phabricator.wikimedia.org/P44668 and previous config saved to /var/cache/conftool/dbconfig/20230215-130822-ladsgroup.json [13:11:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-data for Fgoodwin - https://phabricator.wikimedia.org/T329404 (10cmooney) @FGoodwin sorry about the delay on this one. I've added you to the analytics-privatedata-users LDAP group now, and also added a Kerberos principal for you. I believe... 
[13:14:06] (03CR) 10Slyngshede: [C: 03+2] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:15:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:15:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:15:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:15:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:21:06] jouncebot: nowandnext [13:21:06] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [13:21:06] In 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1400) [13:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:24:59] any objections to me running `mwscript extensions/WikimediaMaintenance/createExtensionTables.php newiki pageassessments` now, prior to the deployment window which needs it? 
(ref T328224) [13:25:00] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [13:27:12] (03PS2) 10Ladsgroup: [WIP] mwscript: Switch to use run.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) [13:27:17] !log `[samtar@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php newiki pageassessments` T328224 [13:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:26] (03CR) 10Clément Goubert: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 (owner: 10Clément Goubert) [13:34:14] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:34:59] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: always download latest file [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 [13:36:50] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: always download latest file [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 (owner: 10Jbond) [13:38:12] ACKNOWLEDGEMENT - Check systemd state on idm-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-idm.service Slyngshede Waiting for new version of Bitu IDM https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:59] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: always download latest file [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 [13:40:05] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bullseye [13:40:55] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: always download latest file [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 (owner: 10Jbond) [13:42:12] (03PS5) 10Clément Goubert: sre.discovery.datacenter: add 
--fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [13:42:49] PROBLEM - Host ml-staging2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:10] (03CR) 10Clément Goubert: sre.discovery.datacenter: add --fast-insecure switch for pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [13:44:14] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:46:38] (03Abandoned) 10Jelto: gitlab_runner: add option to drop Docker capabilities [puppet] - 10https://gerrit.wikimedia.org/r/773746 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:47:08] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39630/console" [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [13:47:10] (03PS2) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/888671 (owner: 10L10n-bot) [13:47:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:48:02] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/888671 (owner: 10L10n-bot) [13:48:19] RECOVERY - Host ml-staging2002 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [13:49:03] (03PS2) 10Samtar: Enable DiscussionTools on mobile at almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889204 (https://phabricator.wikimedia.org/T328940) (owner: 10Bartosz Dziewoński) [13:52:32] (03CR) 10Jaime Nuche: [C: 03+1] "The change makes sense to me. +1 in case that helps move things along. Seems like feedback from Antoine is still required though." [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [13:53:19] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: always download latest file [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 (https://phabricator.wikimedia.org/T329722) [13:53:44] MatmaRex: we can probably begin a little early if you'd like? [13:53:55] (03PS6) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 [13:55:08] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: always download latest file [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [13:57:00] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1400). [14:00:05] MatmaRex and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:07] hi [14:00:11] * TheresNoTime can deploy [14:00:16] TheresNoTime: sorry, i was away D: [14:00:32] it's okay :D going to start with 889204: Enable DiscussionTools on mobile at almost all wikis [14:00:52] the backports are just for the maint script, otherwise no-ops [14:00:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889204 (https://phabricator.wikimedia.org/T328940) (owner: 10Bartosz Dziewoński) [14:01:07] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (10jbond) > in the same situation of above, if selecting download from dell it then fails because tries to install the same version and apparently the iDRAC is... [14:01:10] (03PS3) 10Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 [14:01:33] (03Merged) 10jenkins-bot: Enable DiscussionTools on mobile at almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889204 (https://phabricator.wikimedia.org/T328940) (owner: 10Bartosz Dziewoński) [14:02:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [14:02:03] (03CR) 10Jbond: "This works well for bios and drac, where the is normally only one download file but is not so great for nics where there are multiple down" [cookbooks] - 10https://gerrit.wikimedia.org/r/889541 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond) [14:02:07] !log samtar@deploy1002 Started scap: Backport for [[gerrit:889204|Enable DiscussionTools on mobile at almost all wikis (T328940)]] [14:02:11] T328940: [Config Change] Enable all DiscussionTools as default-on features at Phase 1 wikis (mobile) - https://phabricator.wikimedia.org/T328940 [14:02:11] (03CR) 10Samtar: [C: 03+2] "deploy" 
[extensions/DiscussionTools] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/889267 (https://phabricator.wikimedia.org/T329627) (owner: 10Bartosz Dziewoński) [14:02:21] (03CR) 10Samtar: [C: 03+2] "deploy" [extensions/DiscussionTools] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889268 (https://phabricator.wikimedia.org/T329627) (owner: 10Bartosz Dziewoński) [14:03:59] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:889204|Enable DiscussionTools on mobile at almost all wikis (T328940)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:04:06] MatmaRex: 889204 is live on mwdebug, can you test? [14:04:56] TheresNoTime: yep. looks good [14:05:06] cool :) [14:05:22] will just run the 2 backports through and then start the script [14:05:51] thanks [14:07:53] (03Merged) 10jenkins-bot: persistRevisionThreadItems: Avoid listing non-discussion pages [extensions/DiscussionTools] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/889267 (https://phabricator.wikimedia.org/T329627) (owner: 10Bartosz Dziewoński) [14:07:56] (03Merged) 10jenkins-bot: persistRevisionThreadItems: Avoid listing non-discussion pages [extensions/DiscussionTools] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889268 (https://phabricator.wikimedia.org/T329627) (owner: 10Bartosz Dziewoński) [14:10:26] (03PS2) 10Samtar: InitialiseSettings: install PageAssessments on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889528 (https://phabricator.wikimedia.org/T328224) [14:10:56] (03CR) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [14:11:21] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:889204|Enable DiscussionTools on mobile at almost all wikis (T328940)]] (duration: 09m 13s) [14:11:26] 
T328940: [Config Change] Enable all DiscussionTools as default-on features at Phase 1 wikis (mobile) - https://phabricator.wikimedia.org/T328940 [14:11:53] !log samtar@deploy1002 Started scap: Backport for [[gerrit:889267|persistRevisionThreadItems: Avoid listing non-discussion pages (T329627)]], [[gerrit:889268|persistRevisionThreadItems: Avoid listing non-discussion pages (T329627)]] [14:11:57] T329627: Optimize permalinks maintenance script to avoid listing non-discussion pages - https://phabricator.wikimedia.org/T329627 [14:12:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:12:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:13:34] !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:13:45] !log samtar@deploy1002 matmarex and samtar: Backport for [[gerrit:889267|persistRevisionThreadItems: Avoid listing non-discussion pages (T329627)]], [[gerrit:889268|persistRevisionThreadItems: Avoid listing non-discussion pages (T329627)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:13:59] (syncing) [14:14:10] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:14:39] (03PS1) 10Jelto: gitlab: remove dedicated restore logfile and log to syslog only [puppet] - 10https://gerrit.wikimedia.org/r/889546 (https://phabricator.wikimedia.org/T326315) [14:19:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2002.codfw.wmnet with OS bullseye [14:19:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 
[14:19:28] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:889267|persistRevisionThreadItems: Avoid listing non-discussion pages (T329627)]], [[gerrit:889268|persistRevisionThreadItems: Avoid listing non-discussion pages (T329627)]] (duration: 07m 34s) [14:19:30] (03PS3) 10Bking: rdf-streaming-updater: Increase memory limit from 2 to 4GiB [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) [14:19:32] T329627: Optimize permalinks maintenance script to avoid listing non-discussion pages - https://phabricator.wikimedia.org/T329627 [14:19:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:19:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:19:56] !log `samtar@mwmaint1002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwikinews --current --all | tee persistRevisionThreadItems.out.txt` in screen session `25805.T315510` for T315510 [14:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:00] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:20:19] MatmaRex: done, script running ^ [14:20:31] (03CR) 10Bking: rdf-streaming-updater: Increase memory limit from 2 to 4GiB (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [14:21:04] thanks TheresNoTime! 
[14:21:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889528 (https://phabricator.wikimedia.org/T328224) (owner: 10Samtar) [14:22:03] !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:22:14] (03Merged) 10jenkins-bot: InitialiseSettings: install PageAssessments on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889528 (https://phabricator.wikimedia.org/T328224) (owner: 10Samtar) [14:22:27] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:22:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:22:40] !log samtar@deploy1002 Started scap: Backport for [[gerrit:889528|InitialiseSettings: install PageAssessments on newiki (T328224)]] [14:22:44] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [14:24:27] !log samtar@deploy1002 samtar: Backport for [[gerrit:889528|InitialiseSettings: install PageAssessments on newiki (T328224)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:24:33] (testing) [14:25:23] (syncing) [14:26:49] (03CR) 10EoghanGaffney: [V: 03+1] "The puppet catalog diff looks as expected, and doesn't seem to hardcode anything to the wrong hostname (this tallies with the quick check " [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [14:27:25] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:28:37] (Nonwrite 
HTTP requests with primary DB connections alert) firing: Nonwrite HTTP requests with primary DB connections alert - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [14:29:22] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:29:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:29:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:30:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:30:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:30:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:30:45] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:31:06] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:889528|InitialiseSettings: install PageAssessments on newiki (T328224)]] (duration: 08m 25s) [14:31:10] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [14:31:16] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:31:22] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:31:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:32:13] !log closing UTC afternoon backport window [14:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:29] (KubernetesRsyslogDown) resolved: (3) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:32:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Hokwelum) Hello, just checking in to find out what is going on with the OS installation on dumpsdata1006, and please, when will it be ready for use? [14:33:45] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:33:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:33:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:34:14] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:36:33] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Score: "FileBackendError: Iterator page I/O error" on a page on Beta Cluster - https://phabricator.wikimedia.org/T329744 (10matmarex) [14:37:29] (KubernetesCalicoDown) resolved: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:38:35] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:38:39] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) 
upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:38:49] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:39:12] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:43:07] (03CR) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [14:43:27] (03CR) 10Nicolas Fraison: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [14:44:04] (03CR) 10Herron: [C: 03+1] hieradata: set logs-api in 'production' [puppet] - 10https://gerrit.wikimedia.org/r/889494 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [14:44:14] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:46:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Jclark-ctr) [14:48:13] (03CR) 10Herron: [C: 03+1] quickdatacopy: Add option to show progress during transfer [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T329683) (owner: 10Andrea Denisse) [14:50:18] (03CR) 10Herron: [C: 03+1] centrallog: Show transfer progress when using quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/889239 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [14:50:41] (03CR) 10Herron: [C: 03+2] pontoon: don't deploy benthos instances with prod config [puppet] - 
10https://gerrit.wikimedia.org/r/889198 (owner: 10Herron) [14:51:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:51:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:51:37] (03CR) 10JHathaway: [C: 03+1] "looks good, thanks!" [labs/private] - 10https://gerrit.wikimedia.org/r/889492 (owner: 10JMeybohm) [14:55:21] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:55:26] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) @Vgutierrez, thanks for your share. May I bug you for a comprehensive summary if possible, of what is needed, what I should put more pressure on, what to say no, etc, so I can... [14:55:40] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1001.eqiad.wmnet'] [14:55:58] (03CR) 10Andrea Denisse: [C: 03+2] quickdatacopy: Add option to show progress during transfer [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T329683) (owner: 10Andrea Denisse) [14:57:18] (03CR) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [14:57:35] (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Show transfer progress when using quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/889239 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [14:58:38] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 
(10ayounsi) Circling back on the network side config now that there are a few patches out to improve the server side.... [14:59:14] (KubernetesCalicoDown) resolved: ml-staging-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:01:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:01:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:01:49] (03CR) 10CDanis: [C: 03+1] Make aux-k8s infrastructure_user tokens uniqe [labs/private] - 10https://gerrit.wikimedia.org/r/889492 (owner: 10JMeybohm) [15:02:07] 10SRE, 10Patch-For-Review: Rsync quickdatacopy doesn't show progress during transfer - https://phabricator.wikimedia.org/T329683 (10andrea.denisse) 05Open→03Resolved [15:03:37] (Nonwrite HTTP requests with primary DB connections alert) firing: (2) Nonwrite HTTP requests with primary DB connections alert - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [15:03:55] jouncebot: nowandnext [15:03:55] No deployments scheduled for the next 2 hour(s) and 56 minute(s) [15:03:56] In 2 hour(s) and 56 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1800) [15:04:01] gooood [15:09:19] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Vgutierrez) >>! In T128559#8618724, @SHust wrote: > @Vgutierrez, thanks for your share. May I bug you for a comprehensive summary if possible, of what is needed, what I should put mor... 
[15:11:51] (03PS1) 10Ladsgroup: Migrate EventLogging config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) [15:12:17] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q3), 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10CDanis) Thanks! Let's give it a try :) [15:12:24] (03CR) 10CI reject: [V: 04-1] Migrate EventLogging config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889557 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [15:13:01] (03CR) 10Atieno: [C: 04-1] "nitpick: Typo in commit description "endagered" did we mean "endangered"..?" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/883564 (https://phabricator.wikimedia.org/T327887) (owner: 10Hnowlan) [15:13:19] (03CR) 10JMeybohm: [C: 03+1] rdf-streaming-updater: Increase memory limit from 2 to 4GiB [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [15:14:09] (03PS2) 10Hnowlan: imagemagick: use JSON output from exiftool [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/883564 (https://phabricator.wikimedia.org/T327887) [15:14:26] (03CR) 10Hnowlan: imagemagick: use JSON output from exiftool (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/883564 (https://phabricator.wikimedia.org/T327887) (owner: 10Hnowlan) [15:15:25] (03PS4) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) [15:15:28] (03CR) 10JHathaway: Purge unused kernels on boot (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [15:15:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 10%: Maint over', diff saved to 
https://phabricator.wikimedia.org/P44670 and previous config saved to /var/cache/conftool/dbconfig/20230215-151545-ladsgroup.json [15:16:13] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: Increase memory limit from 2 to 4GiB [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [15:18:37] (Nonwrite HTTP requests with primary DB connections alert) resolved: Nonwrite HTTP requests with primary DB connections alert - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [15:24:58] !log bking@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:26:55] !log bking@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:26:56] (03CR) 10CDanis: "Seems overall reasonable to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 (owner: 10JHathaway) [15:27:19] !log bking@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:27:39] !log bking@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:29:05] (03CR) 10JHathaway: Add jaeger chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 (owner: 10JHathaway) [15:29:37] !log bking@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:30:22] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:30:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44671 and previous config saved to /var/cache/conftool/dbconfig/20230215-153050-ladsgroup.json [15:30:58] (KubernetesCalicoDown) firing: ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:31:48] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) [15:32:23] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) [15:33:08] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:33:28] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:33:36] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route check 2 services: maintenance [15:33:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check 2 services: maintenance [15:37:19] (03PS2) 10JHathaway: CI runner: skip helm library charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/888275 [15:37:21] (03PS3) 10JHathaway: Add jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 [15:37:23] (03PS2) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 [15:39:01] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: maintenance [15:39:02] !log eevans@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool sessionstore in codfw: maintenance [15:39:42] (03PS1) 10Ssingh: P:dns::auth::update: add safe.directory for gdnsd bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/889560 (https://phabricator.wikimedia.org/T321309) [15:39:48] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route check 2 services: maintenance [15:39:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check 2 services: maintenance [15:40:50] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39631/console" [puppet] - 10https://gerrit.wikimedia.org/r/889560 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:40:57] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:40:58] (KubernetesCalicoDown) resolved: ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:41:29] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: maintenance [15:41:30] !log eevans@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool sessionstore in codfw: maintenance [15:42:15] (03CR) 10CI reject: [V: 04-1] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (owner: 10JHathaway) [15:42:20] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:42:40] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (10LSobanski) [15:45:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44672 and previous config saved to /var/cache/conftool/dbconfig/20230215-154555-ladsgroup.json [15:47:33] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::recursor: skip installation of prometheus-pdns-rec-exporter [puppet] - 10https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:48:54] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:49:00] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:49:28] (03PS1) 10Jgreen: Remove icinga monitoring of frpm1001.frack.eqiad.wmnet for decom. [puppet] - 10https://gerrit.wikimedia.org/r/889561 (https://phabricator.wikimedia.org/T329752) [15:49:55] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:53:16] (03CR) 10Jgreen: [C: 03+2] Remove icinga monitoring of frpm1001.frack.eqiad.wmnet for decom. 
[puppet] - 10https://gerrit.wikimedia.org/r/889561 (https://phabricator.wikimedia.org/T329752) (owner: 10Jgreen) [15:55:22] (03PS1) 10Krinkle: build: Restore 'buildConfigCache' call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889562 (https://phabricator.wikimedia.org/T329518) [15:56:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 35467 [15:56:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 35467 [16:01:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44673 and previous config saved to /var/cache/conftool/dbconfig/20230215-160100-ladsgroup.json [16:02:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889562 (https://phabricator.wikimedia.org/T329518) (owner: 10Krinkle) [16:02:11] (03PS1) 10Jbond: posix_acl: add module to manage posix file system ACLs [puppet] - 10https://gerrit.wikimedia.org/r/889563 (https://phabricator.wikimedia.org/T113979) [16:03:28] jouncebot: now [16:03:28] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [16:03:30] !log restart ci jenkins for updates [16:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:21] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for an-airflow1005.eqiad.wmnet [16:12:21] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-airflow1005.eqiad.wmnet [16:12:53] !log bking@cumin1001 START - Cookbook sre.ganeti.reimage for host an-airflow1005.eqiad.wmnet with OS buster [16:16:35] !log installing gnutls28 security updates [16:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:30] (03PS1) 10Filippo Giunchedi: alertmanager: tweak default incident text/description [puppet] - 
10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) [16:18:03] (03CR) 10CI reject: [V: 04-1] alertmanager: tweak default incident text/description [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) (owner: 10Filippo Giunchedi) [16:19:02] (03PS2) 10Filippo Giunchedi: alertmanager: tweak default incident text/description [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) [16:19:05] fixing row alignment CI errors, yum yum [16:21:29] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage [16:23:15] (03PS1) 10Ssingh: Revert "Revert "Revert "hiera: temporarily remove references to dns4004""" [puppet] - 10https://gerrit.wikimedia.org/r/889283 [16:23:34] what's the record for the most reverts? :P [16:24:01] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage [16:25:01] (03CR) 10CDanis: [C: 03+1] alertmanager: tweak default incident text/description [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) (owner: 10Filippo Giunchedi) [16:28:48] sukhe: my favourite is gotta be this guy Revert "Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""" [16:29:11] will try/not try to beat this :) [16:29:26] I think the record is like six [16:30:22] cdanis: thank you for the review -- I'm not feeling lucky today so I'll merge next thing tomorrow EU morning [16:30:32] though if you do feel free to merge [16:30:38] godog: ok! 
waiting sounds good to me [16:30:54] * godog nods [16:31:04] godog: one other thing, I started wondering if there was some way to write a multiline text/template definition and then strip newlines at the end, but then I got a headache [16:31:09] like robert-redford-nod.gifv [16:31:31] cdanis: yeah easy to believe :| afaict there isn't but I'd love to be wrong [16:31:38] yeah I did not see an easy way [16:36:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:37:12] ^ expected [16:38:58] (KubernetesCalicoDown) firing: aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:42:58] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas [16:43:55] !log restarting Exim on MXes to pick up gnutls security updates [16:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:58] (KubernetesCalicoDown) resolved: aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:45:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas [16:46:42] 10SRE, 10serviceops, 10CommRel-Specialists-Support 
(Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) [16:50:21] (03PS1) 10Vgutierrez: icinga: Use check_ssl_http_letsencrypt for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/889571 [16:52:22] !log installing postgresql-11 security updates [16:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:27] (03PS1) 10Btullis: Add a postgresql database and user for airflow_search_platform [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) [16:54:44] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39633/console" [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [16:54:47] (03CR) 10Btullis: Add a postgresql database and user for airflow_search_platform [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [16:56:51] PROBLEM - Check systemd state on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:41] RECOVERY - Check systemd state on aux-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889560 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:08:51] (03CR) 10Muehlenhoff: Purge unused kernels on boot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [17:18:01] !log disable puppet on A:dns-auth: merging CR 889560 [17:18:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:16] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: add safe.directory for gdnsd bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/889560 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:24:12] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10brennen) [17:24:23] !log [done] disable puppet on A:dns-auth: merging CR 889560 [17:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:26] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10demon) [17:26:31] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Krd) https://commons.wikimedia.org/wiki/... [17:30:27] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) Central Notice * Banner updated ** [[ https://meta.wikimedia.org/w/index.php?title=MediaWiki%3ACentral... 
[17:31:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS buster [17:31:18] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster [17:32:02] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack dnsdisc.Discovery attempts to query depooled/disabled dns auth servers - https://phabricator.wikimedia.org/T329773 (10CDanis) [17:35:29] (03PS1) 10Nicolas Fraison: chore(presto): remove useless gc tag PrintGCApplicationConcurrentTime [puppet] - 10https://gerrit.wikimedia.org/r/889580 (https://phabricator.wikimedia.org/T329525) [17:35:31] (03PS1) 10Nicolas Fraison: chore(presto): remove configuration tuning added while trying to increase cluster size [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) [17:35:33] (03PS1) 10Nicolas Fraison: perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) [17:35:35] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:35:35] (03PS1) 10Nicolas Fraison: chore(presto): remove useless kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [17:35:38] ^ expected [17:35:47] please disregard any DNS alerts for the meantime [17:35:59] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:36:01] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:36:13] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:36:15] (03CR) 10Nicolas Fraison: [C: 03+1] Add a postgresql database and user for airflow_search_platform [puppet] - 10https://gerrit.wikimedia.org/r/889572 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [17:37:33] (03CR) 10CI reject: [V: 04-1] chore(presto): remove configuration tuning added while trying to increase cluster size [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [17:38:14] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:39:57] PROBLEM - Host 2620:0:863:1:198:35:26:8 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:22] (03PS2) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [17:41:17] 10Puppet, 10Infrastructure-Foundations: Tidy up the taskgen script - https://phabricator.wikimedia.org/T329777 (10jbond) p:05Triage→03Low [17:41:37] (03PS2) 10Nicolas Fraison: perf(presto): remove some configuration tuning [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) [17:41:39] (03PS2) 10Nicolas Fraison: perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) [17:41:41] (03PS3) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [17:42:05] (03Abandoned) 10Nicolas Fraison: chore(varnishkafa): add site to VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887784 (owner: 10Nicolas Fraison) [17:42:49] (03CR) 10Jbond: Purge unused kernels on boot (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [17:43:14] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:43:35] (03CR) 10CI reject: [V: 04-1] perf(presto): remove some configuration tuning [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [17:44:41] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10thcipriani) @demon looking into what's need on the GitLab side; maybe "just" configuration 😂 [17:47:33] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet [17:48:39] RECOVERY - Host 2620:0:863:1:198:35:26:8 is UP: PING OK - Packet loss = 0%, RTA = 70.92 ms [17:49:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [17:51:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [17:52:49] (03CR) 10Herron: [C: 03+1] alertmanager: tweak default incident text/description [puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) (owner: 10Filippo Giunchedi) [17:52:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet [17:53:15] (03PS1) 10Majavah: P:toolforge: k8s: haproxy: increase max connections [puppet] - 10https://gerrit.wikimedia.org/r/889585 [17:55:24] (03PS2) 10Majavah: P:toolforge: k8s: haproxy: increase max connections [puppet] - 
10https://gerrit.wikimedia.org/r/889585 [17:56:05] (03PS5) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) [17:56:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39639/console" [puppet] - 10https://gerrit.wikimedia.org/r/889585 (owner: 10Majavah) [17:57:18] (03CR) 10JHathaway: Purge unused kernels on boot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [17:58:45] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889585 (owner: 10Majavah) [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1800) [18:00:37] (03CR) 10David Caro: [C: 03+2] P:toolforge: k8s: haproxy: increase max connections [puppet] - 10https://gerrit.wikimedia.org/r/889585 (owner: 10Majavah) [18:03:14] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:08:07] (03CR) 10Muehlenhoff: Purge unused kernels on boot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [18:12:20] !log installing curl security updates on buster [18:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:33] !log installing curl security updates on bullseye (not buster) [18:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:05] PROBLEM - Check systemd state on cp4046 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [18:22:29] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:22:39] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:41] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:41] RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:22:55] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:03] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4004.wikimedia.org with OS buster [18:24:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster completed: - dns4004 (**PASS**) - Downtimed on Icinga/Aler... 
[18:28:14] (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:28:27] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "Revert "hiera: temporarily remove references to dns4004""" [puppet] - 10https://gerrit.wikimedia.org/r/889283 (owner: 10Ssingh) [18:33:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:38:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:41:28] !log dummy entry [18:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:22] (03PS1) 10Ssingh: P:dns::auth::update: remove trailing slash for git safe.directory [puppet] - 10https://gerrit.wikimedia.org/r/889590 (https://phabricator.wikimedia.org/T321309) [18:50:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889590 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:51:44] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39640/console" [puppet] - 10https://gerrit.wikimedia.org/r/889590 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:52:34] (03CR) 10BCornwall: [C: 03+1] P:dns::auth::update: remove trailing slash for git safe.directory [puppet] - 10https://gerrit.wikimedia.org/r/889590 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [18:55:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: remove trailing slash for git safe.directory [puppet] - 10https://gerrit.wikimedia.org/r/889590 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:00:05] dduvall and ^demon: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1900) [19:08:47] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889598 (https://phabricator.wikimedia.org/T325586) [19:08:49] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889598 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [19:09:33] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889598 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [19:15:35] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/889573 (owner: 10Clément Goubert) [19:16:58] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.23 refs T325586 [19:17:02] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586 [19:17:20] !log large spike in undefined property errors. rolling back (T325586) [19:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:26] !log correction: spike may be temporary. 
holding (T325586) [19:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:59] (03CR) 10Ssingh: [C: 03+1] "Didn't pick up dns4004 when it should not have but picked it up when it should :) https://puppet-compiler.wmflabs.org/output/889568/39641/" [puppet] - 10https://gerrit.wikimedia.org/r/889568 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [19:21:46] (03CR) 10RLazarus: [C: 03+1] "Whoops, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/889573 (owner: 10Clément Goubert) [19:23:35] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.23 refs T325586 (duration: 06m 36s) [19:23:39] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586 [19:23:50] !log rolling back due to spike in parsoid errors (T325586) [19:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:01] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.40.0-wmf.23" [19:36:26] (03PS1) 10Dduvall: Revert "group1 wikis to 1.40.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889604 [19:36:28] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.40.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889604 (owner: 10Dduvall) [19:36:33] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route check 2 services: maintenance [19:36:33] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check 2 services: maintenance [19:36:57] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: Depooling while we attempt to reproduce errors — T327954 [19:37:01] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [19:37:03] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.40.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889604 (owner: 10Dduvall) 
[19:39:57] zabe: ty for triaging T329740 <3 [19:39:58] T329740: PHP Notice: Undefined property: stdClass::$href - https://phabricator.wikimedia.org/T329740 [19:42:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: Depooling while we attempt to reproduce errors — T327954 [19:42:04] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [19:48:34] !log setting Cassandra query trace probability to 0.25 on sessionstore cluster, codfw datacenter — T327954 [19:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:38] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [19:57:25] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet [20:02:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2001.codfw.wmnet [20:03:00] yw [20:04:46] !log setting Cassandra query trace probability to 0 (disabled) on sessionstore cluster, codfw datacenter — T327954 [20:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:50] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [20:06:41] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask [20:10:56] 10SRE, 10SRE-OnFire, 10Observability-Alerting: Vopsbot doesn't have channel topic rights - https://phabricator.wikimedia.org/T329791 (10RLazarus) [20:13:19] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search 
/usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [20:14:01] PROBLEM - Checks that the airflow database for airflow search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [20:15:45] RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [20:17:20] (03PS1) 10Majavah: P:mail::mx: drop smtp_ldap_password [puppet] - 10https://gerrit.wikimedia.org/r/889628 [20:23:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet [20:28:37] (03PS1) 10Majavah: P:configmaster:: add conditional for abuse_nets link [puppet] - 10https://gerrit.wikimedia.org/r/889632 [20:28:57] (03CR) 10CI reject: [V: 04-1] P:configmaster:: add conditional for abuse_nets link [puppet] - 10https://gerrit.wikimedia.org/r/889632 (owner: 10Majavah) [20:29:29] (03PS2) 10Majavah: P:configmaster:: add conditional for abuse_nets link [puppet] - 10https://gerrit.wikimedia.org/r/889632 [20:30:28] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Score: "FileBackendError: Iterator page I/O error" on a page on Beta Cluster - https://phabricator.wikimedia.org/T329744 (10TheresNoTime) 05Open→03Resolved Now seems to be working, and it probably(?) had something to do with {T329... 
[20:33:11] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2001.codfw.wmnet [20:36:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:39:15] !log rebooting sessionstore2001 w/o cookbook — T327954 [20:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:18] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [20:49:09] (03PS1) 10Cathal Mooney: Default L2 interfaces to MTU 9212 if not set from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329535) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:01:56] I agree! [21:03:08] If you're bored TheresNoTime I can find some patches to backport :) [21:03:20] herzog: possibly! [21:03:24] hehe [21:04:26] * TheresNoTime will be around for 30 minutes or so if anyone does have any patches :) [21:07:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:09:32] I'm going to deploy a new scap release real quick. 
[21:09:40] sure :) [21:09:47] !log dancy@deploy1002 Installing scap version "4.35.0" for 563 hosts [21:09:57] PROBLEM - IPMI Sensor Status on db2163 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:10:14] !log dancy@deploy1002 Installation of scap version "4.35.0" completed for 563 hosts [21:10:20] Done. [21:11:59] dduvall, o/ quick sync discussion on the parsoid notices? [21:12:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:13:24] subbu: sure! [21:13:48] okay. so, https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/889637 on the parsoid side will fix all the notices. [21:14:06] we can tag it, update vendor and have a new vendor patch merged. [21:14:17] but, we don't know what is involved after that wrt to getting that code deployed / backported. [21:14:58] vendor is branched with everything else, so it should be comparable to doing a normal backport deploy [21:15:02] I assume there is a way for pushing a new vendor release in between train deploys to fix issues ... [21:15:06] i can handle that part if you like [21:15:18] yes, that would be helpful. 
[21:15:29] !log rebooting sessionstore2001 w/o cookbook — T327954 [21:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:33] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954 [21:15:39] yeah, it's all under the same mediawiki tree in the version branch [21:15:51] alright, i will +2 scott's patch and we will get on with getting a new vendor patch in gerrit and merged. [21:16:12] i.e. https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/wmf/1.40.0-wmf.23/.gitmodules#789 [21:16:27] subbu: excellent. thanks! [21:17:25] I don't fully understand the deployment and backporting process .. one of these days, I should probably sit down and get that in my head. :) [21:17:28] gives me a chance to try out `scap backport` which i haven't used in prod yet :) [21:17:51] by the time you understand it, we will hopefully have changed it to something equally obscure and hard to reason about [21:18:00] lol. [21:18:03] (03CR) 10Cwhite: "Even if the performance impact is limited, I think this change is reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/889294 (https://phabricator.wikimedia.org/T329695) (owner: 10Andrea Denisse) [21:18:04] j/k something nice :) [21:19:09] (03CR) 10Cwhite: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/889494 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [21:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:23:55] dduvall: are you about to do a backport? 
:) [21:24:25] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) is CRITICAL: Test Get value for key returned the unexpected status 500 (expecting: 200) https://www.mediawiki.org/wiki/Kask [21:25:06] TheresNoTime: hopefully, yes. fixing vendor so train can proceed :) [21:25:20] er, a fix for parsoid in vendor that is [21:25:42] okay ^^ [21:26:15] RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [21:29:40] (03PS2) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) [21:30:03] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [21:30:56] subbu: i'll be on the lookup for the vendor patch and go from there [21:31:01] *lookout [21:31:23] jouncebot: now [21:31:23] For the next 0 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T2100) [21:32:07] TheresNoTime: sorry, am i holding you up at all? feel free to proceed with backports if there are more [21:32:23] nope, nothing in the queue today, just wanted to check :) [21:32:31] ah, got it. okie dokie [21:33:38] (03CR) 10Raymond Ndibe: "had a bit of set back with this. turns out setup_file and teardown_file doesn't behave in the manner we expect. I eventually had to remove" [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [21:39:02] dduvall, will do. [21:40:36] (03CR) 10Cwhite: [C: 03+1] "Much shorter, nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/889567 (https://phabricator.wikimedia.org/T317240) (owner: 10Filippo Giunchedi) [21:40:45] RECOVERY - IPMI Sensor Status on db2163 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:44:13] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [21:56:08] 10SRE, 10Traffic-Icebox: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 (10BCornwall) 05Open→03Declined As we no longer use ats-tls, this issue doesn't seem relevant any more. [21:57:50] (03PS2) 10Cathal Mooney: Default L2 interfaces to MTU 9212 if not set from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329535) [21:58:17] 10SRE, 10Infrastructure-Foundations, 10netops: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (10cmooney) p:05Triage→03Medium [21:58:27] (03PS3) 10Cathal Mooney: Default L2 interfaces to MTU 9212 if not set from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/889635 (https://phabricator.wikimedia.org/T329799) [21:58:35] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:59:35] 10SRE, 10Traffic, 10Patch-For-Review, 10Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10BCornwall) 05Open→03Resolved Looks like this can be resolved! 
[21:59:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (10cmooney) [22:01:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (10cmooney) The above patch addresses the issue by ensuring Homer adds an MTU of 9212 on any L2 switch ports which don't have... [22:05:31] 10SRE, 10Data-Engineering: Add backend field to webrequest Hive table - https://phabricator.wikimedia.org/T257354 (10BCornwall) [22:06:24] 10SRE, 10Data-Engineering: Add backend field to webrequest Hive table - https://phabricator.wikimedia.org/T257354 (10BCornwall) Untagging traffic as it seems like there's not much we can do here. Please feel free to retag us if that changes! [22:11:21] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Vopsbot doesn't have channel topic rights - https://phabricator.wikimedia.org/T329791 (10Legoktm) We can set the bot to be auto-opped (+O) if that's desired, but the general guidance is to be opped only when necessary (https://libera.chat/g... [22:11:40] (03PS1) 10Dzahn: gitlab: check for a new release string (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/889640 (https://phabricator.wikimedia.org/T323932) [22:12:08] (03CR) 10CI reject: [V: 04-1] gitlab: check for a new release string (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/889640 (https://phabricator.wikimedia.org/T323932) (owner: 10Dzahn) [22:17:05] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route pool sessionstore in codfw: maintenance [22:22:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in codfw: maintenance [22:36:36] (03PS1) 10C.
Scott Ananian: Bump wikimedia/parsoid to 0.17.0-a16 [vendor] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889607 (https://phabricator.wikimedia.org/T329740) [22:36:40] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "thank you JBond and Jaime! I think it's ok to move ahead because before Antoine went on vacation he told me on IRC we agree on the contint" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [22:40:00] (03CR) 10Dduvall: [C: 03+2] Bump wikimedia/parsoid to 0.17.0-a16 [vendor] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889607 (https://phabricator.wikimedia.org/T329740) (owner: 10C. Scott Ananian) [22:41:41] (03CR) 10C. Scott Ananian: "See https://phabricator.wikimedia.org/T325586#8620353 -- we're leaving this to the ops discretion whether to merge and backport a new medi" [vendor] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889607 (https://phabricator.wikimedia.org/T329740) (owner: 10C. Scott Ananian) [22:45:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dduvall@deploy1002 using scap backport" [vendor] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889607 (https://phabricator.wikimedia.org/T329740) (owner: 10C. Scott Ananian) [22:46:11] (03PS1) 10Dzahn: contint: fix contint1002 host name in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/889643 [22:46:33] (03CR) 10Dzahn: [V: 03+2 C: 03+2] contint: fix contint1002 host name in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/889643 (owner: 10Dzahn) [22:48:30] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "on doc hosts: noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [22:51:52] (03PS7) 10Dzahn: doc: add blackbox::check::http monitor [puppet] - 10https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) [22:53:08] (03CR) 10Dzahn: [C: 03+2] "If this triggers we now expect automatic tickets to be created (and only that), and it's new.. so let's confirm it works." 
[puppet] - 10https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:55:24] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.17.0-a16 [vendor] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889607 (https://phabricator.wikimedia.org/T329740) (owner: 10C. Scott Ananian) [22:55:37] (03PS3) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) [22:55:50] !log dduvall@deploy1002 Started scap: Backport for [[gerrit:889607|Bump wikimedia/parsoid to 0.17.0-a16 (T329740)]] [22:55:54] T329740: PHP Notice: Undefined property: stdClass::$href - https://phabricator.wikimedia.org/T329740 [22:57:08] (03Abandoned) 10Dzahn: phabricator: remove all vcs related code [puppet] - 10https://gerrit.wikimedia.org/r/865182 (owner: 10Dzahn) [22:57:42] !log dduvall@deploy1002 cscott and dduvall: Backport for [[gerrit:889607|Bump wikimedia/parsoid to 0.17.0-a16 (T329740)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:58:36] (03CR) 10Raymond Ndibe: "update: setup_file and teardown_file doesn't work on bats 0.4.0 so my initial suspicion was right. I hacked it by using setup (technically s" [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [22:58:50] (03CR) 10Dzahn: [C: 04-2] "I am not sure I will work on this but let me add Janis, just to share that this was a thing."
[puppet] - 10https://gerrit.wikimedia.org/r/695598 (https://phabricator.wikimedia.org/T283764) (owner: 10Dzahn) [23:00:17] (03CR) 10Dzahn: [C: 03+1] gitlab: remove dedicated restore logfile and log to syslog only [puppet] - 10https://gerrit.wikimedia.org/r/889546 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [23:01:55] !log running linter migrate namespace on all wikis (T329764) [23:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:58] T329764: Run the maintenance script linter extension migrateNamespace.php on all wikis - https://phabricator.wikimedia.org/T329764 [23:03:34] (03CR) 10Dzahn: [C: 03+1] "lgtm! for a moment I thought we may have to add aphlict2001 to an SSL cert but we already have only the discovery name in there, which is " [puppet] - 10https://gerrit.wikimedia.org/r/889531 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [23:04:19] (03PS1) 10Ladsgroup: Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/889608 (https://phabricator.wikimedia.org/T329342) [23:04:28] jouncebot: nowandnext [23:04:28] No deployments scheduled for the next 7 hour(s) and 55 minute(s) [23:04:28] In 7 hour(s) and 55 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T0700) [23:04:28] In 7 hour(s) and 55 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230216T0700) [23:04:37] !log dduvall@deploy1002 Finished scap: Backport for [[gerrit:889607|Bump wikimedia/parsoid to 0.17.0-a16 (T329740)]] (duration: 08m 47s) [23:04:38] (03CR) 10Ladsgroup: [C: 03+2] Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/889608 (https://phabricator.wikimedia.org/T329342) (owner: 10Ladsgroup) [23:04:41] T329740: PHP Notice: 
Undefined property: stdClass::$href - https://phabricator.wikimedia.org/T329740 [23:05:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/Linter] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/889608 (https://phabricator.wikimedia.org/T329342) (owner: 10Ladsgroup) [23:06:31] (03Merged) 10jenkins-bot: Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/889608 (https://phabricator.wikimedia.org/T329342) (owner: 10Ladsgroup) [23:06:55] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:889608|Change linter maintenance scripts to use existing config varaibles (T329342)]] [23:06:58] T329342: Enable maintenance Linter data migration scripts for namespace and tag and template - https://phabricator.wikimedia.org/T329342 [23:07:03] jouncebot: now [23:07:03] No deployments scheduled for the next 7 hour(s) and 52 minute(s) [23:07:43] Amir1: let me know when you're done. 
i'm going to re-roll group1 following a parsoid/vendor backport [23:07:59] sure [23:08:46] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:889608|Change linter maintenance scripts to use existing config varaibles (T329342)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [23:08:55] (03PS1) 10Jcrespo: Preparing for release 0.1.6 [software/mediabackups] - 10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157) [23:14:29] (03PS2) 10Jcrespo: Preparing for release 0.1.6 [software/mediabackups] - 10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157) [23:15:07] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:889608|Change linter maintenance scripts to use existing config varaibles (T329342)]] (duration: 08m 12s) [23:15:11] T329342: Enable maintenance Linter data migration scripts for namespace and tag and template - https://phabricator.wikimedia.org/T329342 [23:15:22] dduvall: I'm done [23:15:29] ty [23:15:42] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889648 (https://phabricator.wikimedia.org/T325586) [23:15:44] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889648 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [23:16:19] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889648 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [23:19:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:21:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 
89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:23:38] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.23 refs T325586 [23:23:42] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586 [23:24:43] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:33] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:30:22] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.23 refs T325586 (duration: 06m 43s) [23:30:26] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586 [23:51:42] (03PS2) 10Ladsgroup: build: Restore 'buildConfigCache' call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889562 (https://phabricator.wikimedia.org/T329518) (owner: 10Krinkle) [23:51:46] (03CR) 10Ladsgroup: [C: 03+2] build: Restore 'buildConfigCache' call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889562 (https://phabricator.wikimedia.org/T329518) (owner: 10Krinkle) [23:52:22] (03Merged) 10jenkins-bot: build: Restore 'buildConfigCache' call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889562 (https://phabricator.wikimedia.org/T329518) (owner: 10Krinkle) [23:53:17] rebased ^ no need to sync