[00:06:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/906703 [00:39:33] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/906703 (owner: 10TrainBranchBot) [00:43:47] PROBLEM - Check systemd state on cp2039 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:06] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/906703 (owner: 10TrainBranchBot) [00:56:29] RECOVERY - Check systemd state on cp2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:34] 10ops-codfw, 10Data-Persistence-Backup: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10phaultfinder) [01:26:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T0200) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.4 [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/906704 (https://phabricator.wikimedia.org/T330210) [02:08:10] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.4 [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/906704 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [02:24:35] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.4 [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/906704 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:53:37] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T0300) [03:00:19] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:23] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907560 (https://phabricator.wikimedia.org/T330210) [03:01:25] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907560 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [03:02:07] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907560 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [03:02:31] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.4 refs T330210 [03:02:36] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [03:20:09] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:30:45] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:34:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:39:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:49:41] (NodeTextfileStale) firing: (2) Stale textfile for puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:52:28] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.4 refs T330210 (duration: 49m 57s) [03:52:33] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [03:54:45] !log mwpresync@deploy2002 Pruned MediaWiki: 1.41.0-wmf.2 (duration: 02m 15s) [04:05:45] RECOVERY - dump of db_inventory in codfw on backupmon1001 is OK: Last dump for db_inventory at codfw (db2185) taken on 2023-04-11 03:50:48 (95 KiB, +1.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:08:45] RECOVERY - dump of db_inventory in eqiad on backupmon1001 is OK: Last dump for db_inventory at eqiad (db1115) taken on 2023-04-11 03:58:02 (93 KiB, +1.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:22:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:55:23] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:55:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:00:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:00:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:32:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/907438 (owner: 10Marostegui) [05:39:09] (03PS1) 10Muehlenhoff: Remove access for swak [puppet] - 10https://gerrit.wikimedia.org/r/907710 [05:39:39] 10SRE, 10Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - 
https://phabricator.wikimedia.org/T281219 (10Legoktm) https://gitlab.wikimedia.org/legoktm/planet is what I have so far, the basic structure is in place and it works, but is really rough. The README outlines the remaini... [05:42:15] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for swak [puppet] - 10https://gerrit.wikimedia.org/r/907710 (owner: 10Muehlenhoff) [05:43:55] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Swakiyama out of all services on: 1241 hosts [05:44:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Swakiyama out of all services on: 1241 hosts [05:45:01] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Swakiyama out of all services on: 814 hosts [05:45:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Swakiyama out of all services on: 814 hosts [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T0600). [06:00:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [06:04:23] (03PS1) 10Marostegui: db1124: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907711 (https://phabricator.wikimedia.org/T326206) [06:04:33] (03CR) 10Marostegui: [C: 03+2] packages_wmf.pp: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/907438 (owner: 10Marostegui) [06:04:59] (03CR) 10Marostegui: [C: 03+2] db1124: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907711 (https://phabricator.wikimedia.org/T326206) (owner: 10Marostegui) [06:05:59] (03PS1) 10Marostegui: instances.yaml: Add db1224 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907712 (https://phabricator.wikimedia.org/T326206) [06:06:26] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1224 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907712 (https://phabricator.wikimedia.org/T326206) (owner: 10Marostegui) [06:09:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1224 to dbctl T326206', diff saved to https://phabricator.wikimedia.org/P46244 and previous config saved to /var/cache/conftool/dbconfig/20230411-060922-marostegui.json [06:09:28] T326206: Move db1176 and db2151 to s6 - https://phabricator.wikimedia.org/T326206 [06:09:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46245 and previous config saved to /var/cache/conftool/dbconfig/20230411-060937-root.json [06:10:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 to clone db1210 T326669', diff saved to https://phabricator.wikimedia.org/P46246 and previous config saved to /var/cache/conftool/dbconfig/20230411-061044-marostegui.json [06:10:49] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:12:50] (03PS1) 10Marostegui: db1210: Move it to s5 [puppet] - 10https://gerrit.wikimedia.org/r/907713 
(https://phabricator.wikimedia.org/T326669) [06:13:18] (03CR) 10Marostegui: [C: 03+2] db1210: Move it to s5 [puppet] - 10https://gerrit.wikimedia.org/r/907713 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:14:42] (03PS1) 10Marostegui: instances.yaml: Add db1209 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907714 (https://phabricator.wikimedia.org/T326669) [06:15:12] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1209 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907714 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1209 to dbctl T326206', diff saved to https://phabricator.wikimedia.org/P46248 and previous config saved to /var/cache/conftool/dbconfig/20230411-061642-marostegui.json [06:16:48] T326206: Move db1176 and db2151 to s6 - https://phabricator.wikimedia.org/T326206 [06:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 1%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46249 and previous config saved to /var/cache/conftool/dbconfig/20230411-061755-root.json [06:18:00] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:18:27] (03PS1) 10Marostegui: db1209: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907716 (https://phabricator.wikimedia.org/T326669) [06:18:52] (03CR) 10Marostegui: [C: 03+2] db1209: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907716 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:20:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 37 hosts with reason: Primary switchover s1 T334375 [06:20:23] T334375: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T334375 [06:21:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 37 hosts with reason: Primary switchover s1 T334375 
[06:21:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1163 with weight 0 T334375', diff saved to https://phabricator.wikimedia.org/P46250 and previous config saved to /var/cache/conftool/dbconfig/20230411-062127-root.json [06:21:55] (03PS1) 10Muehlenhoff: Remove obsolete pinning after recent toolsdb migration [puppet] - 10https://gerrit.wikimedia.org/r/907717 [06:22:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/906698 (https://phabricator.wikimedia.org/T334375) (owner: 10Gerrit maintenance bot) [06:23:04] (03CR) 10Marostegui: [C: 03+1] Remove obsolete pinning after recent toolsdb migration [puppet] - 10https://gerrit.wikimedia.org/r/907717 (owner: 10Muehlenhoff) [06:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P46251 and previous config saved to /var/cache/conftool/dbconfig/20230411-062442-root.json [06:33:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 2%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46252 and previous config saved to /var/cache/conftool/dbconfig/20230411-063300-root.json [06:33:06] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:38:44] 10ops-codfw, 10DBA: db1210 ethernet negotiating at 10 Mbps - https://phabricator.wikimedia.org/T334446 (10Marostegui) [06:39:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46254 and previous config saved to /var/cache/conftool/dbconfig/20230411-063947-root.json [06:40:08] (03PS1) 10Muehlenhoff: aqs: Remove use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/907718 [06:40:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907718 (owner: 10Muehlenhoff) [06:42:10] (03PS2) 10Muehlenhoff: aqs: Remove use_nodejs10 [puppet] - 
10https://gerrit.wikimedia.org/r/907718 [06:43:38] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [06:44:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907718 (owner: 10Muehlenhoff) [06:47:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10fgiunchedi) I noticed the weekly "software-update" emails from bgpalerter, can those be disabled ? (i.e. the version check I guess) [06:48:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 3%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46255 and previous config saved to /var/cache/conftool/dbconfig/20230411-064805-root.json [06:48:10] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:50:44] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: bring up kafka-logging1005 with node id 1005 [puppet] - 10https://gerrit.wikimedia.org/r/907505 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [06:50:51] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: stop kafka service on kafka-logging1002 [puppet] - 10https://gerrit.wikimedia.org/r/907504 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [06:53:25] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: decouple template_version and ecs.version [puppet] - 10https://gerrit.wikimedia.org/r/906701 (https://phabricator.wikimedia.org/T292585) (owner: 10Cwhite) [06:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P46256 and previous config saved to /var/cache/conftool/dbconfig/20230411-065452-root.json [06:56:14] !log Starting s1 eqiad failover from db1118 to db1163 - T334375 [06:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:19] T334375: 
Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T334375 [06:56:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1163 to s1 primary T334375', diff saved to https://phabricator.wikimedia.org/P46257 and previous config saved to /var/cache/conftool/dbconfig/20230411-065639-root.json [06:57:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 T334375', diff saved to https://phabricator.wikimedia.org/P46258 and previous config saved to /var/cache/conftool/dbconfig/20230411-065734-marostegui.json [07:00:05] Amir1, Urbanecm, and taavi: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T0700). [07:00:05] Jhs: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [07:00:33] 👋 i'm here [07:00:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46260 and previous config saved to /var/cache/conftool/dbconfig/20230411-070037-root.json [07:01:58] (03PS1) 10Marostegui: db1211: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907720 (https://phabricator.wikimedia.org/T326669) [07:02:23] (03CR) 10Marostegui: [C: 03+2] db1211: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907720 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:03:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 4%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46261 and previous config 
saved to /var/cache/conftool/dbconfig/20230411-070310-root.json [07:03:15] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:03:42] (03PS1) 10Marostegui: instances.yaml: Add db1211 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907721 (https://phabricator.wikimedia.org/T326669) [07:04:26] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: sink notifications for dev/test hosts [puppet] - 10https://gerrit.wikimedia.org/r/906736 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [07:04:56] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1211 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907721 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:06:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1211 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46262 and previous config saved to /var/cache/conftool/dbconfig/20230411-070609-marostegui.json [07:06:19] I can deploy [07:06:35] zabe, yay [07:06:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46263 and previous config saved to /var/cache/conftool/dbconfig/20230411-070641-root.json [07:07:12] (03CR) 10Zabe: [C: 03+2] Add blkwiki to wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906793 (https://phabricator.wikimedia.org/T334351) (owner: 10Jon Harald Søby) [07:08:02] (03Merged) 10jenkins-bot: Add blkwiki to wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906793 (https://phabricator.wikimedia.org/T334351) (owner: 10Jon Harald Søby) [07:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46264 and previous config saved to /var/cache/conftool/dbconfig/20230411-070956-root.json [07:10:05] !log zabe@deploy2002 Started scap: Backport for [[gerrit:906793|Add blkwiki to wgSitename (T334351)]] 
[07:10:10] T334351: Fix sitename for blkwiki - https://phabricator.wikimedia.org/T334351 [07:10:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35467 [07:10:22] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: rename aux-k8s prometheus [puppet] - 10https://gerrit.wikimedia.org/r/906539 (https://phabricator.wikimedia.org/T334192) (owner: 10Filippo Giunchedi) [07:10:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35467 [07:10:45] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 150279 [07:11:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 150279 [07:11:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 393731 [07:11:37] !log zabe@deploy2002 zabe and jhsoby: Backport for [[gerrit:906793|Add blkwiki to wgSitename (T334351)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [07:11:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 393731 [07:12:05] zabe, looks good on 1002 [07:12:35] thanks, syncing [07:14:19] (03Abandoned) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [07:15:14] (03PS1) 10Marostegui: instances.yaml: Remove db1103 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/907789 (https://phabricator.wikimedia.org/T332293) [07:15:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46265 and previous config saved to /var/cache/conftool/dbconfig/20230411-071542-root.json [07:16:12] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1103 from dbctl 
[puppet] - 10https://gerrit.wikimedia.org/r/907789 (https://phabricator.wikimedia.org/T332293) (owner: 10Marostegui) [07:16:47] (03CR) 10Ayounsi: cr-cloud: remove labstore term (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [07:16:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1103 from dbctl T332293', diff saved to https://phabricator.wikimedia.org/P46266 and previous config saved to /var/cache/conftool/dbconfig/20230411-071647-marostegui.json [07:16:52] T332293: decommission db1103.eqiad.wmnet - https://phabricator.wikimedia.org/T332293 [07:18:14] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:906793|Add blkwiki to wgSitename (T334351)]] (duration: 08m 08s) [07:18:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 5%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46267 and previous config saved to /var/cache/conftool/dbconfig/20230411-071815-root.json [07:18:18] T334351: Fix sitename for blkwiki - https://phabricator.wikimedia.org/T334351 [07:18:22] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:18:24] Jhs: should be live :) [07:18:50] yes, confirmed. thanks zabe [07:20:30] (Access port speed <= 100Mbps) resolved: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [07:20:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10ayounsi) Relevant https://github.com/nttgin/BGPalerter/issues/1058 [07:21:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus [07:21:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P46268 and previous config saved to /var/cache/conftool/dbconfig/20230411-072146-root.json [07:23:19] the thanos alert is me btw [07:23:31] PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-aux.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:14] (03PS1) 10Marostegui: db1218: Place it in s1 [puppet] - 10https://gerrit.wikimedia.org/r/907791 (https://phabricator.wikimedia.org/T326669) [07:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46269 and previous config saved to /var/cache/conftool/dbconfig/20230411-072501-root.json [07:25:42] (03CR) 10Marostegui: [C: 03+2] db1218: Place it in s1 [puppet] - 10https://gerrit.wikimedia.org/r/907791 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:28:56] (03PS3) 10Hashar: contint: Jenkins master > controller [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) [07:29:04] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [07:30:31] !log restarting blazegraph on wdqs1007 (stuck for 48hours) [07:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46270 and previous config saved to 
/var/cache/conftool/dbconfig/20230411-073047-root.json [07:30:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1103.eqiad.wmnet [07:31:14] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:31:58] (03PS1) 10Marostegui: mariadb: Decommission db1103 [puppet] - 10https://gerrit.wikimedia.org/r/907792 (https://phabricator.wikimedia.org/T332293) [07:32:51] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus [07:33:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 10%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46271 and previous config saved to /var/cache/conftool/dbconfig/20230411-073319-root.json [07:33:24] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:35:50] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:36:37] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1103 [puppet] - 10https://gerrit.wikimedia.org/r/907792 (https://phabricator.wikimedia.org/T332293) (owner: 10Marostegui) [07:36:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46272 and previous config saved to /var/cache/conftool/dbconfig/20230411-073651-root.json [07:36:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:37:03] (03CR) 10Hashar: [C: 03+1] "Rebased for good measure. PCC https://puppet-compiler.wmflabs.org/output/893412/1712/" [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [07:37:44] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1103.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:39:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1103.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:39:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:39:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1103.eqiad.wmnet [07:39:21] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10elukey) [07:39:51] !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye [07:40:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46273 and previous config saved to /var/cache/conftool/dbconfig/20230411-074006-root.json [07:40:10] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10elukey) [07:40:26] 10ops-eqiad, 10decommission-hardware: decommission db1103.eqiad.wmnet - https://phabricator.wikimedia.org/T332293 
(10Marostegui) a:05Marostegui→03None [07:40:30] 10ops-eqiad, 10decommission-hardware: decommission db1103.eqiad.wmnet - https://phabricator.wikimedia.org/T332293 (10Marostegui) [07:43:18] RECOVERY - Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46274 and previous config saved to /var/cache/conftool/dbconfig/20230411-074552-root.json [07:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 25%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46275 and previous config saved to /var/cache/conftool/dbconfig/20230411-074824-root.json [07:48:29] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:50:08] 10SRE, 10ops-eqiad, 10DBA: db1210 ethernet negotiating at 10 Mbps - https://phabricator.wikimedia.org/T334446 (10Marostegui) [07:51:23] (03CR) 10Elukey: ml-services: FastAPI chart using sextant for ores-legacy service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [07:51:24] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.211 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:51:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P46276 and previous config saved to /var/cache/conftool/dbconfig/20230411-075155-root.json [07:54:43] !log restart haproxy on cp2033 - T334448 [07:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:47] T334448: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 [07:55:12] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P46277 and previous config saved to /var/cache/conftool/dbconfig/20230411-075511-root.json [08:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46278 and previous config saved to /var/cache/conftool/dbconfig/20230411-080057-root.json [08:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 50%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46279 and previous config saved to /var/cache/conftool/dbconfig/20230411-080329-root.json [08:03:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps got worse - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [08:03:34] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:05:14] (03CR) 10Filippo Giunchedi: [C: 03+2] Make deploy-tag compulsory [alerts] - 10https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [08:06:55] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) Pleased to note no disk errors over the weekend. 
[08:07:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46280 and previous config saved to /var/cache/conftool/dbconfig/20230411-080700-root.json [08:10:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46281 and previous config saved to /var/cache/conftool/dbconfig/20230411-081016-root.json [08:15:14] !log About to deploy analytics/refinery (To migrate webrequest load from Oozie to Airflow) [08:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46282 and previous config saved to /var/cache/conftool/dbconfig/20230411-081601-root.json [08:18:31] !log aqu@deploy2002 Started deploy [analytics/refinery@bed78f6] (hadoop-test): Deploy analytics_refinery including last webrequest load scripts in TEST 3nd try [analytics/refinery@bed78f6] [08:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 75%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46283 and previous config saved to /var/cache/conftool/dbconfig/20230411-081834-root.json [08:18:39] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:19:56] !log aqu@deploy2002 Finished deploy [analytics/refinery@bed78f6] (hadoop-test): Deploy analytics_refinery including last webrequest load scripts in TEST 3nd try [analytics/refinery@bed78f6] (duration: 01m 25s) [08:20:04] (03PS2) 10Majavah: cr-cloud: remove labstore term [homer/public] - 10https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) [08:20:21] (03CR) 10Majavah: cr-cloud: remove labstore term (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) (owner: 
10Majavah) [08:20:26] RECOVERY - Check systemd state on alert2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46284 and previous config saved to /var/cache/conftool/dbconfig/20230411-082205-root.json [08:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P46285 and previous config saved to /var/cache/conftool/dbconfig/20230411-082521-root.json [08:31:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46286 and previous config saved to /var/cache/conftool/dbconfig/20230411-083106-root.json [08:32:51] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q3), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [08:33:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1209 (re)pooling @ 100%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46287 and previous config saved to /var/cache/conftool/dbconfig/20230411-083339-root.json [08:33:44] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:36:00] (03PS1) 10Marostegui: mariadb: Productionize db1218 [puppet] - 10https://gerrit.wikimedia.org/r/907796 (https://phabricator.wikimedia.org/T326669) [08:37:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46288 and previous config saved to /var/cache/conftool/dbconfig/20230411-083710-root.json [08:38:32] (03CR) 10Ayounsi: [C: 03+1] tests: check also a special syntax for quotes [software/homer] - 10https://gerrit.wikimedia.org/r/906729 (owner: 
10Volans) [08:39:36] (03CR) 10David Caro: [C: 03+1] cr-cloud: remove clouddb_return term [homer/public] - 10https://gerrit.wikimedia.org/r/907132 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah) [08:39:44] (03CR) 10David Caro: [C: 03+1] cr-cloud: remove labstore term [homer/public] - 10https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [08:40:07] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1218 [puppet] - 10https://gerrit.wikimedia.org/r/907796 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [08:45:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-cloud: remove labstore term [homer/public] - 10https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [08:45:48] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10elukey) @Jclark-ctr thanks! I tried to check the serial console but I still see the error msg about the preserved cache, and I can't really do much on the menu.. the mai... [08:46:50] (03CR) 10Ayounsi: [C: 03+2] cr-cloud: remove clouddb_return term [homer/public] - 10https://gerrit.wikimedia.org/r/907132 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah) [08:47:14] 10SRE, 10MediaWiki-extensions-OAuth, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10doctaxon) @Tgr Trying to login to Wikipedia Library I and a fellow... 
[08:47:26] (03Merged) 10jenkins-bot: cr-cloud: remove clouddb_return term [homer/public] - 10https://gerrit.wikimedia.org/r/907132 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah) [08:50:10] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:44] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [08:52:01] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10doctaxon) [08:52:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P46289 and previous config saved to /var/cache/conftool/dbconfig/20230411-085215-root.json [08:53:15] (03CR) 10Ayounsi: [C: 03+2] cr-cloud: remove labstore term [homer/public] - 10https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [08:53:48] (03Merged) 10jenkins-bot: cr-cloud: remove labstore term [homer/public] - 10https://gerrit.wikimedia.org/r/907133 (https://phabricator.wikimedia.org/T333477) (owner: 10Majavah) [08:56:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122 to clone db1222 T326669', diff saved to https://phabricator.wikimedia.org/P46290 and previous config saved to /var/cache/conftool/dbconfig/20230411-085654-marostegui.json [08:56:58] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:59:33] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! 
In T310980#8765217, @Eevans wrote: > I definitely wouldn't want to attempt an 8 -> 11 upgrade, //combined// with the move to Bullseye (too many moving parts). We might be... [09:00:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph: Allow setting a crush location hook for the rack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [09:00:48] (03CR) 10David Caro: [C: 03+2] ceph: Allow setting a crush location hook for the rack [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [09:03:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46292 and previous config saved to /var/cache/conftool/dbconfig/20230411-090310-root.json [09:04:56] !log installing pcre2 security updates [09:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46293 and previous config saved to /var/cache/conftool/dbconfig/20230411-090720-root.json [09:08:16] (03PS1) 10David Caro: Revert "packages_wmf.pp: Remove support for stretch" [puppet] - 10https://gerrit.wikimedia.org/r/907738 [09:10:10] (03PS2) 10David Caro: Revert "packages_wmf.pp: Remove support for stretch" [puppet] - 10https://gerrit.wikimedia.org/r/907738 [09:10:38] (03CR) 10Marostegui: [C: 03+1] ":(" [puppet] - 10https://gerrit.wikimedia.org/r/907738 (owner: 10David Caro) [09:15:16] (03PS1) 10Marostegui: install_server: Do not reimage db1209 [puppet] - 10https://gerrit.wikimedia.org/r/907802 [09:16:06] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1209 [puppet] - 10https://gerrit.wikimedia.org/r/907802 (owner: 10Marostegui) [09:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 2%: 
Pooling', diff saved to https://phabricator.wikimedia.org/P46294 and previous config saved to /var/cache/conftool/dbconfig/20230411-091815-root.json [09:19:47] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) (owner: 10Majavah) [09:20:34] !log installing nodejs security updates on buster [09:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:27] (03CR) 10Jbond: [C: 03+1] Remove obsolete pinning after recent toolsdb migration [puppet] - 10https://gerrit.wikimedia.org/r/907717 (owner: 10Muehlenhoff) [09:21:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:21:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:22:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P46295 and previous config saved to /var/cache/conftool/dbconfig/20230411-092224-root.json [09:23:49] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:24:06] (03PS1) 10Marostegui: db1222: Move it to s2 [puppet] - 10https://gerrit.wikimedia.org/r/907805 (https://phabricator.wikimedia.org/T326669) [09:24:47] (03CR) 10Marostegui: [C: 03+2] db1222: Move it to s2 [puppet] - 10https://gerrit.wikimedia.org/r/907805 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [09:24:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 9.432 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:25:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.529 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:25:06] (03CR) 10Jbond: [C: 03+2] contint: Jenkins master > controller [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [09:25:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:26:48] hashar: fyi this is merged now ^^ [09:27:07] !log start of watchlist clean up of a user in wikidatawiki (T328501) [09:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:11] T328501: Request to clean my watchlist from articles in namespace 0 and 1 - https://phabricator.wikimedia.org/T328501 [09:27:54] (03CR) 10AikoChou: [C: 03+1] httpbb: Add test cases for trwiki editquality inference services [puppet] - 10https://gerrit.wikimedia.org/r/906687 (https://phabricator.wikimedia.org/T334158) (owner: 10Kevin Bazira) [09:29:02] 10ops-codfw, 10Data-Persistence-Backup: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10jcrespo) @Jhancock.wm any time between 7am and 23:59 UTC is ok to service this host. If it is like a short period of network downtime it can be done right away. If it may be offline for an extended p... 
[09:32:45] (03CR) 10David Caro: [C: 03+2] Revert "packages_wmf.pp: Remove support for stretch" [puppet] - 10https://gerrit.wikimedia.org/r/907738 (owner: 10David Caro) [09:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46296 and previous config saved to /var/cache/conftool/dbconfig/20230411-093320-root.json [09:34:16] 10SRE, 10ops-eqiad, 10DBA: db1210 ethernet negotiating at 10 Mbps - https://phabricator.wikimedia.org/T334446 (10Marostegui) p:05Triage→03Medium [09:34:22] (03CR) 10Elukey: [C: 03+2] httpbb: Add test cases for trwiki editquality inference services [puppet] - 10https://gerrit.wikimedia.org/r/906687 (https://phabricator.wikimedia.org/T334158) (owner: 10Kevin Bazira) [09:34:26] (03CR) 10David Caro: [C: 03+1] "Thanks! \o/" [debs/karma] - 10https://gerrit.wikimedia.org/r/906716 (owner: 10Filippo Giunchedi) [09:36:29] (03CR) 10David Caro: [C: 03+2] "Fyi. the work is almost there https://phabricator.wikimedia.org/T301949" [puppet] - 10https://gerrit.wikimedia.org/r/907738 (owner: 10David Caro) [09:37:44] (03PS7) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [09:38:11] (03CR) 10CI reject: [V: 04-1] P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 (owner: 10Jbond) [09:38:53] (03CR) 10Filippo Giunchedi: [C: 03+2] New upstream release [debs/karma] - 10https://gerrit.wikimedia.org/r/906716 (owner: 10Filippo Giunchedi) [09:38:55] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] New upstream release [debs/karma] - 10https://gerrit.wikimedia.org/r/906716 (owner: 10Filippo Giunchedi) [09:40:06] (03PS8) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [09:44:11] !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye 
[09:47:19] (03CR) 10Volans: [C: 03+2] tests: check also a special syntax for quotes [software/homer] - 10https://gerrit.wikimedia.org/r/906729 (owner: 10Volans) [09:48:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P46297 and previous config saved to /var/cache/conftool/dbconfig/20230411-094825-root.json [09:49:30] (03Merged) 10jenkins-bot: tests: check also a special syntax for quotes [software/homer] - 10https://gerrit.wikimedia.org/r/906729 (owner: 10Volans) [09:49:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40572/console" [puppet] - 10https://gerrit.wikimedia.org/r/907493 (owner: 10Jbond) [09:50:38] (03PS1) 10Marostegui: wmnet: Remove old tendril reference [dns] - 10https://gerrit.wikimedia.org/r/907808 (https://phabricator.wikimedia.org/T297605) [09:51:26] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [09:52:25] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove old tendril reference [dns] - 10https://gerrit.wikimedia.org/r/907808 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [09:52:46] (03CR) 10Ayounsi: Add generic way to create static routes on switches (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/906726 (https://phabricator.wikimedia.org/T334281) (owner: 10Cathal Mooney) [09:52:53] (03PS1) 10Elukey: httpbb: remove tests from liftwing production [puppet] - 10https://gerrit.wikimedia.org/r/907809 [09:54:48] (03PS1) 10Marostegui: site.pp: Productionize db1222 [puppet] - 10https://gerrit.wikimedia.org/r/907811 (https://phabricator.wikimedia.org/T326669) [09:56:07] 10SRE, 10Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10elukey) Needs `profile::base::remove_python2_on_bullseye: false` in hiera to be 
deployed before reimaging, then we should be good to go! Todo: check if the /srv partition content is preserved by partman... [09:56:40] (03PS9) 10Jbond: P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 [09:56:42] (03PS1) 10Jbond: hiera: make netbox common more perfered then common [puppet] - 10https://gerrit.wikimedia.org/r/907812 [09:57:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1000) [10:01:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.337 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:01:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.424 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:02:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40573/console" [puppet] - 10https://gerrit.wikimedia.org/r/907812 (owner: 10Jbond) [10:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46298 and previous config saved to /var/cache/conftool/dbconfig/20230411-100330-root.json [10:03:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] hiera: make netbox common more perfered then common [puppet] - 10https://gerrit.wikimedia.org/r/907812 (owner: 10Jbond) [10:05:30] (03CR) 10Marostegui: [C: 03+2] site.pp: Productionize db1222 [puppet] - 10https://gerrit.wikimedia.org/r/907811 
(https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:09:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40575/console" [puppet] - 10https://gerrit.wikimedia.org/r/907493 (owner: 10Jbond) [10:10:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:10:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:14:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:15:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.889 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:15:14] (03PS1) 10Marostegui: db1115,db2185: Add note [puppet] - 10https://gerrit.wikimedia.org/r/907816 [10:16:12] (03CR) 10Marostegui: [C: 03+2] db1115,db2185: Add note [puppet] - 10https://gerrit.wikimedia.org/r/907816 (owner: 10Marostegui) [10:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46300 and previous config saved to /var/cache/conftool/dbconfig/20230411-101835-root.json [10:18:39] (03PS1) 10Jelto: install_server: start gitlab raids with smaller minimum size [puppet] - 10https://gerrit.wikimedia.org/r/907819 (https://phabricator.wikimedia.org/T330172) [10:18:41] (03CR) 10JMeybohm: [C: 03+1] "LGTM apart from the two nits, but that's up to you really so +1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [10:19:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:20:02] (03PS5) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [10:20:06] (03PS1) 10Marostegui: db1215: To become a zarcillo host [puppet] - 10https://gerrit.wikimedia.org/r/907820 (https://phabricator.wikimedia.org/T326669) [10:20:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:21:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46301 and previous config saved to /var/cache/conftool/dbconfig/20230411-102106-root.json [10:21:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:24:19] (03CR) 10Marostegui: [C: 03+2] db1215: To become a zarcillo host [puppet] - 10https://gerrit.wikimedia.org/r/907820 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:26:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox: add consumers for prefixes and net devices [puppet] - 10https://gerrit.wikimedia.org/r/907493 (owner: 10Jbond) [10:26:26] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:27:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:28:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:29] (03PS1) 10Marostegui: site.pp: Productionize db1215 [puppet] - 10https://gerrit.wikimedia.org/r/907822 (https://phabricator.wikimedia.org/T326669) [10:33:01] (03CR) 10Marostegui: [C: 03+2] site.pp: Productionize db1215 [puppet] - 10https://gerrit.wikimedia.org/r/907822 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:33:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:27] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond) [10:33:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46302 and previous config saved to /var/cache/conftool/dbconfig/20230411-103339-root.json [10:36:07] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [10:36:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46303 and previous config saved to /var/cache/conftool/dbconfig/20230411-103611-root.json [10:36:23] !log 
hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [10:38:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.416 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:48:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P46304 and previous config saved to /var/cache/conftool/dbconfig/20230411-104844-root.json [10:50:48] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 T334447', diff saved to https://phabricator.wikimedia.org/P46305 and previous config saved to /var/cache/conftool/dbconfig/20230411-105100-marostegui.json [10:51:05] T334447: decommission db1107.eqiad.wmnet - https://phabricator.wikimedia.org/T334447 [10:51:07] (03PS1) 10Vgutierrez: varnish: Allow disabling port 80 [puppet] - 10https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) [10:51:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46306 and previous config saved to /var/cache/conftool/dbconfig/20230411-105116-root.json [10:51:53] (03PS1) 10Marostegui: db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907825 (https://phabricator.wikimedia.org/T334447) [10:52:02] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:55:46] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "For completeness: I had to back out of this change (reset master to 0.99 commit) because Karma dropped support for Alertmanager < 0.22 and" [debs/karma] - 10https://gerrit.wikimedia.org/r/906716 (owner: 10Filippo Giunchedi) [10:56:02] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:20] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40581/console" [puppet] - 10https://gerrit.wikimedia.org/r/907824 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:01:05] (03CR) 10Hashar: "The instance most probably still assumes wmflabs as a canonical hostname, then I think, the Hiera setting is solely used for Apache proxy " [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [11:01:52] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46307 and previous config saved to /var/cache/conftool/dbconfig/20230411-110349-root.json [11:04:05] (03PS10) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks 
[cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [11:06:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46308 and previous config saved to /var/cache/conftool/dbconfig/20230411-110621-root.json [11:07:21] (03CR) 10Marostegui: [C: 03+2] db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/907825 (https://phabricator.wikimedia.org/T334447) (owner: 10Marostegui) [11:08:35] (03PS2) 10Samtar: Initial configuration for guwwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907506 (https://phabricator.wikimedia.org/T334394) [11:13:26] (03PS11) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [11:16:03] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [11:16:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:17:18] (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 [11:17:28] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond) [11:18:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P46309 and previous config saved to /var/cache/conftool/dbconfig/20230411-111854-root.json [11:19:28] (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio [cookbooks] - 
10https://gerrit.wikimedia.org/r/906065 [11:21:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46310 and previous config saved to /var/cache/conftool/dbconfig/20230411-112126-root.json [11:22:08] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond) [11:27:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:28:42] (03CR) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [11:29:27] (03CR) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [11:36:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.623 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46311 and previous config saved to /var/cache/conftool/dbconfig/20230411-113631-root.json [11:37:50] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907851 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [11:38:03] (03CR) 10WMDE-Fisch: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/907852 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [11:38:17] (03PS2) 10WMDE-Fisch: Deploy Nearby feature on most wikis [1/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907851 (https://phabricator.wikimedia.org/T334079) [11:38:23] (03PS2) 10WMDE-Fisch: Deploy Nearby feature on most wikis [2/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907852 (https://phabricator.wikimedia.org/T334079) [11:40:11] (03PS3) 10WMDE-Fisch: Deploy Nearby feature on most wikis [1/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907851 (https://phabricator.wikimedia.org/T334079) [11:41:42] (03PS4) 10WMDE-Fisch: Deploy Nearby feature on most wikis [1/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907851 (https://phabricator.wikimedia.org/T334079) [11:42:46] (03PS3) 10WMDE-Fisch: Deploy Nearby feature on most wikis [2/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907852 (https://phabricator.wikimedia.org/T334079) [11:42:58] (03CR) 10David Caro: [C: 03+2] p:cloudceph::osd: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [11:43:45] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: add asincio (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/906065 (owner: 10Jbond) [11:48:45] (03PS10) 10Clément Goubert: P:httpbb: Add monitoring for all mw-on-k8s deployments [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) [11:50:10] (03CR) 10CI reject: [V: 04-1] P:httpbb: Add monitoring for all mw-on-k8s deployments [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [11:51:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46312 and previous config saved to 
/var/cache/conftool/dbconfig/20230411-115137-root.json [11:52:27] (03CR) 10Jbond: Netbox-extra: Add bandit and prospector to CI (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [11:52:39] 10SRE, 10ops-eqiad, 10DBA: db1210 ethernet negotiating at 10 Mbps - https://phabricator.wikimedia.org/T334446 (10Jclark-ctr) a:03Jclark-ctr [11:53:01] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/906066 (owner: 10Jbond) [11:54:55] (03PS11) 10Clément Goubert: P:httpbb: Add monitoring for all mw-on-k8s deployments [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) [11:56:45] (03CR) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [11:56:51] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40588/console" [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [11:57:42] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:59:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.151 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:04] (03PS1) 10Muehlenhoff: Remove support efi-stretch-installer [puppet] - 10https://gerrit.wikimedia.org/r/907868 [12:03:28] 10SRE, 10Machine-Learning-Team: Migrate ml-cache to 
Bullseye - https://phabricator.wikimedia.org/T331712 (10MoritzMuehlenhoff) >>! In T331712#8770704, @elukey wrote: > Todo: check if the /srv partition content is preserved by partman or not. We have an existing reuse-raid1-2dev.cfg Partman recipe, that should... [12:04:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49853 bytes in 5.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:59] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [12:09:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks fine, let's give it a shot!" [puppet] - 10https://gerrit.wikimedia.org/r/907819 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:13:30] (03CR) 10Jelto: [C: 03+2] install_server: start gitlab raids with smaller minimum size [puppet] - 10https://gerrit.wikimedia.org/r/907819 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [12:16:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:16:34] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:16:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:17:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:24:07] !log Setting mw2448.codfw.wmnet to pooled=invalid - T334429 [12:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:13] T334429: mw2448 crashed - https://phabricator.wikimedia.org/T334429 [12:24:20] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2448.*.codfw.wmnet [12:26:42] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:27:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:27:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:27:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:27:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T333332)', diff saved to https://phabricator.wikimedia.org/P46313 and previous config saved to /var/cache/conftool/dbconfig/20230411-122735-ladsgroup.json [12:27:40] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:29:45] 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Clement_Goubert) p:05Triage→03Medium a:03Jhancock.wm [12:29:56] (03PS1) 10Marostegui: wmnet: Change zarcillo cname [dns] - 10https://gerrit.wikimedia.org/r/907873 (https://phabricator.wikimedia.org/T334455) [12:31:13] 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Clement_Goubert) ` cgoubert@mw2448:~$ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | Jan-20-2023 | 16:14:47 | S... 
[12:31:21] (03CR) 10Marostegui: [C: 03+2] wmnet: Change zarcillo cname [dns] - 10https://gerrit.wikimedia.org/r/907873 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:38:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T333332)', diff saved to https://phabricator.wikimedia.org/P46314 and previous config saved to /var/cache/conftool/dbconfig/20230411-123803-ladsgroup.json [12:38:12] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:38:15] (03PS1) 10Marostegui: dbprov1002.cnf: Change db_inventory backup source [puppet] - 10https://gerrit.wikimedia.org/r/907875 (https://phabricator.wikimedia.org/T334455) [12:38:32] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [12:38:50] (03CR) 10Marostegui: "jcrespo, I would appreciate if you can merge this and right after run a db_inventory backup to make sure it is all fine. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/907875 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:41:22] (03CR) 10Jcrespo: "Is it ready now? or do you have to do something before merge?" [puppet] - 10https://gerrit.wikimedia.org/r/907875 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:42:21] (03CR) 10Marostegui: "If there are no writes needed to zarcillo DB (which I assume there aren't since dbbackups was moved to m1), it is ready. 
db1215 currently " [puppet] - 10https://gerrit.wikimedia.org/r/907875 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:45:00] (03CR) 10Jcrespo: [C: 03+2] dbprov1002.cnf: Change db_inventory backup source [puppet] - 10https://gerrit.wikimedia.org/r/907875 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:46:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:46:46] (03PS1) 10EoghanGaffney: Add dummy secrets for Gitlab SSH keys [labs/private] - 10https://gerrit.wikimedia.org/r/907876 [12:50:56] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [12:51:34] (03CR) 10Jcrespo: [C: 03+2] "I got an "Access denied". My guess is it lacks the dump user(s)." [puppet] - 10https://gerrit.wikimedia.org/r/907875 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [12:52:07] (03CR) 10Ottomata: [C: 03+2] ::analytics::refinery::job::druid_load: absent remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906665 (https://phabricator.wikimedia.org/T334095) (owner: 10Mforns) [12:52:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.312 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P46315 and previous config saved to /var/cache/conftool/dbconfig/20230411-125310-ladsgroup.json [12:53:59] (03PS1) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 [12:54:16] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [12:54:33] (03CR) 10CI reject: [V: 04-1] Add keys for sshd-gitlab from the 
secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [12:55:00] (03CR) 10EoghanGaffney: "The dummy keys are added to the labs/private repo in change Ifa3b9ca69a9ac21aa5171220349f0b3d6677e2dc" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [12:55:38] (03CR) 10Svantje Lilienthal: [C: 03+1] Deploy Nearby feature on most wikis [1/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907851 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [12:55:52] (03CR) 10Svantje Lilienthal: [C: 03+1] Deploy Nearby feature on most wikis [2/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907852 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [12:58:05] (03CR) 10Jbond: Netbox-extra: Add bandit and prospector to CI (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [12:58:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] base::standard_packages: remove isc-dhcp-client [puppet] - 10https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [12:59:52] PROBLEM - mediawiki-installation DSH group on mw2448 is CRITICAL: Host mw2448 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1300). [13:00:05] WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1300) [13:00:09] \o/ [13:00:22] (cannae today, sorry!) 
[13:01:19] I can deploy [13:01:34] WMDE-Fisch: are you intentionally disabling Nearby everywhere between the deployments of the first patch and the second? [13:02:00] Yeah, it should not matter too much. [13:02:18] It's only deployed on a few wikis. [13:02:40] And not used very much yet. [13:02:47] ok! [13:02:50] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/907868 (owner: 10Muehlenhoff) [13:02:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907851 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [13:02:56] So a few minutes downtime is okay I thought. [13:03:13] Thanks for deploying. [13:03:40] (03Merged) 10jenkins-bot: Deploy Nearby feature on most wikis [1/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907851 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [13:04:01] !log taavi@deploy2002 Started scap: Backport for [[gerrit:907851|Deploy Nearby feature on most wikis [1/2] (T334079)]] [13:04:07] T334079: Deploy Nearby feature to all wikis without conflicting features - https://phabricator.wikimedia.org/T334079 [13:04:40] (03PS4) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) [13:05:29] !log taavi@deploy2002 taavi and wmde-fisch: Backport for [[gerrit:907851|Deploy Nearby feature on most wikis [1/2] (T334079)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:05:42] WMDE-Fisch: anything to test about on the first patch? 
[13:05:51] Nope go on please :-) [13:05:55] (03PS5) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) [13:06:00] sure [13:07:42] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 265688000 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:08:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P46316 and previous config saved to /var/cache/conftool/dbconfig/20230411-130817-ladsgroup.json [13:08:24] (03PS6) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) [13:09:26] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [13:10:15] (03PS1) 10Ayounsi: Apply black to all python files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/907880 [13:10:56] (03PS3) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [13:11:26] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:907851|Deploy Nearby feature on most wikis [1/2] (T334079)]] (duration: 07m 24s) [13:11:31] T334079: Deploy Nearby feature to all wikis without conflicting features - https://phabricator.wikimedia.org/T334079 [13:11:47] (03CR) 10Slyngshede: "Let's try again, now that logs are rotated more frequently." 
[puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [13:11:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907852 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [13:12:01] (03CR) 10CI reject: [V: 04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [13:12:45] (03Merged) 10jenkins-bot: Deploy Nearby feature on most wikis [2/2] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907852 (https://phabricator.wikimedia.org/T334079) (owner: 10WMDE-Fisch) [13:13:07] !log taavi@deploy2002 Started scap: Backport for [[gerrit:907852|Deploy Nearby feature on most wikis [2/2] (T334079)]] [13:14:36] !log taavi@deploy2002 wmde-fisch and taavi: Backport for [[gerrit:907852|Deploy Nearby feature on most wikis [2/2] (T334079)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:14:47] (03CR) 10Jbond: [C: 03+1] "lgtm small nit" [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [13:14:52] WMDE-Fisch: please test the second patch [13:15:17] (03CR) 10Ayounsi: "That unfortunately doesn't help much with the Prospector errors. They went from Messages Found: 364" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [13:15:49] taavi: Works [13:15:57] Enabled / Disabled as expected. 
[13:16:08] Go on please :-) [13:16:27] ok, syncing [13:21:32] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:907852|Deploy Nearby feature on most wikis [2/2] (T334079)]] (duration: 08m 25s) [13:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:40] T334079: Deploy Nearby feature to all wikis without conflicting features - https://phabricator.wikimedia.org/T334079 [13:22:57] and done [13:23:03] taavi: Works like a charm. Thanks! [13:23:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T333332)', diff saved to https://phabricator.wikimedia.org/P46317 and previous config saved to /var/cache/conftool/dbconfig/20230411-132324-ladsgroup.json [13:23:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance [13:23:30] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1029376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:23:30] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:23:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance [13:23:44] 10SRE, 10ops-eqiad, 10DBA: db1210 ethernet negotiating at 10 Mbps - https://phabricator.wikimedia.org/T334446 (10Jclark-ctr) 05Open→03Resolved Replaced Cable. 
link light is back to 1g [13:23:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T333332)', diff saved to https://phabricator.wikimedia.org/P46318 and previous config saved to /var/cache/conftool/dbconfig/20230411-132348-ladsgroup.json [13:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:50] (03CR) 10Elukey: [C: 03+1] aqs: Remove use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/907718 (owner: 10Muehlenhoff) [13:27:22] 10SRE, 10ops-eqiad, 10DBA: db1210 ethernet negotiating at 10 Mbps - https://phabricator.wikimedia.org/T334446 (10Marostegui) Thank you for being so fast! Confirmed from my side too: ` [999080.252273] tg3 0000:04:00.0 eno8303: Link is down [999128.847759] tg3 0000:04:00.0 eno8303: Link is up at 1000 Mbps, ful... [13:27:25] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 539948424 and 115 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:28:30] (Access port speed <= 100Mbps) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [13:28:38] (03CR) 10Jbond: [C: 04-1] "idea seems fine some errors and an alternate approch suggested." 
[puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:29:33] (03CR) 10Jbond: [C: 03+1] P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [13:32:08] (03PS12) 10Clément Goubert: P:httpbb: Add monitoring for kubernetes services [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) [13:32:30] (03PS2) 10Ayounsi: Apply black to all python files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/907880 [13:33:54] (03PS4) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [13:34:04] (03CR) 10Clément Goubert: P:httpbb: Add monitoring for kubernetes services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [13:34:14] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40589/console" [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [13:34:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T333332)', diff saved to https://phabricator.wikimedia.org/P46319 and previous config saved to /var/cache/conftool/dbconfig/20230411-133425-ladsgroup.json [13:34:31] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:34:57] (03CR) 10CI reject: [V: 04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [13:41:03] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 37032 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:45:27] (03CR) 
10Herron: [C: 03+1] logstash: decouple template_version and ecs.version [puppet] - 10https://gerrit.wikimedia.org/r/906701 (https://phabricator.wikimedia.org/T292585) (owner: 10Cwhite) [13:46:04] !log powercycle analytics1069, down for some days now, host stuck from the mgmt/serial console [13:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:43] RECOVERY - Host analytics1069 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [13:48:54] (03PS1) 10Hashar: ci: rename ci::master role to ci::manager [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) [13:49:00] (03PS1) 10Hashar: ci: split contint hosts to different roles [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) [13:49:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P46320 and previous config saved to /var/cache/conftool/dbconfig/20230411-134932-ladsgroup.json [13:49:53] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [13:49:59] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [13:50:11] brett: hello! o/ Would you be able to help with updating my SSH key on file when you get a chance? 
T334423 Please and thank you :) [13:50:12] T334423: Update SSH key for Mikhail Popov - https://phabricator.wikimedia.org/T334423 [13:50:28] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jbond) [13:51:32] (03PS2) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 [13:51:41] (03CR) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:51:44] (03CR) 10CI reject: [V: 04-1] ci: split contint hosts to different roles [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [13:52:02] (03CR) 10CI reject: [V: 04-1] Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:53:09] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet last ran 14 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:53:29] (03PS3) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 [13:53:56] (03CR) 10CI reject: [V: 04-1] Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:54:58] !log remove old puppet certificates for kafka main brokers from A:kafka-main - T319372 [13:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [13:55:19] (03PS4) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 [13:55:42] (03CR) 10CI reject: [V: 04-1] Add keys for sshd-gitlab from the secrets repo [puppet] - 
10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [13:56:25] (03CR) 10Jelto: [C: 04-1] "looks mostly good, two comments in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:58:00] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Tchanders) [13:58:28] (03PS1) 10David Caro: changelog: prepare for 0.93 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907887 [13:59:06] (03PS1) 10David Caro: debian: add defaults for changelog generation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907888 [14:00:13] (03CR) 10David Caro: [C: 03+2] changelog: prepare for 0.93 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907887 (owner: 10David Caro) [14:00:27] !log Revoking kafka_main-codfw_broker and kafka_main-eqiad_broker puppet CA certs - T319372 [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:32] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [14:00:53] (03PS5) 10EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 [14:01:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Papaul) @Jhancock.wm swap CPU2 with CPU1 and see if the error will report on CPU1 if it does then we will have to replace the CPU. if the error still shows on CPU... 
[14:01:22] (03CR) 10CI reject: [V: 04-1] Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [14:01:28] (03Merged) 10jenkins-bot: changelog: prepare for 0.93 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907887 (owner: 10David Caro) [14:02:20] (03PS2) 10Hashar: ci: rename ci::master role to ci::manager [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) [14:02:53] (03CR) 10Jbond: [C: 03+1] Apply black to all python files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/907880 (owner: 10Ayounsi) [14:04:06] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [14:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P46321 and previous config saved to /var/cache/conftool/dbconfig/20230411-140438-ladsgroup.json [14:04:40] (03PS1) 10David Caro: build: add helper scripts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 [14:05:57] (03PS4) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309) [14:06:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [14:06:57] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40591/console" [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:08:20] (03PS6) 10Jbond: Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [14:08:43] (03PS2) 10Hashar: ci: split contint hosts to 
different roles [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) [14:08:48] (03CR) 10CI reject: [V: 04-1] Add keys for sshd-gitlab from the secrets repo [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [14:09:28] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [14:09:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/907876 (owner: 10EoghanGaffney) [14:10:11] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: lvs/balancer: unify hiera post bullseye upgrade (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:10:55] (03PS7) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) [14:11:04] (03CR) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:12:25] (03CR) 10Jelto: [C: 03+1] Add dummy secrets for Gitlab SSH keys [labs/private] - 10https://gerrit.wikimedia.org/r/907876 (owner: 10EoghanGaffney) [14:12:46] secrets leak!! :O [14:13:04] get the snakeoil! 
[14:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:14:13] (03PS1) 10Jelto: install_server: change device names in gitlab-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/907893 (https://phabricator.wikimedia.org/T330172) [14:16:02] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2448.codfw.wmnet with reason: HW failure [14:16:18] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2448.codfw.wmnet with reason: HW failure [14:16:29] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b7d176fc-f1f5-4faf-a380-3ea6e306f06c) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s)... [14:16:36] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) @elukey I was able to cleared configurations [14:17:30] (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] Add dummy secrets for Gitlab SSH keys [labs/private] - 10https://gerrit.wikimedia.org/r/907876 (owner: 10EoghanGaffney) [14:17:54] (03CR) 10Jelto: "The new recipe mostly works. However the raids use only half of the available space and it seems device names changed. 
So I updated the re" [puppet] - 10https://gerrit.wikimedia.org/r/907893 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [14:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T333332)', diff saved to https://phabricator.wikimedia.org/P46323 and previous config saved to /var/cache/conftool/dbconfig/20230411-141944-ladsgroup.json [14:19:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:19:49] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:20:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:20:17] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) Thanks @Dzahn - I was able to get into wikitech with /FNavas-foundation/ but interestingly, I get nothing back from the reset password. Yet, I was... 
[14:21:31] (03PS3) 10Hashar: ci: rename ci::master role to ci::manager [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) [14:21:46] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [14:23:20] (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/907897 (https://phabricator.wikimedia.org/T321309) [14:24:17] (03PS13) 10Clément Goubert: P:httpbb: Add monitoring for kubernetes services [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) [14:27:29] !log jnuche@deploy2002 Installing scap version "4.49.0" for 590 hosts [14:28:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:28:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:28:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T333332)', diff saved to https://phabricator.wikimedia.org/P46324 and previous config saved to /var/cache/conftool/dbconfig/20230411-142857-ladsgroup.json [14:29:02] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:29:07] !log jnuche@deploy2002 Installing scap version "4.49.0" for 590 hosts [14:29:43] (03PS1) 10Hashar: ci: add secreats for ci::manager and ci::worker roles [labs/private] - 10https://gerrit.wikimedia.org/r/907898 (https://phabricator.wikimedia.org/T254646) [14:30:09] (03CR) 10Hashar: [V: 03+2 C: 03+2] ci: add secreats for ci::manager and ci::worker roles [labs/private] - 10https://gerrit.wikimedia.org/r/907898 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [14:30:28] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable add link frontend in 7th round wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/907899 (https://phabricator.wikimedia.org/T304551) [14:31:40] (03CR) 10Hashar: "The private repository has some credentials in hieradata/role/common/ci/master.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [14:31:50] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [14:33:18] (03PS1) 10Hashar: ci: move role hiera settings up to role/ci.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/907900 [14:34:36] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [14:34:43] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [14:37:40] (03CR) 10Hashar: "PCC looks good https://puppet-compiler.wmflabs.org/output/907885/1718/" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [14:38:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/907893 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [14:38:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T333332)', diff saved to https://phabricator.wikimedia.org/P46325 and previous config saved to /var/cache/conftool/dbconfig/20230411-143854-ladsgroup.json [14:39:08] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:39:42] (03CR) 10Muehlenhoff: [C: 03+2] Remove support efi-stretch-installer [puppet] - 10https://gerrit.wikimedia.org/r/907868 (owner: 10Muehlenhoff) [14:40:36] (03PS2) 10Hashar: ci: add hiera settings for role::ci [labs/private] - 10https://gerrit.wikimedia.org/r/907900 [14:40:50] (03CR) 10Hashar: [V: 03+2 C: 03+2] ci: add hiera 
settings for role::ci [labs/private] - 10https://gerrit.wikimedia.org/r/907900 (owner: 10Hashar) [14:42:50] !log installing Tomcat security updates [14:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:34] (03CR) 10Volans: [C: 04-1] "I'm ok with the general approach, but see inline for the details" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/907880 (owner: 10Ayounsi) [14:43:51] (03PS3) 10Hashar: ci: split contint hosts to different roles [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) [14:44:03] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [14:44:11] (03PS1) 10Kimberly Sarabia: Set up A/B test reqiurement for Zebra [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907902 (https://phabricator.wikimedia.org/T333493) [14:45:11] !log herron@cumin1001 START - Cookbook sre.dns.netbox [14:47:38] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add kafka-logging1005 ipv6 - herron@cumin1001" [14:48:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS buster [14:48:33] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster [14:48:39] (03CR) 10Hashar: [V: 04-1] "I have to give it a bit more thoughts (PCC fails https://puppet-compiler.wmflabs.org/output/907886/1719/ )" [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [14:48:56] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - 
https://phabricator.wikimedia.org/T334479 (10RobH) [14:49:12] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10RobH) [14:49:13] (03PS4) 10JHathaway: Add an in place Debian upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) [14:50:33] (03CR) 10JHathaway: Add an in place Debian upgrade script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:51:07] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add kafka-logging1005 ipv6 - herron@cumin1001" [14:51:07] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:51:11] (03PS9) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:52:06] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10herron) jftr I accepted this diff which came up during an unrelated sre.dns.netbox run ` diff --git a/hosts/mw2448.yaml b/hosts/mw2448.yaml index a58c536..120b45... [14:52:31] (03CR) 10Jelto: [C: 03+2] install_server: change device names in gitlab-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/907893 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [14:53:49] !log paused pageview hourly job. 
[14:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:57] (03CR) 10EoghanGaffney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/907878 (owner: 10EoghanGaffney) [14:54:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P46326 and previous config saved to /var/cache/conftool/dbconfig/20230411-145401-ladsgroup.json [14:57:07] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40593/console" [puppet] - 10https://gerrit.wikimedia.org/r/907897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:58:05] 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey) a:03elukey [14:58:23] 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) I went looking at the failures from yesterday's rclone run. As well as the above wikipedia-ja-local-public.21 I have two further candidates for deletion... [14:59:35] (03PS10) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [14:59:53] (03CR) 10BBlack: Varnish: prefix 403 and 429 with a unique ID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973) (owner: 10Ayounsi) [15:00:47] (03PS1) 10Muehlenhoff: Add a component to provide a forward port of the Puppet 5 agent [puppet] - 10https://gerrit.wikimedia.org/r/907903 (https://phabricator.wikimedia.org/T330495) [15:01:39] !log deploying analytics refinery to update hive pageview hourly table with referer_data field. 
[15:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:49] (03PS11) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [15:03:51] !log ebysans@deploy2002 Started deploy [analytics/refinery@f3389dc]: Update pageview hourly table with referer data field [analytics/refinery@f3389dc] [15:07:18] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) >>! In T330495#8697765, @MoritzMuehlenhoff wrote: > One remaining issue is that the regenerate_certificate() function from Spicerack which gets... [15:07:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Clement_Goubert) @herron Thanks, my bad, I forgot to run `sre.dns.netbox` after setting the node to failed. Adding to the documentation. 
[15:09:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P46327 and previous config saved to /var/cache/conftool/dbconfig/20230411-150907-ladsgroup.json [15:09:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/907903 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [15:09:25] !log ebysans@deploy2002 Finished deploy [analytics/refinery@f3389dc]: Update pageview hourly table with referer data field [analytics/refinery@f3389dc] (duration: 05m 34s) [15:10:30] !log ebysans@deploy2002 Started deploy [analytics/refinery@f3389dc] (thin): Update pageview hourly table with referer data field THIN [analytics/refinery@f3389dc] [15:10:39] !log ebysans@deploy2002 Finished deploy [analytics/refinery@f3389dc] (thin): Update pageview hourly table with referer data field THIN [analytics/refinery@f3389dc] (duration: 00m 08s) [15:11:29] 10SRE, 10WMF-General-or-Unknown: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10TheresNoTime) [15:11:38] !log ebysans@deploy2002 Started deploy [analytics/refinery@f3389dc] (hadoop-test): Update pageview hourly table with referer data field TEST [analytics/refinery@f3389dc] [15:11:42] (03CR) 10Muehlenhoff: [C: 03+2] Add a component to provide a forward port of the Puppet 5 agent [puppet] - 10https://gerrit.wikimedia.org/r/907903 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [15:12:51] (03CR) 10Hashar: [V: 04-1] "I have found the issue, I assumed role::ci::manager and role::ci::worker would lookup from a hieradata/common/role/ci.yaml file." [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:12:52] hashar: puppet-merge asks for a patch of yours in labs-private (ci: add hiera settings for role::ci), shall I merge that one? 
[15:13:04] (03PS4) 10Hashar: ci: split contint hosts to different roles [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) [15:13:06] !log ebysans@deploy2002 Finished deploy [analytics/refinery@f3389dc] (hadoop-test): Update pageview hourly table with referer data field TEST [analytics/refinery@f3389dc] (duration: 01m 28s) [15:13:17] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:13:31] jouncebot: nowandnext [15:13:31] No deployments scheduled for the next 0 hour(s) and 46 minute(s) [15:13:31] In 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1600) [15:14:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [15:14:46] (03CR) 10jenkins-bot: ci: split contint hosts to different roles [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:16:34] 10SRE-swift-storage, 10Patch-For-Review: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10jcrespo) I have at least some of those files, although I am unsure it is the same one as refered here or previous versions of the same name- it will need more research... [15:17:42] 10SRE, 10MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10TheresNoTime) [15:19:36] 10SRE, 10MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10Umherirrender) `mergeMessageFileList.php` is special, see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/905985 - but I do not know if that the real reason for this e... 
[15:20:39] bearloga: ack, I'll get to it shortly! [15:20:48] 10SRE, 10MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10TheresNoTime) >>! In T334484#8771766, @Umherirrender wrote: > `mergeMessageFileList.php` is special, see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/905985 - but I... [15:21:03] 10SRE, 10SRE-Access-Requests: Update SSH key for Mikhail Popov - https://phabricator.wikimedia.org/T334423 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03BCornwall [15:21:54] !log installing xen security updates [15:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:15] (03CR) 10Ilias Sarantopoulos: [C: 03+1] httpbb: remove tests from liftwing production [puppet] - 10https://gerrit.wikimedia.org/r/907809 (owner: 10Elukey) [15:22:30] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: lvs/balancer: unify hiera post bullseye upgrade (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/907897 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:22:45] brett: thank you very much! [15:23:35] hashar: ok to merge your change? 
[15:23:40] Antoine Musso: ci: add hiera settings for role::ci (e852faa) [15:23:54] hi :) [15:24:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T333332)', diff saved to https://phabricator.wikimedia.org/P46328 and previous config saved to /var/cache/conftool/dbconfig/20230411-152413-ladsgroup.json [15:24:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:24:18] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:24:25] I am not even sure it works [15:24:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:24:34] oh wait [15:24:36] this is labs/private [15:24:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T333332)', diff saved to https://phabricator.wikimedia.org/P46329 and previous config saved to /var/cache/conftool/dbconfig/20230411-152438-ladsgroup.json [15:24:42] going to merge then :) [15:24:42] yeah hehe [15:24:43] 10SRE, 10MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10TheresNoTime) p:05Triage→03Unbreak! 
Going to set this to **UBN** — currently unable to run any maintenance scripts on the beta cluster, nor in production [15:25:09] and I think that specific commit is broken anyway [15:25:36] anyway, given it is labs/private.git I don't think it is going to cause any hassle on production / puppet-merge [15:26:17] yep, I misread where it is [15:26:20] sorry for the ping [15:26:23] (03PS11) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [15:26:33] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:26:48] 10SRE, 10MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10taavi) Wikitech instructions are wrong, you don't need to use the `run.php` wrapper. [15:30:42] (03CR) 10Hashar: [C: 04-1] "hieradata/role/common/ci/common.yaml does not work! Maybe it should instead by in hieradata/common/role/ci.yaml." 
[puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:31:32] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:32:47] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:33:02] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:34:37] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage [15:34:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T333332)', diff saved to https://phabricator.wikimedia.org/P46330 and previous config saved to /var/cache/conftool/dbconfig/20230411-153437-ladsgroup.json [15:34:44] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:34:55] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10hnowlan) [15:36:46] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) Thumbor-k8s is now pooled in both datacentres and, some kind of major issue notwithstanding, will remain pooled. Given the sheer age/size of this ticket,... 
[15:36:49] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) 05Open→03Resolved [15:37:50] (03CR) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [15:37:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage [15:38:02] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [15:38:11] (03PS1) 10Vgutierrez: hiera: Increase varnish max_connections to ats-be on eqsin|ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/907912 (https://phabricator.wikimedia.org/T288106) [15:38:15] (03CR) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [15:39:45] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40594/console" [puppet] - 10https://gerrit.wikimedia.org/r/907912 (https://phabricator.wikimedia.org/T288106) (owner: 10Vgutierrez) [15:40:05] 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Ottomata) Interesting. Each purged instance is in a distinct consumer group, yes? What kafka client is it using? 
(Just curious, neither answers will clue me in as to why they stopped working :) ) [15:40:56] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10LSobanski) [15:41:12] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/905722 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [15:43:26] (03Merged) 10jenkins-bot: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [15:44:31] 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10Vgutierrez) purged uses github.com/confluentinc/confluent-kafka-go/kafka [15:45:19] (03PS12) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [15:45:50] 10SRE, 10MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (10TheresNoTime) p:05Unbreak!→03Triage Was that a recent change..? 
[15:46:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40595/console" [puppet] - 10https://gerrit.wikimedia.org/r/907912 (https://phabricator.wikimedia.org/T288106) (owner: 10Vgutierrez) [15:47:42] (03PS7) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) [15:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P46331 and previous config saved to /var/cache/conftool/dbconfig/20230411-154943-ladsgroup.json [15:54:12] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [15:54:36] (03PS1) 10BCornwall: hiera: lvs/interfaces: update lvs3006 iface name [puppet] - 10https://gerrit.wikimedia.org/r/907914 (https://phabricator.wikimedia.org/T321309) [15:54:46] 10SRE, 10Phabricator, 10phabricator maintenance bot, 10serviceops-collab, 10Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (10LSobanski) p:05Triage→03Lowest [15:54:53] 10SRE, 10Phabricator, 10phabricator maintenance bot, 10serviceops-collab, 10Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (10LSobanski) p:05Lowest→03Low [15:55:07] (03CR) 10SBassett: api-gateway: add REST gateway Lua CSP handler (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [15:57:00] (03CR) 10Ssingh: [C: 03+1] hiera: lvs/interfaces: update lvs3006 iface name [puppet] - 
10https://gerrit.wikimedia.org/r/907914 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:58:40] (03CR) 10RLazarus: [C: 03+1] "Very clean, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [15:58:56] (03Merged) 10jenkins-bot: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [16:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:07] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10LSobanski) p:05Triage→03Medium [16:02:09] (03CR) 10RLazarus: [C: 03+1] P:httpbb: Remove absented httpbb_kubernetes_hourly [puppet] - 10https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [16:03:47] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert2001 is CRITICAL: 1.004e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [16:04:42] (03CR) 10Clément Goubert: P:httpbb: Add monitoring for kubernetes services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [16:04:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P46332 and previous config saved to 
/var/cache/conftool/dbconfig/20230411-160450-ladsgroup.json [16:04:52] (03PS14) 10Clément Goubert: P:httpbb: Add monitoring for kubernetes services [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) [16:05:32] (03PS3) 10Clément Goubert: P:httpbb: Remove absented httpbb_kubernetes_hourly [puppet] - 10https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456) [16:05:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1132.eqiad.wmnet with OS buster [16:06:12] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster completed: - an-worker1132 (**PASS... [16:06:46] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-worker1132.eqiad.wmnet with reason: More tests are needed before the host can be added to prod [16:07:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1132.eqiad.wmnet with reason: More tests are needed before the host can be added to prod [16:08:11] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Jelto) [16:08:55] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:09:17] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[16:09:22] (03PS15) 10Clément Goubert: P:httpbb: Add monitoring for kubernetes services [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) [16:10:09] (03PS4) 10Clément Goubert: P:httpbb: Remove absented httpbb_kubernetes_hourly [puppet] - 10https://gerrit.wikimedia.org/r/907848 (https://phabricator.wikimedia.org/T334456) [16:11:37] !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:12:36] !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:14:11] (03PS1) 10Jbond: pcc dosen't like binary data [labs/private] - 10https://gerrit.wikimedia.org/r/907918 [16:14:15] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10elukey) Reimaged the node, but I still see 11 4TB disks and not 12. Mega cli shows 12 phisical disks but only 11 VDs, so probably we'll need to fix it. I downtimed the... [16:16:50] 10SRE, 10serviceops: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (10elukey) Final step - check if we have to migrate deployment-prep or not. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/905954, some hiera settings may need to be added if we want to kee... [16:17:24] (03PS1) 10Jbond: puppet_compiler: improve warning message [puppet] - 10https://gerrit.wikimedia.org/r/907919 [16:17:45] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compiler: improve warning message [puppet] - 10https://gerrit.wikimedia.org/r/907919 (owner: 10Jbond) [16:18:33] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:19:28] !log Disable Puppet/PyBal on lvs3006 in preparation for reimaging - T321309 [16:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:31] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:19:32] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [16:19:35] (CR) BCornwall: [C: +2] hiera: lvs/interfaces: update lvs3006 iface name [puppet] - https://gerrit.wikimedia.org/r/907914 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall) [16:19:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T333332)', diff saved to https://phabricator.wikimedia.org/P46333 and previous config saved to /var/cache/conftool/dbconfig/20230411-161956-ladsgroup.json [16:19:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:20:01] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:20:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:20:17] jbond: Okay to merge modules/puppet_compiler/files/puppet_master_pup-8187.rb.nocheck change? [16:20:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T333332)', diff saved to https://phabricator.wikimedia.org/P46334 and previous config saved to /var/cache/conftool/dbconfig/20230411-162020-ladsgroup.json [16:20:45] brett: yes please [16:21:11] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:21:19] Hey brett and jbond - is it ok if I deploy a security patch from /private or should I wait until the puppet deploy window is done?
[16:21:48] sbassett: no problem from me [16:22:31] sbassett: yes fine [16:22:59] PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [16:23:07] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:23:23] PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:23:23] Thanks [16:23:25] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:23:27] PROBLEM - pybal on lvs3006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:27:25] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@ce3d4d6]: (no justification provided) [16:27:36] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@ce3d4d6]: (no justification provided) (duration: 00m 11s) [16:29:44] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:30:08] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:30:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T333332)', diff saved to https://phabricator.wikimedia.org/P46335 and previous config saved to /var/cache/conftool/dbconfig/20230411-163018-ladsgroup.json [16:30:23] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:32:04] XioNoX: Hi Arzhel! I'm reaching out to ask about the network_flows_internal dataset. We Data Engineering are migrating its jobs to Airflow right now, and I saw that this dataset had one sanitization job implemented but disabled since the beginning until now. 
[16:33:01] mforns: in theory it should only be non PII data [16:33:14] traffic data between our own servers [16:33:23] ok, so you think we can just remove that sanitization part right? [16:33:30] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:33:48] mforns: yep [16:33:54] !log Deployed security mitigation update for T333140 [16:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:59] ok, cool, thanks a lot XioNoX [16:34:03] no pb! [16:39:59] PROBLEM - MegaRAID on an-worker1110 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:45:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P46336 and previous config saved to /var/cache/conftool/dbconfig/20230411-164524-ladsgroup.json [16:45:48] (PS28) KartikMistry: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [16:47:27] (CR) Hnowlan: [C: +2] api-gateway: add REST gateway Lua CSP handler (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan) [16:48:07] (CR) Ottomata: profile::kafka::broker: refactor TLS settings (2 comments) [puppet] - https://gerrit.wikimedia.org/r/905954 (owner: Elukey) [16:49:59] (CR) SBassett: api-gateway: add REST gateway Lua CSP handler (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan) [16:51:17] (PS1) BBlack: Collapse duplicate leading slashes in URIs [puppet] - 
https://gerrit.wikimedia.org/r/907926 [16:51:52] (CR) Hnowlan: [C: +2] api-gateway: add REST gateway Lua CSP handler (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan) [16:53:13] (CR) SBassett: api-gateway: add REST gateway Lua CSP handler (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan) [16:56:20] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) [16:57:12] (PS2) BBlack: Collapse duplicate leading slashes in URIs [puppet] - https://gerrit.wikimedia.org/r/907926 [16:57:35] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) Resolved→Open Let me reopen the ticket then since you don't have access that you should have. We have rotating clinic duty each week to handle open acces... [16:59:43] (PS1) Hnowlan: rest-gateway: fix lua handler [deployment-charts] - https://gerrit.wikimedia.org/r/907928 (https://phabricator.wikimedia.org/T326321) [17:00:02] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) @FNavas-foundation So looking back at the original request it actually said turnilo. Does https://turnilo.wikimedia.org/ work for you? But you also want https://... 
[17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1700) [17:00:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P46337 and previous config saved to /var/cache/conftool/dbconfig/20230411-170031-ladsgroup.json [17:00:43] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Dzahn) a:ssingh→None [17:15:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T333332)', diff saved to https://phabricator.wikimedia.org/P46338 and previous config saved to /var/cache/conftool/dbconfig/20230411-171537-ladsgroup.json [17:15:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [17:15:43] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:15:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [17:16:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T333332)', diff saved to https://phabricator.wikimedia.org/P46339 and previous config saved to /var/cache/conftool/dbconfig/20230411-171600-ladsgroup.json [17:17:31] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3006.esams.wmnet with OS bullseye [17:17:38] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs3006.esams.wmnet with OS bullseye [17:19:57] SRE, Infrastructure-Foundations, Traffic, netbox, Patch-For-Review: Issues converting services from active/passive to active/active - https://phabricator.wikimedia.org/T330084 (jbond) from 
@brandon via irc >it *seems* like that error in the ticket would've only happened if the puppet agent... [17:21:15] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-puppet-ca-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:33] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Ottomata) > Maybe it is because you don't have a kerberos access token. We should verify again with DE what is needed for you. Kerberos access is not required for supe... [17:24:23] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T333332)', diff saved to https://phabricator.wikimedia.org/P46340 and previous config saved to /var/cache/conftool/dbconfig/20230411-172604-ladsgroup.json [17:26:11] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:34:29] (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (esams) [puppet] - 10https://gerrit.wikimedia.org/r/907931 (https://phabricator.wikimedia.org/T321309) [17:38:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:38:57] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3006.esams.wmnet with reason: host reimage [17:39:33] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert2001 is OK: (C)1e+05 gt (W)1e+04 gt 5203 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [17:41:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P46341 and previous config saved to /var/cache/conftool/dbconfig/20230411-174110-ladsgroup.json [17:42:38] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3006.esams.wmnet with reason: host reimage [17:47:01] (03PS13) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [17:51:16] (03CR) 10Stevemunene: [C: 03+1] aqs: Remove use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/907718 (owner: 10Muehlenhoff) [17:52:27] (03PS14) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [17:54:55] (03PS1) 10Jforrester: [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) [17:55:21] (03PS1) 10DCausse: rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 [17:55:36] (03CR) 10CI reject: [V: 04-1] [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907933 (https://phabricator.wikimedia.org/T333926) (owner: 10Jforrester) [17:56:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P46342 and previous config saved to /var/cache/conftool/dbconfig/20230411-175617-ladsgroup.json [17:57:17] (03PS2) 10Jforrester: [Beta Cluster] Replicate WebResponseSetCookie wgHooks migration here too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907933 
(https://phabricator.wikimedia.org/T333926) [17:57:31] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:59:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3006.esams.wmnet with OS bullseye [17:59:34] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs3006.esams.wmnet with OS bullseye completed: - lvs3006 (**PASS**) - Downtimed on Icinga/Aler... [18:00:04] ^demon and hashar: (Dis)respected human, time to deploy MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1800). Please do the needful. [18:00:36] (03PS1) 10Andrew Bogott: Initial puppet setup for cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/907935 (https://phabricator.wikimedia.org/T329863) [18:01:11] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 17 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10leila) [18:01:19] (03CR) 10CI reject: [V: 04-1] Initial puppet setup for cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/907935 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott) [18:02:48] (03PS2) 10Andrew Bogott: Initial puppet setup for cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/907935 (https://phabricator.wikimedia.org/T329863) [18:04:23] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10RobH) [18:04:33] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10RobH) [18:04:46] (03CR) 10Andrew Bogott: [C: 03+2] Initial puppet setup for 
cloudvirtlocal100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/907935 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott) [18:04:53] (03CR) 10RLazarus: [C: 03+1] P:httpbb: Add monitoring for kubernetes services [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [18:06:02] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10RobH) [18:10:26] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10RobH) [18:11:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T333332)', diff saved to https://phabricator.wikimedia.org/P46343 and previous config saved to /var/cache/conftool/dbconfig/20230411-181123-ladsgroup.json [18:11:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:11:29] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:11:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:13:41] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Jgreen) [18:14:01] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) a:03BCornwall fnavas-foundation is a member of analytics-privatedata-users but something is not working or there is still confusion about which user to use. Co... 
[18:16:12] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) >>! In T332650#8747386, @Krinkle w... [18:16:48] (03PS1) 10Zabe: close wowikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907936 (https://phabricator.wikimedia.org/T334482) [18:17:28] (03CR) 10CI reject: [V: 04-1] close wowikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907936 (https://phabricator.wikimedia.org/T334482) (owner: 10Zabe) [18:20:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [18:20:13] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:20:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [18:20:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:20:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46344 and previous config saved to /var/cache/conftool/dbconfig/20230411-182024-ladsgroup.json [18:20:25] (03PS1) 10Gergő Tisza: multi-dc: Improve OAuth URL patterns for routing to primary [puppet] - 10https://gerrit.wikimedia.org/r/907937 (https://phabricator.wikimedia.org/T332650) [18:20:28] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:20:29] (03CR) 10Kamila Součková: [C: 03+1] "LGTM (I guess I could have noticed that earlier...)" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/907928 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [18:20:33] (03PS2) 10Zabe: close wowikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907936 (https://phabricator.wikimedia.org/T334482) [18:21:16] (03PS15) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [18:21:20] (03PS1) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [18:22:45] jouncebot: nowandnext [18:22:45] For the next 1 hour(s) and 37 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T1800) [18:22:45] In 1 hour(s) and 37 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T2000) [18:23:38] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [18:23:47] (03CR) 10Zabe: [C: 03+2] close wowikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907936 (https://phabricator.wikimedia.org/T334482) (owner: 10Zabe) [18:24:35] (03Merged) 10jenkins-bot: close wowikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907936 (https://phabricator.wikimedia.org/T334482) (owner: 10Zabe) [18:24:37] (03PS2) 10Gergő Tisza: multi-dc: Improve OAuth URL patterns for routing to primary [puppet] - 10https://gerrit.wikimedia.org/r/907937 (https://phabricator.wikimedia.org/T332650) [18:24:50] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [18:25:24] !log zabe@deploy2002 Started scap: close wowikiquote (T334482) [18:25:28] T334482: Close wo.wikiquote - https://phabricator.wikimedia.org/T334482 [18:25:30] 
(03CR) 10Dzahn: "@Hashar - the issue comes from including a role inside another role. instead the "common" part between both roles should be a profile. Mak" [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [18:25:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Andrew) ` Booting from BRCM MBA Slot 8A00 v223.0.186.0 Broadcom UNDI PXE-2.1 v223.0.186.0 Copyright (C) 2000-2022 Broadcom Limited Copyright (C) 1997... [18:26:24] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) Would be nice to rely on the Lua plugin for URL de... [18:27:37] (03PS1) 10Majavah: ssh: extract enabled key types to a parameter [puppet] - 10https://gerrit.wikimedia.org/r/907939 [18:27:39] (03PS1) 10Majavah: ssh: add support for using a CA for host keys [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) [18:27:55] (03CR) 10Dzahn: "per your previous comment, should be done once the role is applied. this has happened meanwhile. so adding you back now." 
[puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [18:28:28] (03CR) 10CI reject: [V: 04-1] ssh: add support for using a CA for host keys [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [18:28:39] (03Abandoned) 10Dzahn: switch doc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/893578 (https://phabricator.wikimedia.org/T330963) (owner: 10Dzahn) [18:29:18] (03PS2) 10Majavah: ssh: add support for using a CA for host keys [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) [18:31:38] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907941 (https://phabricator.wikimedia.org/T330210) [18:31:40] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907941 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:32:10] !log zabe@deploy2002 Finished scap: close wowikiquote (T334482) (duration: 06m 46s) [18:32:15] T334482: Close wo.wikiquote - https://phabricator.wikimedia.org/T334482 [18:32:23] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/907941 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:32:28] I'm done (sorry for stealing your window) [18:33:46] (03CR) 10Dzahn: "any thoughts on this? should it still wait?" 
[dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [18:35:45] (03CR) 10Dzahn: [C: 03+1] "Yea, that all makes sense and I am aware it's currenty down and that that we need to work on it regardless and match it to Horizon setting" [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [18:37:43] <^demon> zabe: That shouldn't have been possible, I'm mid-scap.... [18:38:10] (03CR) 10Dzahn: [C: 03+2] miscweb: remove iegreview profile from role/hiera/tests [puppet] - 10https://gerrit.wikimedia.org/r/907509 (https://phabricator.wikimedia.org/T334415) (owner: 10Dzahn) [18:38:17] <^demon> I started scap like a minute before you log'd it. [18:38:20] (03PS2) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [18:38:47] !log demon@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.4 refs T330210 [18:38:52] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [18:39:17] (03PS16) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [18:40:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40599/console" [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [18:41:14] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [18:50:45] !log andrew@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirtlocal1001'] [18:51:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T333332)', diff saved to 
https://phabricator.wikimedia.org/P46345 and previous config saved to /var/cache/conftool/dbconfig/20230411-185110-ladsgroup.json [18:51:11] (03PS1) 10BCornwall: admin: Update SSH key for Mikhail Popov (bearloga) [puppet] - 10https://gerrit.wikimedia.org/r/907946 (https://phabricator.wikimedia.org/T334423) [18:51:14] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:52:18] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [18:55:28] (03PS1) 10Dzahn: deployment_server: remove iegreview with source phabricator [puppet] - 10https://gerrit.wikimedia.org/r/907947 (https://phabricator.wikimedia.org/T334415) [18:57:29] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirtlocal1001'] [18:58:37] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:58:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[18:59:48] !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@d2cd28d]: (no justification provided) [18:59:59] !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@d2cd28d]: (no justification provided) (duration: 00m 11s) [19:00:13] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [19:01:52] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:02:24] (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe) [19:03:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:05:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:05:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:05:10] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:05:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[19:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P46346 and previous config saved to /var/cache/conftool/dbconfig/20230411-190616-ladsgroup.json [19:08:53] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:08:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:10:50] !log andrew@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirtlocal1002'] [19:16:07] (03PS17) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:16:23] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirtlocal1002'] [19:16:32] !log andrew@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirtlocal1003'] [19:16:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:17:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Andrew) I downgraded the firmware on all three hosts but still haven't had dhcp success. 
[19:20:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.646 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:21:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P46347 and previous config saved to /var/cache/conftool/dbconfig/20230411-192122-ladsgroup.json [19:21:56] (PS18) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [19:22:23] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirtlocal1003'] [19:23:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:59] (PS19) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [19:32:09] (PS20) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [19:33:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:56] (CR) Krinkle: [C: +1] multi-dc: Improve OAuth URL patterns for routing to primary [puppet] - https://gerrit.wikimedia.org/r/907937 (https://phabricator.wikimedia.org/T332650) (owner: Gergő Tisza)
[19:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T333332)', diff saved to https://phabricator.wikimedia.org/P46348 and previous config saved to /var/cache/conftool/dbconfig/20230411-193628-ladsgroup.json [19:36:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [19:36:34] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:36:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [19:36:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46349 and previous config saved to /var/cache/conftool/dbconfig/20230411-193640-ladsgroup.json [19:36:54] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (BCornwall) [19:38:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:38:45] (PS1) Ssingh: admin: actually add trizek to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/907965 (https://phabricator.wikimedia.org/T333863) [19:39:12] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (BCornwall) p:Triage→Medium a:KMorgan-WMF Updated to use the template suggested by wikitech. @KMorgan-WMF I don't see your signature on the L3 acknowledgement list;...
[19:40:46] (CR) BCornwall: [C: +1] admin: actually add trizek to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/907965 (https://phabricator.wikimedia.org/T333863) (owner: Ssingh) [19:41:13] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (ssingh) @Trizek-WMF: Apologies, it was pointed out to that I missed adding you to the actual group. This should be fixed soon. Thank you and sorry! [19:42:13] (CR) Ssingh: [C: +1] "+1 and also for the out-of-band confirmation :)" [puppet] - https://gerrit.wikimedia.org/r/907946 (https://phabricator.wikimedia.org/T334423) (owner: BCornwall) [19:43:07] (PS1) BCornwall: admin: kmorgan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/907967 (https://phabricator.wikimedia.org/T334432) [19:43:18] (CR) Raymond Ndibe: maintain_dbusers: move all the files under service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906637 (owner: David Caro) [19:43:20] (PS21) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [19:44:27] (CR) Ssingh: "I am guessing similar to the last one, we should probably restart ATS? Let me know when you plan on deploying and I can take care of that." [puppet] - https://gerrit.wikimedia.org/r/907937 (https://phabricator.wikimedia.org/T332650) (owner: Gergő Tisza) [19:44:51] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (BCornwall) @nettrom_WMF Are you the approving party (manager) of @KMorgan-WMF?
[19:44:55] (CR) Brennen Bearnes: [C: +1] deployment_server: remove iegreview with source phabricator [puppet] - https://gerrit.wikimedia.org/r/907947 (https://phabricator.wikimedia.org/T334415) (owner: Dzahn) [19:45:48] (PS2) DCausse: rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - https://gerrit.wikimedia.org/r/907934 [19:46:34] (CR) Ssingh: [C: +2] admin: actually add trizek to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/907965 (https://phabricator.wikimedia.org/T333863) (owner: Ssingh) [19:47:21] (PS22) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [19:55:47] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40600/console" [puppet] - https://gerrit.wikimedia.org/r/905705 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230411T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS.
[20:03:47] PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:12] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [20:05:17] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [20:07:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46350 and previous config saved to /var/cache/conftool/dbconfig/20230411-200720-ladsgroup.json [20:07:25] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:07:50] (PS23) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [20:10:28] (CR) Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905243 (owner: Raymond Ndibe) [20:10:38] (CR) Gergő Tisza: multi-dc: Improve OAuth URL patterns for routing to primary (1 comment) [puppet] - https://gerrit.wikimedia.org/r/907937 (https://phabricator.wikimedia.org/T332650) (owner: Gergő Tisza) [20:18:37] SRE, MediaWiki-extensions-OAuth, The-Wikipedia-Library, Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (Tgr) We have 3M total OAuth errors in the last 30 days...
[20:19:35] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@fcc4c9b]: (no justification provided) [20:19:46] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@fcc4c9b]: (no justification provided) (duration: 00m 11s) [20:22:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P46351 and previous config saved to /var/cache/conftool/dbconfig/20230411-202227-ladsgroup.json [20:24:41] is it too late for me to schedule a config patch in this window? [20:25:12] I can add it to the wiki .. checking if someone might be around to help with deploy ... https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/896104 [20:25:57] (PS7) EoghanGaffney: Add keys for sshd-gitlab from the secrets repo [puppet] - https://gerrit.wikimedia.org/r/907878 (https://phabricator.wikimedia.org/T333840) [20:26:30] (PS5) Subramanya Sastry: Make VE on officewiki use Parsoid directly [mediawiki-config] - https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: Daniel Kinzler) [20:31:40] taavi, RoanKattouw TheresNoTime ... see qn. above.
[20:33:44] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (Dzahn) miscweb1002 removed - buster hosts -1 [20:37:16] (PS1) Bartosz Dziewoński: Only log 'visualEditorFeatureUse' events if 'editAttemptStep' events are being logged [extensions/WikimediaEvents] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/907743 (https://phabricator.wikimedia.org/T334157) [20:37:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P46352 and previous config saved to /var/cache/conftool/dbconfig/20230411-203733-ladsgroup.json [20:39:06] SRE, LDAP-Access-Requests, User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (BCornwall) a:MarcoAurelio [20:39:09] SRE, LDAP-Access-Requests, User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (BCornwall) Open→Stalled [20:39:44] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (BCornwall) Open→In progress [20:39:50] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) This ticket is about getting gerrit1003 into production. T334521 is about not having any gerrit servers on buster. Over there I described the current situation. We can now try to get gerri...
[20:40:01] SRE, SRE-Access-Requests, API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (BCornwall) p:Triage→Medium [20:40:18] SRE, SRE-Access-Requests: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (BCornwall) p:Triage→Medium [20:41:22] SRE, SRE-Access-Requests, User-MarcoAurelio: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (BCornwall) a:MarcoAurelio [20:45:07] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) Stalled→In progress [20:45:10] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (Dzahn) [20:45:32] SRE, SRE-Access-Requests, User-MarcoAurelio: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (BCornwall) Open→Stalled [20:46:01] (CR) BCornwall: [C: +2] admin: Update SSH key for Mikhail Popov (bearloga) [puppet] - https://gerrit.wikimedia.org/r/907946 (https://phabricator.wikimedia.org/T334423) (owner: BCornwall) [20:46:36] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) [20:46:39] SRE, SRE-Access-Requests, Patch-For-Review: Update SSH key for Mikhail Popov - https://phabricator.wikimedia.org/T334423 (BCornwall) In progress→Resolved [20:46:59] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (Dzahn) [20:47:01] Never mind ... I'll schedule this for tomorrow.
[20:49:28] (CR) Ladsgroup: [C: +2] codesearch: Add 'devtools' instance (split from 'operations') [puppet] - https://gerrit.wikimedia.org/r/902881 (https://phabricator.wikimedia.org/T303434) (owner: Krinkle) [20:49:42] (PS3) Ladsgroup: codesearch: Add 'devtools' instance (split from 'operations') [puppet] - https://gerrit.wikimedia.org/r/902881 (https://phabricator.wikimedia.org/T303434) (owner: Krinkle) [20:49:49] (CR) Ladsgroup: [V: +2] codesearch: Add 'devtools' instance (split from 'operations') [puppet] - https://gerrit.wikimedia.org/r/902881 (https://phabricator.wikimedia.org/T303434) (owner: Krinkle) [20:49:59] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) a:LSobanski→Dzahn [20:50:22] SRE, SRE-Access-Requests, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (BCornwall) p:Triage→Medium [20:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T333332)', diff saved to https://phabricator.wikimedia.org/P46353 and previous config saved to /var/cache/conftool/dbconfig/20230411-205239-ladsgroup.json [20:52:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:52:45] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:52:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:59:38] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) gerrit machines have 2 (public) IP addresses by design. For example the current production server has: - 208.80.154.136 (gerrit1001.wikimedia.org) - 208.80.154.137 (gerrit.wikimedia.org) N...
[21:06:09] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) that being said.. It might not work to use 208.80.154.135 and 208.80.153.107 on the same interface if they are in different subnets [21:06:49] PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:13:02] (PS1) Andrew Bogott: Added fake profile::toolforge::disable_tool::disable_tool_db_password [labs/private] - https://gerrit.wikimedia.org/r/907982 (https://phabricator.wikimedia.org/T332514) [21:13:47] (CR) Andrew Bogott: [V: +2 C: +2] Added fake profile::toolforge::disable_tool::disable_tool_db_password [labs/private] - https://gerrit.wikimedia.org/r/907982 (https://phabricator.wikimedia.org/T332514) (owner: Andrew Bogott) [21:14:32] (PS1) Andrew Bogott: Add database config for disable_tool process [puppet] - https://gerrit.wikimedia.org/r/907983 (https://phabricator.wikimedia.org/T332514) [21:14:54] (CR) CI reject: [V: -1] Add database config for disable_tool process [puppet] - https://gerrit.wikimedia.org/r/907983 (https://phabricator.wikimedia.org/T332514) (owner: Andrew Bogott) [21:16:58] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (BCornwall) @FNavas-foundation let's make sure I understand: * You can access superset with FNavas-foundation * Your need is to access Turnilo **and** superset * `analy...
[21:18:08] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (BCornwall) a:BCornwall→FNavas-foundation [21:18:41] (PS2) Andrew Bogott: Add database config for disable_tool process [puppet] - https://gerrit.wikimedia.org/r/907983 (https://phabricator.wikimedia.org/T332514) [21:19:04] (CR) CI reject: [V: -1] Add database config for disable_tool process [puppet] - https://gerrit.wikimedia.org/r/907983 (https://phabricator.wikimedia.org/T332514) (owner: Andrew Bogott) [21:20:08] (PS3) Andrew Bogott: Add database config for disable_tool process [puppet] - https://gerrit.wikimedia.org/r/907983 (https://phabricator.wikimedia.org/T332514) [21:20:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [21:20:41] (PS1) Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979) [21:20:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [21:20:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46354 and previous config saved to /var/cache/conftool/dbconfig/20230411-212053-ladsgroup.json [21:20:58] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [21:21:46] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:27] (PS1) Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus5002 [puppet] - https://gerrit.wikimedia.org/r/907985
(https://phabricator.wikimedia.org/T309979) [21:25:32] (JobUnavailable) firing: Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:27:42] (PS1) Andrea Denisse: prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979) [21:32:23] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) [21:34:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:35:31] (PS3) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [21:35:32] (JobUnavailable) resolved: Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:35:33] (PS24) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [21:35:35] (PS1) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991 [21:36:16] (CR) CI reject: [V: -1] environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991 (owner: Jbond) [21:36:45] PROBLEM - PHP opcache health on mw2351 is CRITICAL: CRITICAL: opcache full on php 7.4.
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:36:46] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:38:33] (PS25) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [21:39:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:08] (CR) CI reject: [V: -1] wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: Jbond) [21:47:23] (PS26) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [21:50:34] (PS1) Bking: airflow: Make Data Engineering primary contact [puppet] - https://gerrit.wikimedia.org/r/907992 (https://phabricator.wikimedia.org/T334522) [21:51:07] (PS27) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [21:51:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46355 and previous config saved to /var/cache/conftool/dbconfig/20230411-215132-ladsgroup.json [21:51:38] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [21:52:28] (PS28) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 [21:52:58] SRE, ops-codfw, DC-Ops, serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Jhancock.wm) @Papaul swapped CPU1 and CPU2.
all of the DIMM have been reseated. Powered back on and log has been cleared. [21:54:28] (CR) Jbond: puppetserver: (WIP) add basic class for puppert server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond) [21:56:42] (CR) Jbond: puppetserver: (WIP) add basic class for puppert server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond) [22:04:41] (CR) Dzahn: [C: +2] deployment_server: remove iegreview with source phabricator [puppet] - https://gerrit.wikimedia.org/r/907947 (https://phabricator.wikimedia.org/T334415) (owner: Dzahn) [22:06:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P46356 and previous config saved to /var/cache/conftool/dbconfig/20230411-220638-ladsgroup.json [22:17:25] (PS1) Jbond: cloud - hiera: allow ops and sre-admins to login to all puppet-diff servers [puppet] - https://gerrit.wikimedia.org/r/907994 [22:18:03] (CR) Dzahn: [C: +2] "copied from bastion host project, restricted bastion" [puppet] - https://gerrit.wikimedia.org/r/907994 (owner: Jbond) [22:19:28] SRE, ops-codfw, DC-Ops, serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Papaul) @Jhancock.wm thank you.
We will leave the task open until the end of the week to see if we do have any errors on CPU1 [22:21:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P46357 and previous config saved to /var/cache/conftool/dbconfig/20230411-222145-ladsgroup.json [22:35:21] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40605/console" [puppet] - https://gerrit.wikimedia.org/r/905705 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse) [22:36:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T333332)', diff saved to https://phabricator.wikimedia.org/P46358 and previous config saved to /var/cache/conftool/dbconfig/20230411-223651-ladsgroup.json [22:36:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:36:57] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [22:37:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:37:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:37:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:37:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T333332)', diff saved to https://phabricator.wikimedia.org/P46359 and previous config saved to /var/cache/conftool/dbconfig/20230411-223732-ladsgroup.json [22:53:11] PROBLEM - eventlogging Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:53:11] PROBLEM - Webrequests Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:53:11] PROBLEM - statsv Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:53:11] PROBLEM - Check systemd state on cp3056 is CRITICAL: CRITICAL - degraded: The following units failed: varnishkafka-eventlogging.service,varnishkafka-statsv.service,varnishkafka-webrequest.service,varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:51] PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:51] RECOVERY - eventlogging Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:54:51] RECOVERY - Webrequests Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:54:51] RECOVERY - statsv Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [22:54:51] RECOVERY - Check systemd state on cp3056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:07] RECOVERY - Check systemd state 
on cp3058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:16] SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (leila) [23:03:21] SRE, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (leila) removed #Research tag, added T334511 as a subtask for us to take care of one item we should help you all with. I'm coordinating with... [23:04:35] PROBLEM - eventlogging Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:04:37] PROBLEM - Webrequests Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:04:37] PROBLEM - Check systemd state on cp3056 is CRITICAL: CRITICAL - degraded: The following units failed: varnishkafka-eventlogging.service,varnishkafka-statsv.service,varnishkafka-webrequest.service,varnishmtail@default.service,varnishmtail@internal.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:37] PROBLEM - statsv Varnishkafka log producer on cp3056 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:06:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T333332)', diff saved to https://phabricator.wikimedia.org/P46360 and previous config saved to /var/cache/conftool/dbconfig/20230411-230643-ladsgroup.json [23:06:48] T333332: Add af_actor/afh_actor
fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [23:21:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P46361 and previous config saved to /var/cache/conftool/dbconfig/20230411-232149-ladsgroup.json [23:25:41] RECOVERY - eventlogging Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:25:41] RECOVERY - Check systemd state on cp3056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:41] RECOVERY - statsv Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:25:41] RECOVERY - Webrequests Varnishkafka log producer on cp3056 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [23:36:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P46362 and previous config saved to /var/cache/conftool/dbconfig/20230411-233655-ladsgroup.json [23:38:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [23:39:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [23:39:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:39:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 
clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:39:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46363 and previous config saved to /var/cache/conftool/dbconfig/20230411-233930-ladsgroup.json [23:39:35] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [23:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T333332)', diff saved to https://phabricator.wikimedia.org/P46364 and previous config saved to /var/cache/conftool/dbconfig/20230411-234038-ladsgroup.json [23:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T333332)', diff saved to https://phabricator.wikimedia.org/P46365 and previous config saved to /var/cache/conftool/dbconfig/20230411-235202-ladsgroup.json [23:52:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [23:52:07] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [23:52:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [23:52:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T333332)', diff saved to https://phabricator.wikimedia.org/P46366 and previous config saved to /var/cache/conftool/dbconfig/20230411-235225-ladsgroup.json [23:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P46367 and previous config saved to /var/cache/conftool/dbconfig/20230411-235544-ladsgroup.json