[00:00:12] !log zabe@deploy1002 zabe: Backport for [[gerrit:917416|Start writing to af_actor/afh_actor everywhere (T334295)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [00:03:33] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:37] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:14] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:917416|Start writing to af_actor/afh_actor everywhere (T334295)]] (duration: 07m 22s) [00:06:18] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295 [00:06:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:07:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:08:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.156 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:11:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:39:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916915 [00:39:17] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916915 (owner: 10TrainBranchBot) [00:58:09] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916915 (owner: 10TrainBranchBot) [01:03:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T336213 (10phaultfinder) [01:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0200) [02:03:24] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:07:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.8 [core] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/916916 (https://phabricator.wikimedia.org/T330214) [02:07:50] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.8 [core] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/916916 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [02:07:54] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:13:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [02:14:49] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:54] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:07] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.8 [core] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/916916 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0300) [03:01:20] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917428 (https://phabricator.wikimedia.org/T330214) [03:01:22] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917428 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [03:02:03] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917428 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [03:02:36] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.8 refs T330214 [03:02:41] T330214: 1.41.0-wmf.8 
deployment blockers - https://phabricator.wikimedia.org/T330214 [03:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:24:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:32:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:33:09] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:33:55] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:50:31] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.8 refs T330214 (duration: 47m 55s) [03:50:35] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 [03:54:41] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:11] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:35] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:37:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:42:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:42:55] PROBLEM - PHP7 rendering on mw1467 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:44:21] RECOVERY - PHP7 rendering on mw1467 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:44:59] PROBLEM - PHP7 jobrunner on mw1467 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:45:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:46:25] RECOVERY - PHP7 jobrunner on mw1467 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:50:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:04:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:19:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:07] 10SRE, 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui) Sorry Papaul, I was already out. 
Let me know when you'd like to tackle this so I can have the host down for you [05:23:50] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Marostegui) @Jhancock.wm sure - let me know when do you want to do this so I can have the host down for you. [05:24:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Primary switchover db_inventory T335014 [05:24:24] T335014: Switchover db1115 -> db1215 - https://phabricator.wikimedia.org/T335014 [05:24:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Primary switchover db_inventory T335014 [05:25:29] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1215 to zarcillo master [puppet] - 10https://gerrit.wikimedia.org/r/917323 (https://phabricator.wikimedia.org/T335014) (owner: 10Marostegui) [05:26:55] (03PS2) 10Marostegui: orchestrator: Change database [puppet] - 10https://gerrit.wikimedia.org/r/917313 (https://phabricator.wikimedia.org/T334455) [05:27:03] (03PS2) 10Marostegui: switchover.py: Replace zarcillo host [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917320 (https://phabricator.wikimedia.org/T334455) [05:28:41] !log Starting db-inventory eqiad failover from db1115 to db1215 - T335014 [05:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:53] (03CR) 10Marostegui: [C: 03+2] orchestrator: Change database [puppet] - 10https://gerrit.wikimedia.org/r/917313 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [05:33:13] (03CR) 10Marostegui: [C: 03+2] switchover.py: Replace zarcillo host [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/917320 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [05:35:33] (03PS1) 10Marostegui: report_users.sh: Change zarcillo host [software] - 
10https://gerrit.wikimedia.org/r/917434 (https://phabricator.wikimedia.org/T334455) [05:36:17] (03PS3) 10Marostegui: .bashrc: Change alias location [puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) (owner: 10Jcrespo) [05:37:21] (03CR) 10Marostegui: [C: 03+2] .bashrc: Change alias location [puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) (owner: 10Jcrespo) [05:38:01] (03PS2) 10Marostegui: report_users.sh: Change zarcillo host [software] - 10https://gerrit.wikimedia.org/r/917434 (https://phabricator.wikimedia.org/T334455) [05:39:04] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Change zarcillo host [software] - 10https://gerrit.wikimedia.org/r/917434 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0600) [06:00:04] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0600). [06:00:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [06:03:24] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:06:34] (03CR) 10Ayounsi: [C: 03+1] "Time to give it a try in prod :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [06:08:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [06:11:53] (03CR) 10Ayounsi: "Post merge. 
But a more sustainable health check is usually to check higher level protocols, for example an HTTP check, so it's closer to t" [puppet] - 10https://gerrit.wikimedia.org/r/917302 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [06:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:13:04] * kart_ updating cxserver.. [06:13:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [06:13:58] (03CR) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [06:14:44] Oh, there is ongoing deployment window. I'll hold my deployment. [06:15:12] (03PS1) 10Marostegui: db1115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/917587 (https://phabricator.wikimedia.org/T334455) [06:15:19] kart_: if it is ours....you can proceed. 
We are not deploying anything [06:15:59] jouncebot: now [06:15:59] For the next 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0600) [06:16:00] For the next 0 hour(s) and 14 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0600) [06:16:11] So the Primary Database switchover one is finished [06:16:18] The MW one, I don't know :) [06:16:24] (03CR) 10Marostegui: [C: 03+2] db1115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/917587 (https://phabricator.wikimedia.org/T334455) (owner: 10Marostegui) [06:17:54] (03PS1) 10Marostegui: install_server: Do not reimage db1213 [puppet] - 10https://gerrit.wikimedia.org/r/917686 [06:18:27] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1213 [puppet] - 10https://gerrit.wikimedia.org/r/917686 (owner: 10Marostegui) [06:18:30] marostegui: Thanks! [06:19:14] (03CR) 10Ayounsi: "one post merge naming comment." 
[homer/public] - 10https://gerrit.wikimedia.org/r/917369 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney) [06:19:34] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove dns2001 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/917364 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [06:23:31] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:29:09] (03PS3) 10Marostegui: Define dummy pass for passwords::excimer_ui_server [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [06:34:01] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 126 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:35:56] (03CR) 10Marostegui: [V: 03+2 C: 03+2] Define dummy pass for passwords::excimer_ui_server [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [06:36:50] (03CR) 10Marostegui: [V: 03+2 C: 03+2] "Done!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [06:36:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [06:46:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/917394 (https://phabricator.wikimedia.org/T94022) (owner: 10Majavah) [06:46:53] (03CR) 10Filippo Giunchedi: [C: 03+2] P:wmcs::toolserver_legacy: convert icinga checks to blackbox probes [puppet] - 10https://gerrit.wikimedia.org/r/917394 (https://phabricator.wikimedia.org/T94022) (owner: 10Majavah) [06:47:30] (03PS3) 10Filippo Giunchedi: P:wmcs::toolserver_legacy: convert icinga checks to blackbox probes [puppet] - 10https://gerrit.wikimedia.org/r/917394 (https://phabricator.wikimedia.org/T94022) (owner: 10Majavah) [06:48:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netmon1003.wikimedia.org [06:49:19] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:53:17] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:54:11] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [06:57:15] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - starting: Late 
bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:06] Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0700). [07:00:06] No Gerrit patches in the queue for this window AFAICS. [07:02:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [07:02:39] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [07:04:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/917358 (https://phabricator.wikimedia.org/T334154) (owner: 10Jbond) [07:12:54] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:13:24] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:19:29] RECOVERY - Check systemd state 
on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:27:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:32:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:37:38] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:38:40] (03CR) 10Muehlenhoff: [C: 03+2] service::node: Remove use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/908226 (owner: 10Muehlenhoff) [07:41:31] (03CR) 10Volans: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [07:44:02] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove obsolete Timeline configuration and fonts submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [07:48:32] (03PS1) 
10Muehlenhoff: Obsolete profile::python37 [puppet] - 10https://gerrit.wikimedia.org/r/917813 [07:49:17] (03CR) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [07:49:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/917813 (owner: 10Muehlenhoff) [07:54:05] (03PS1) 10Jelto: miscweb annualreport: update redirect for 2022 report [puppet] - 10https://gerrit.wikimedia.org/r/917814 (https://phabricator.wikimedia.org/T336217) [07:59:19] (03PS2) 10KartikMistry: Update cxserver to 2023-05-08-134152-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/917338 (https://phabricator.wikimedia.org/T336115) [08:00:06] hashar and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0800). [08:01:06] (03PS3) 10Jelto: miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) [08:02:01] (03CR) 10Jelto: "I added annual.wikimedia.org to the ingress config, as this is used too." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [08:04:11] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Release v3.2.9-wmf2 to production - volans@cumin1001 - T314933 [08:04:15] T314933: Upgrade Netbox to latest 3.2 - https://phabricator.wikimedia.org/T314933 [08:05:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-rw2001.wikimedia.org [08:05:58] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:08:26] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Release v3.2.9-wmf2 to production - volans@cumin1001 - T314933 [08:09:02] (03CR) 10JMeybohm: [C: 04-1] New wikikube service: mediawiki-page-content-change-enrichment - staging (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [08:12:01] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-rw2001.wikimedia.org - jmm@cumin2002" [08:12:01] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-rw2001.wikimedia.org - jmm@cumin2002" [08:12:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:12:45] 10SRE, 10Machine-Learning-Team, 10MinT, 10serviceops, and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10akosiaris) [08:12:54] (JobUnavailable) firing: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:13:46] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: netbox upgrade [08:13:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: netbox upgrade [08:13:53] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Release v3.2.9-wmf2 to production - volans@cumin1001 - T314933 [08:13:57] T314933: Upgrade Netbox to latest 3.2 - https://phabricator.wikimedia.org/T314933 [08:14:58] (03PS2) 10Alexandros Kosiaris: service::catalog: Add machinetranslation service [puppet] - 10https://gerrit.wikimedia.org/r/913152 (https://phabricator.wikimedia.org/T331505) [08:15:51] (03PS2) 10Alexandros Kosiaris: Add machinetranslation service RRs [dns] - 10https://gerrit.wikimedia.org/r/914351 (https://phabricator.wikimedia.org/T331505) [08:16:53] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Release v3.2.9-wmf2 to production - volans@cumin1001 - T314933 [08:17:11] 3.2.9 [08:17:14] wrong tab [08:17:46] (03PS3) 10Alexandros Kosiaris: service::catalog: Add machinetranslation service [puppet] - 10https://gerrit.wikimedia.org/r/913152 (https://phabricator.wikimedia.org/T331505) [08:18:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:19:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:19:21] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-rw2001.wikimedia.org on all recursors [08:19:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-rw2001.wikimedia.org on all recursors [08:19:49] !log jmm@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-rw2001.wikimedia.org - jmm@cumin2002" [08:20:42] (03PS2) 10Barakat Ajadi: CentralNoticeTiming: remove central timing [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) [08:21:10] (03CR) 10Hashar: ci: use an array to manage gitcache repos (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/916514 (owner: 10Hashar) [08:21:36] (03PS4) 10Hashar: ci: use an array to manage gitcache repos [puppet] - 10https://gerrit.wikimedia.org/r/916514 [08:21:38] (03PS5) 10Hashar: ci: add a couple extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/914711 [08:21:40] (03PS4) 10Hashar: ci: rm gitcache absented timers [puppet] - 10https://gerrit.wikimedia.org/r/916515 [08:22:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] service::catalog: Add machinetranslation service [puppet] - 10https://gerrit.wikimedia.org/r/913152 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [08:22:50] (03CR) 10CI reject: [V: 04-1] CentralNoticeTiming: remove central timing [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [08:24:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-rw2001.wikimedia.org - jmm@cumin2002" [08:24:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-rw2001.wikimedia.org [08:26:15] (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/917818 [08:28:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-rw1001.wikimedia.org [08:28:08] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:28:15] (03CR) 10Hashar: "I have applied it to the CI Puppet master and gave it a try on 
integration-agent-docker-1030:" [puppet] - https://gerrit.wikimedia.org/r/914710 (owner: Hashar)
[08:29:24] (CR) Jcrespo: [C: +1] "I confirm both proxies point to db1176/db1217 the first of which is the m5 primary and the original proxy has network traffic while the ne" [dns] - https://gerrit.wikimedia.org/r/917818 (owner: Marostegui)
[08:29:35] (CR) Marostegui: [C: +2] wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/917818 (owner: Marostegui)
[08:29:48] (CR) Arturo Borrero Gonzalez: [C: +2] cloudlb: introduce haproxy check for the BGP VIP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/917302 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[08:30:05] !log Failover m5-master from dbproxy1021 to dbproxy1017
[08:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:50] (CR) Hashar: [C: +1] "I have applied the patch to the CI Puppet master and ran Puppet on integration-agent-docker-1030. It is a noop as expected since this is m" [puppet] - https://gerrit.wikimedia.org/r/916514 (owner: Hashar)
[08:31:59] (PS1) KartikMistry: Update MinT to 2023-05-09-082017-production [deployment-charts] - https://gerrit.wikimedia.org/r/917819 (https://phabricator.wikimedia.org/T335725)
[08:32:17] (PS2) KartikMistry: Update MinT to 2023-05-09-082017-production [deployment-charts] - https://gerrit.wikimedia.org/r/917819 (https://phabricator.wikimedia.org/T335725)
[08:32:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:33:09] (CR) Hashar: "Cherry picked on the CI Puppet master and I ran Puppet on integration-agent-docker-1030.
It created the two repositories:" [puppet] - https://gerrit.wikimedia.org/r/914711 (owner: Hashar)
[08:34:34] (PS1) DCausse: flink-session-cluster: enable rocksdb metrics and increase jvm heap [deployment-charts] - https://gerrit.wikimedia.org/r/917820 (https://phabricator.wikimedia.org/T336134)
[08:35:18] (CR) Hashar: "I have cherry picked it on the CI Puppet master. That is a noop since the two systemd::timer::job already got absented in the previous pat" [puppet] - https://gerrit.wikimedia.org/r/916515 (owner: Hashar)
[08:35:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-rw1001.wikimedia.org - jmm@cumin2002"
[08:36:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-rw1001.wikimedia.org - jmm@cumin2002"
[08:36:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:36:07] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-rw1001.wikimedia.org on all recursors
[08:36:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-rw1001.wikimedia.org on all recursors
[08:36:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-rw1001.wikimedia.org - jmm@cumin2002"
[08:37:08] (CR) Hashar: [C: -1] "I have removed this change from the CI Puppet master in order to have the systemd timers to be removed properly by the Puppet agent.
Once" [puppet] - https://gerrit.wikimedia.org/r/916515 (owner: Hashar)
[08:37:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-rw1001.wikimedia.org - jmm@cumin2002"
[08:37:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-rw1001.wikimedia.org
[08:37:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:38:20] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/914753 (owner: Ayounsi)
[08:38:32] (CR) Ayounsi: [C: +2] Fix multiple pylint inconsistencies [software/netbox-extras] - https://gerrit.wikimedia.org/r/914753 (owner: Ayounsi)
[08:39:41] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[08:39:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[08:40:01] !log Stop mariadb on db1115 (old zarcillo master) T334455
[08:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:05] T334455: Failover all db1115 services to db1215 - https://phabricator.wikimedia.org/T334455
[08:40:41] jouncebot: nowandnext
[08:40:41] For the next 1 hour(s) and 19 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T0800)
[08:40:41] In 1 hour(s) and 19 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1000)
[08:42:44] (PS1) Muehlenhoff: Add ldap-rw[12]001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/917822
(https://phabricator.wikimedia.org/T331699)
[08:42:58] SRE, Infrastructure-Foundations, LDAP, Patch-For-Review: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:43:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1003.eqiad.wmnet
[08:44:55] (PS3) Alexandros Kosiaris: Add machinetranslation service RRs [dns] - https://gerrit.wikimedia.org/r/914351 (https://phabricator.wikimedia.org/T331505)
[08:44:57] (PS12) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[08:46:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1003.eqiad.wmnet
[08:47:35] (CR) Alexandros Kosiaris: [C: +2] Add machinetranslation service RRs [dns] - https://gerrit.wikimedia.org/r/914351 (https://phabricator.wikimedia.org/T331505) (owner: Alexandros Kosiaris)
[08:48:03] SRE, RESTBase, RESTBase-API, Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (akosiaris) Open→Resolved a:akosiaris I am gonna tentatively resolve this in the interest of not leaving a ticket lingering open l...
[08:50:42] (Abandoned) Nikerabbit: Localisation updates from https://translatewiki.net.
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/917308 (owner: L10n-bot)
[08:54:31] (CR) Muehlenhoff: [C: +2] Add ldap-rw[12]001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/917822 (https://phabricator.wikimedia.org/T331699) (owner: Muehlenhoff)
[08:55:38] (CR) Jelto: [C: +1] "lgtm, one naming nit in line" [cookbooks] - https://gerrit.wikimedia.org/r/914748 (owner: EoghanGaffney)
[08:57:24] (PS13) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[08:57:56] (CR) CI reject: [V: -1] Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[09:00:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2003.codfw.wmnet
[09:00:24] (PS14) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[09:00:40] SRE, MW-on-K8s, Traffic, serviceops: Add configuration file support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336037 (Joe) Open→In progress p:Triage→High
[09:00:49] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe)
[09:01:33] (PS14) Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639)
[09:01:35] (CR) Jbond: "thanks updated" [cookbooks] - https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: Jbond)
[09:02:44] (CR) JMeybohm: [C: -1] miscweb: add annualreport release to miscweb (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/915673
(https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[09:02:57] (PS1) Majavah: Use toolforge_weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/917824 (https://phabricator.wikimedia.org/T336057)
[09:03:01] (PS1) Majavah: Add logs action [software/tools-webservice] - https://gerrit.wikimedia.org/r/917825 (https://phabricator.wikimedia.org/T336057)
[09:03:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2003.codfw.wmnet
[09:04:15] (CR) CI reject: [V: -1] Use toolforge_weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/917824 (https://phabricator.wikimedia.org/T336057) (owner: Majavah)
[09:06:00] (CR) JMeybohm: [C: -1] "Forgot: I would have also expected Zookeeper settings and networkpolicies somewhere here" [deployment-charts] - https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: Ottomata)
[09:08:15] because I am slow after the 3 days week-end, I am running the train now
[09:08:19] for group0 promotion
[09:08:23] (PS2) Majavah: Use toolforge_weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/917824 (https://phabricator.wikimedia.org/T336057)
[09:08:25] (PS2) Majavah: Add logs action [software/tools-webservice] - https://gerrit.wikimedia.org/r/917825 (https://phabricator.wikimedia.org/T336057)
[09:08:27] (PS1) Majavah: d/control: Do not manually list Python dependencies [software/tools-webservice] - https://gerrit.wikimedia.org/r/917826
[09:08:31] (CR) Jbond: [C: +2] admin: updatre permisdsions for fr-tech-admins [puppet] - https://gerrit.wikimedia.org/r/917358 (https://phabricator.wikimedia.org/T334154) (owner: Jbond)
[09:09:53] (PS1) TrainBranchBot: group0 wikis to 1.41.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/917827 (https://phabricator.wikimedia.org/T330214)
[09:09:55] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.41.0-wmf.8
[mediawiki-config] - https://gerrit.wikimedia.org/r/917827 (https://phabricator.wikimedia.org/T330214) (owner: TrainBranchBot)
[09:10:42] (Merged) jenkins-bot: group0 wikis to 1.41.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/917827 (https://phabricator.wikimedia.org/T330214) (owner: TrainBranchBot)
[09:10:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[09:11:11] SRE, SRE-Access-Requests, Infrastructure Security, Infrastructure-Foundations, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (jbond) >>! In T334154#8833707, @jbond wrote: > i have created a CR wh...
[09:13:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:15:11] (CR) Arturo Borrero Gonzalez: Add logs action (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/917825 (https://phabricator.wikimedia.org/T336057) (owner: Majavah)
[09:15:43] (CR) Arturo Borrero Gonzalez: [C: +1] Use toolforge_weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/917824 (https://phabricator.wikimedia.org/T336057) (owner: Majavah)
[09:15:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[09:16:36] (CR) Arturo Borrero Gonzalez: [C: +1] d/control: Do not manually list Python dependencies [software/tools-webservice] - https://gerrit.wikimedia.org/r/917826 (owner: Majavah)
[09:17:55] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.8 refs T330214
[09:17:59] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214
[09:18:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:18:53] (CR) Peter Fischer: [C: +1] "LGTM! Hope that gives us some insights."
[deployment-charts] - https://gerrit.wikimedia.org/r/917820 (https://phabricator.wikimedia.org/T336134) (owner: DCausse)
[09:20:02] (CR) Ayounsi: [C: +2] Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[09:20:36] (Merged) jenkins-bot: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[09:21:22] (PS2) Arturo Borrero Gonzalez: cloudcontrol2001-dev: introduce cloudlb support [puppet] - https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153)
[09:23:38] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[09:23:44] (PS4) Jelto: miscweb: add annualreport release to miscweb [deployment-charts] - https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171)
[09:23:46] SRE-swift-storage, serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (eoghan)
[09:23:58] (CR) Ayounsi: [C: +2] netbox: add validators to production host [puppet] - https://gerrit.wikimedia.org/r/900318 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[09:24:07] SRE-swift-storage, serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (eoghan) p:Triage→Medium
[09:24:49] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/914710 (owner: Hashar)
[09:26:03] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/916514 (owner: Hashar)
[09:26:30] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/914711 (owner: Hashar)
[09:28:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[09:28:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[09:28:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:28:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T335845)', diff saved to https://phabricator.wikimedia.org/P47959 and previous config saved to /var/cache/conftool/dbconfig/20230509-092843-ladsgroup.json
[09:28:45] (CR) Jelto: "thanks for the review, answers inline" [deployment-charts] - https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[09:29:18] (PS9) Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759)
[09:29:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[09:29:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[09:29:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[09:30:16] (CR) Jbond: [C: +2] ci: daily update all git cache repositories [puppet] - https://gerrit.wikimedia.org/r/914710 (owner: Hashar)
[09:30:20] (CR) Jbond: [C: +2] ci: use an array to manage gitcache repos [puppet] - https://gerrit.wikimedia.org/r/916514 (owner: Hashar)
[09:30:23] (CR) Jbond: [C: +2] ci: add a couple extensions to git cache [puppet] - https://gerrit.wikimedia.org/r/914711 (owner: Hashar)
[09:31:05]
XioNoX: happy for me to merge yours
[09:31:29] jbond: thx, was waiting for another cookbook but it's good now
[09:31:51] XioNoX: merged
[09:32:08] (CR) Arturo Borrero Gonzalez: "PCC as expected: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41087/console" [puppet] - https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) (owner: Arturo Borrero Gonzalez)
[09:32:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping3003.esams.wmnet
[09:33:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance
[09:33:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance
[09:33:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T335845)', diff saved to https://phabricator.wikimedia.org/P47960 and previous config saved to /var/cache/conftool/dbconfig/20230509-093320-ladsgroup.json
[09:34:14] (CR) Peter Fischer: "Looks reasonable, only stumbled over the missing dashboard URL" [alerts] - https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: Ebernhardson)
[09:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T335845)', diff saved to https://phabricator.wikimedia.org/P47961 and previous config saved to /var/cache/conftool/dbconfig/20230509-093419-ladsgroup.json
[09:35:48] (CR) Majavah: [C: +2] d/control: Do not manually list Python dependencies [software/tools-webservice] - https://gerrit.wikimedia.org/r/917826 (owner: Majavah)
[09:35:51] (CR) Majavah: [C: +2] Use toolforge_weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/917824 (https://phabricator.wikimedia.org/T336057) (owner: Majavah)
[09:36:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
ping3003.esams.wmnet
[09:37:01] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2001-dev.wikimedia.org
[09:37:13] (CR) Majavah: Add logs action (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/917825 (https://phabricator.wikimedia.org/T336057) (owner: Majavah)
[09:37:29] (Merged) jenkins-bot: d/control: Do not manually list Python dependencies [software/tools-webservice] - https://gerrit.wikimedia.org/r/917826 (owner: Majavah)
[09:37:31] (Merged) jenkins-bot: Use toolforge_weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/917824 (https://phabricator.wikimedia.org/T336057) (owner: Majavah)
[09:40:45] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Marostegui)
[09:40:58] (CR) Volans: [C: +1] "Questions inline, lgtm in general" [cookbooks] - https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: Jbond)
[09:41:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T335845)', diff saved to https://phabricator.wikimedia.org/P47962 and previous config saved to /var/cache/conftool/dbconfig/20230509-094100-ladsgroup.json
[09:41:07] SRE, ops-codfw, DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (Ladsgroup) If Manuel is out, feel free to ping me instead.
[09:43:26] (PS1) Arturo Borrero Gonzalez: cloudgw: support intervals in dmz_cidr [puppet] - https://gerrit.wikimedia.org/r/917831
[09:43:46] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/914748 (owner: EoghanGaffney)
[09:45:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:45:50] (CR) Btullis: [C: +1] "Looks good to me too." [deployment-charts] - https://gerrit.wikimedia.org/r/917820 (https://phabricator.wikimedia.org/T336134) (owner: DCausse)
[09:45:55] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: support intervals in dmz_cidr [puppet] - https://gerrit.wikimedia.org/r/917831 (owner: Arturo Borrero Gonzalez)
[09:48:58] (PS3) Arturo Borrero Gonzalez: cloudcontrol2001-dev: introduce cloudlb support [puppet] - https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153)
[09:49:06] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[09:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P47964 and previous config saved to /var/cache/conftool/dbconfig/20230509-094925-ladsgroup.json
[09:49:36] (PS1) Marostegui: db_mysql.py: Replace zarcillo master [software/wmfdb] - https://gerrit.wikimedia.org/r/917833 (https://phabricator.wikimedia.org/T334455)
[09:50:14] (PS5) Hashar: ci: rm gitcache absented timers [puppet] - https://gerrit.wikimedia.org/r/916515
[09:50:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster -
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:51:16] SRE-tools, DNS, Infrastructure-Foundations, netbox: Enforce Netbox domain names without period termination - https://phabricator.wikimedia.org/T306809 (ayounsi) Open→Resolved a:ayounsi Done using Netbox validators.
[09:51:22] (CR) CI reject: [V: -1] db_mysql.py: Replace zarcillo master [software/wmfdb] - https://gerrit.wikimedia.org/r/917833 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui)
[09:51:26] (CR) Hashar: [C: +1] "I have confirmed via cumin that the timers have been removed from all hosts in the integration WMCS project. Success!" [puppet] - https://gerrit.wikimedia.org/r/916515 (owner: Hashar)
[09:51:29] group 0 looks calm
[09:51:35] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/917326 (owner: Arturo Borrero Gonzalez)
[09:51:42] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/917327 (owner: Arturo Borrero Gonzalez)
[09:52:43] (CR) Arturo Borrero Gonzalez: [C: +2] haproxy: check_haproxy: don't use a tempfile [puppet] - https://gerrit.wikimedia.org/r/917326 (owner: Arturo Borrero Gonzalez)
[09:53:08] (CR) Jbond: [C: +2] ci: rm gitcache absented timers [puppet] - https://gerrit.wikimedia.org/r/916515 (owner: Hashar)
[09:53:16] (CR) Arturo Borrero Gonzalez: [C: +2] haproxy: check_haproxy: remove unused variables [puppet] - https://gerrit.wikimedia.org/r/917327 (owner: Arturo Borrero Gonzalez)
[09:53:29] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2001-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002"
[09:54:07] jbond: merging `ci: rm gitcache absented timers (5e212bf9b2)` in the puppetmaster
[09:54:12] arturo: yes please
[09:54:18] thanks
[09:54:39] hashar: fyi merged
[09:54:44] (PS3) Arturo Borrero Gonzalez: haproxy: check_haproxy: prevent globbing and word splitting [puppet] - https://gerrit.wikimedia.org/r/917328
[09:54:55] (PS3) Arturo Borrero Gonzalez: haproxy: check_haproxy: introduce new check mode --check=someup [puppet] - https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992)
[09:55:24] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2001-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002"
[09:55:24] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:55:25] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudcontrol2001-dev.wikimedia.org
[09:56:05] SRE, Infrastructure-Foundations, Traffic-Icebox, netbox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (ayounsi)
[09:56:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P47965 and previous config saved to /var/cache/conftool/dbconfig/20230509-095607-ladsgroup.json
[09:58:11] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/917328 (owner: Arturo Borrero Gonzalez)
[09:58:31] (CR) Arturo Borrero Gonzalez: [C: +2] haproxy: check_haproxy: prevent globbing and word splitting [puppet] - https://gerrit.wikimedia.org/r/917328 (owner: Arturo Borrero Gonzalez)
[10:00:07] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1000)
[10:00:53] (PS1) Ladsgroup: tox: Bump black version [software/wmfdb] - https://gerrit.wikimedia.org/r/917839 (https://phabricator.wikimedia.org/T336240)
[10:00:55] (Abandoned) Majavah: kubernetes: Set php7.4 as the default backend
[software/tools-webservice] - https://gerrit.wikimedia.org/r/713303 (owner: Majavah)
[10:03:41] SRE, MW-on-K8s, Traffic, serviceops: Add traffic sampling support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336038 (Joe) Open→In progress p:Triage→High
[10:03:51] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe)
[10:03:59] (PS1) Giuseppe Lavagetto: trafficserver: make mw-on-k8s use a config file [puppet] - https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037)
[10:04:03] (PS1) Giuseppe Lavagetto: trafficserver: allow partial traffic flow to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038)
[10:04:12] (PS2) Ladsgroup: tox: Bump black version [software/wmfdb] - https://gerrit.wikimedia.org/r/917839 (https://phabricator.wikimedia.org/T336240)
[10:04:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P47966 and previous config saved to /var/cache/conftool/dbconfig/20230509-100431-ladsgroup.json
[10:05:18] (PS1) Muehlenhoff: cluster::management: Add fr-tech-admins to profile::admin::groups [puppet] - https://gerrit.wikimedia.org/r/917842 (https://phabricator.wikimedia.org/T334154)
[10:05:43] (CR) Muehlenhoff: [C: +1] "Followup (had missed that in review): https://gerrit.wikimedia.org/r/917842" [puppet] - https://gerrit.wikimedia.org/r/917358 (https://phabricator.wikimedia.org/T334154) (owner: Jbond)
[10:06:55] (PS1) Arturo Borrero Gonzalez: cloudgw: merge eqiad/codfw dmz_cidr configuration [puppet] - https://gerrit.wikimedia.org/r/917843
[10:08:14] (PS2) Arturo Borrero Gonzalez: cloudgw: merge eqiad/codfw dmz_cidr configuration [puppet] - https://gerrit.wikimedia.org/r/917843
[10:09:49] (PS3) Arturo Borrero Gonzalez:
cloudgw: merge eqiad/codfw dmz_cidr configuration [puppet] - https://gerrit.wikimedia.org/r/917843
[10:10:45] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P47967 and previous config saved to /var/cache/conftool/dbconfig/20230509-101113-ladsgroup.json
[10:11:46] (CR) Jcrespo: [C: +1] "Looks right to me, if it works." [software/wmfdb] - https://gerrit.wikimedia.org/r/917839 (https://phabricator.wikimedia.org/T336240) (owner: Ladsgroup)
[10:12:05] (CR) Jbond: "lgtm but see nit." [puppet] - https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[10:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:13:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[10:13:53] (Abandoned) Majavah: P:toolforge: install toolforge-logs-cli [puppet] - https://gerrit.wikimedia.org/r/916791 (https://phabricator.wikimedia.org/T336057) (owner: Majavah)
[10:14:12] (PS3) Majavah: P:toolforge: merge jobs_framework_cli to toolforge_cli [puppet] - https://gerrit.wikimedia.org/r/916792
[10:15:04] (PS4) Arturo Borrero Gonzalez: cloudgw: merge eqiad/codfw dmz_cidr configuration [puppet] - https://gerrit.wikimedia.org/r/917843
[10:15:38] (CR) Jbond: [C: +1] "lgtm cheers" [puppet] -
https://gerrit.wikimedia.org/r/917842 (https://phabricator.wikimedia.org/T334154) (owner: Muehlenhoff)
[10:17:42] (CR) Hashar: [C: -1] "The sole point of this patch was to move the LFS data from /srv/gerrit/plugins/lfs to /srv/gerrit/data/lfs (the default in Gerrit config) " [puppet] - https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: Hashar)
[10:18:04] (PS5) Arturo Borrero Gonzalez: cloudgw: merge eqiad/codfw dmz_cidr configuration [puppet] - https://gerrit.wikimedia.org/r/917843
[10:18:25] (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge: merge jobs_framework_cli to toolforge_cli [puppet] - https://gerrit.wikimedia.org/r/916792 (owner: Majavah)
[10:18:31] (Access port speed <= 100Mbps) firing: (2) Alert for device asw-b-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[10:19:28] (CR) Muehlenhoff: [C: +2] cluster::management: Add fr-tech-admins to profile::admin::groups [puppet] - https://gerrit.wikimedia.org/r/917842 (https://phabricator.wikimedia.org/T334154) (owner: Muehlenhoff)
[10:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T335845)', diff saved to https://phabricator.wikimedia.org/P47968 and previous config saved to /var/cache/conftool/dbconfig/20230509-101938-ladsgroup.json
[10:19:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[10:19:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[10:20:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T335845)', diff saved to https://phabricator.wikimedia.org/P47969 and previous config saved to /var/cache/conftool/dbconfig/20230509-102001-ladsgroup.json
[10:20:25] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/917843/41090/" [puppet] - https://gerrit.wikimedia.org/r/917843 (owner: Arturo Borrero Gonzalez)
[10:20:36] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudgw: merge eqiad/codfw dmz_cidr configuration [puppet] - https://gerrit.wikimedia.org/r/917843 (owner: Arturo Borrero Gonzalez)
[10:21:27] (PS1) Volans: validators: avoid external dep. for the DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917846
[10:21:56] (CR) CI reject: [V: -1] validators: avoid external dep. for the DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917846 (owner: Volans)
[10:23:57] (PS2) Volans: validators: avoid external dep. for the DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917846
[10:24:29] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down.
[10:24:54] (CR) Volans: [C: +2] "Self-merging to unblock cookbooks and netbox changes, happy to revisit later" [software/netbox-extras] - https://gerrit.wikimedia.org/r/917846 (owner: Volans)
[10:24:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down.
[10:25:04] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=12105eb2-e5ac-4f19-9896-9ba53e1acd48) set by cmooney@cumin1001 f...
[10:25:08] (PS15) Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639)
[10:25:31] (Merged) jenkins-bot: validators: avoid external dep.
for the DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917846 (owner: Volans) [10:25:55] (PS4) Arturo Borrero Gonzalez: cloudcontrol2001-dev: introduce cloudlb support [puppet] - https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) [10:26:07] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary [10:26:19] !log volans@cumin1001 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling update on A:netbox-canary [10:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T335845)', diff saved to https://phabricator.wikimedia.org/P47970 and previous config saved to /var/cache/conftool/dbconfig/20230509-102619-ladsgroup.json [10:26:24] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [10:26:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [10:26:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [10:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T335845)', diff saved to https://phabricator.wikimedia.org/P47971 and previous config saved to /var/cache/conftool/dbconfig/20230509-102644-ladsgroup.json [10:26:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T335845)', diff saved to https://phabricator.wikimedia.org/P47972 and previous config saved to /var/cache/conftool/dbconfig/20230509-102652-ladsgroup.json [10:27:03] (CR) Jbond: "updated thanks" [cookbooks] - https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: Jbond) [10:28:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:04] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2001-dev.wikimedia.org [10:30:10] (PS3) Barakat Ajadi: CentralNoticeTiming: remove central timing [puppet] - https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) [10:32:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T335845)', diff saved to https://phabricator.wikimedia.org/P47973 and previous config saved to /var/cache/conftool/dbconfig/20230509-103209-ladsgroup.json [10:32:33] !log aborrero@cumin2002 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts cloudcontrol2001-dev.wikimedia.org [10:33:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:24] (CR) Kamila Součková: "LGTM, but another set of eyes would be very welcome" [puppet] - https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037) (owner: Giuseppe Lavagetto) [10:34:32] (PS1) Volans: validators: fix ipaddress DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917850 [10:35:15] (CR) Volans: [C: +2] "Self-merging to unblock cookbooks and netbox changes (2), happy to amend it later" [software/netbox-extras] - https://gerrit.wikimedia.org/r/917850 (owner: Volans) [10:35:46] (Merged) jenkins-bot: validators: fix ipaddress DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917850 (owner: Volans) [10:36:09] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [10:36:19] !log 
volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [10:36:36] (PS2) Majavah: kubernetes: Allow configuring the toolforge.org public domain [software/tools-webservice] - https://gerrit.wikimedia.org/r/916874 (https://phabricator.wikimedia.org/T257386) [10:39:09] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2001-dev.wikimedia.org [10:40:39] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 32.38 ms [10:41:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P47974 and previous config saved to /var/cache/conftool/dbconfig/20230509-104158-ladsgroup.json [10:42:09] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:42:38] (KubernetesCalicoDown) resolved: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:42:39] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [10:43:17] (PS1) Muehlenhoff: Failover idp.w.o for reboot [dns] - https://gerrit.wikimedia.org/r/917852 [10:43:31] (Access port speed <= 100Mbps) firing: (2) Device asw-b-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [10:44:03] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [10:45:16] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:45:16] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudcontrol2001-dev.wikimedia.org 
[10:47:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P47975 and previous config saved to /var/cache/conftool/dbconfig/20230509-104715-ladsgroup.json [10:48:39] (CR) EoghanGaffney: [gitlab/runner] Add basic pool/depool commands (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913199 (owner: EoghanGaffney) [10:48:41] (CR) Marostegui: "recheck" [software/wmfdb] - https://gerrit.wikimedia.org/r/917833 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [10:49:15] (PS5) EoghanGaffney: [gitlab/runner] Add basic pool/depool commands [puppet] - https://gerrit.wikimedia.org/r/913199 [10:53:52] (CR) Kamila Součková: [C: +1] "LGTM (I think)" [puppet] - https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038) (owner: Giuseppe Lavagetto) [10:55:23] (CR) Ladsgroup: [C: +2] tox: Bump black version [software/wmfdb] - https://gerrit.wikimedia.org/r/917839 (https://phabricator.wikimedia.org/T336240) (owner: Ladsgroup) [10:55:33] (PS69) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [10:56:03] PROBLEM - OSPF status on lsw1-e1-eqiad.mgmt is CRITICAL: OSPFv2: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:56:57] (Merged) jenkins-bot: tox: Bump black version [software/wmfdb] - https://gerrit.wikimedia.org/r/917839 (https://phabricator.wikimedia.org/T336240) (owner: Ladsgroup) [10:57:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P47976 and previous config saved to /var/cache/conftool/dbconfig/20230509-105704-ladsgroup.json [10:57:24] (CR) Majavah: puppetserver: add puppetserver module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/895356 
(https://phabricator.wikimedia.org/T330490) (owner: Jbond) [10:57:26] (CR) Ladsgroup: "recheck" [software/wmfdb] - https://gerrit.wikimedia.org/r/917833 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [11:00:29] (CR) Marostegui: "\o/" [software/wmfdb] - https://gerrit.wikimedia.org/r/917833 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [11:00:36] (CR) Marostegui: [C: +2] db_mysql.py: Replace zarcillo master [software/wmfdb] - https://gerrit.wikimedia.org/r/917833 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [11:02:02] (Merged) jenkins-bot: db_mysql.py: Replace zarcillo master [software/wmfdb] - https://gerrit.wikimedia.org/r/917833 (https://phabricator.wikimedia.org/T334455) (owner: Marostegui) [11:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P47977 and previous config saved to /var/cache/conftool/dbconfig/20230509-110222-ladsgroup.json [11:02:39] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:03:31] (CR) Krinkle: [C: +1] CentralNoticeTiming: remove central timing [puppet] - https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: Barakat Ajadi) [11:06:08] (PS70) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [11:06:58] (CR) EoghanGaffney: [C: +2] [gitlab/runner] Add basic pool/depool commands [puppet] - https://gerrit.wikimedia.org/r/913199 (owner: EoghanGaffney) [11:07:23] (PS3) KartikMistry: Update MinT to 2023-05-09-082017-production [deployment-charts] - https://gerrit.wikimedia.org/r/917819 (https://phabricator.wikimedia.org/T331505) [11:07:33] (PS71) Jbond: puppetserver: 
add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [11:08:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host ldap-rw1001.wikimedia.org with OS bullseye [11:08:19] SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host ldap-rw1001.wikimedia.org with OS bullseye [11:10:20] (PS4) KartikMistry: Update MinT to 2023-05-09-110213-production [deployment-charts] - https://gerrit.wikimedia.org/r/917819 (https://phabricator.wikimedia.org/T331505) [11:10:47] (PS1) Gerrit maintenance bot: mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/916918 (https://phabricator.wikimedia.org/T336248) [11:10:51] (PS1) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/916919 (https://phabricator.wikimedia.org/T336248) [11:10:54] (PS72) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [11:11:03] (Abandoned) Marostegui: mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/916918 (https://phabricator.wikimedia.org/T336248) (owner: Gerrit maintenance bot) [11:11:09] (Abandoned) Marostegui: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/916919 (https://phabricator.wikimedia.org/T336248) (owner: Gerrit maintenance bot) [11:12:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T335845)', diff saved to https://phabricator.wikimedia.org/P47978 and previous config saved to /var/cache/conftool/dbconfig/20230509-111211-ladsgroup.json [11:12:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance 
[11:12:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:12:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T335845)', diff saved to https://phabricator.wikimedia.org/P47979 and previous config saved to /var/cache/conftool/dbconfig/20230509-111235-ladsgroup.json [11:12:56] (PS1) Marostegui: install_server: Change the placeholder [puppet] - https://gerrit.wikimedia.org/r/917857 [11:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:13:34] (CR) Marostegui: [C: +2] install_server: Change the placeholder [puppet] - https://gerrit.wikimedia.org/r/917857 (owner: Marostegui) [11:15:03] * kart_ updating MinT [11:15:18] (CR) KartikMistry: [C: +2] Update MinT to 2023-05-09-110213-production [deployment-charts] - https://gerrit.wikimedia.org/r/917819 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry) [11:15:35] ops-codfw, Patch-For-Review, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (aborrero) a:aborrero→Papaul [11:15:43] (PS73) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [11:16:04] (Merged) jenkins-bot: Update MinT to 2023-05-09-110213-production [deployment-charts] - https://gerrit.wikimedia.org/r/917819 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry) [11:16:06] ops-codfw, Patch-For-Review, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a 
cloudlb backend - https://phabricator.wikimedia.org/T336236 (aborrero) hey @Papaul could you please physical connect this host to cloudsw1-b1-codfw instead of asw-b1-codfw? https://netbox.wikimedia.org... [11:16:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-rw1001.wikimedia.org with reason: host reimage [11:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T335845)', diff saved to https://phabricator.wikimedia.org/P47980 and previous config saved to /var/cache/conftool/dbconfig/20230509-111730-ladsgroup.json [11:17:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [11:17:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [11:17:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T335845)', diff saved to https://phabricator.wikimedia.org/P47981 and previous config saved to /var/cache/conftool/dbconfig/20230509-111755-ladsgroup.json [11:18:21] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:18:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T335845)', diff saved to https://phabricator.wikimedia.org/P47982 and previous config saved to /var/cache/conftool/dbconfig/20230509-111851-ladsgroup.json [11:19:55] (PS3) KartikMistry: Update cxserver to 2023-05-08-134152-production [deployment-charts] - https://gerrit.wikimedia.org/r/917338 (https://phabricator.wikimedia.org/T336115) [11:19:59] (CR) Jbond: "updated, fyi if you are happy it like to merge this, its already on its 73 revision and is starting to get a bit bigger then id like. 
It " [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: Jbond) [11:20:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-rw1001.wikimedia.org with reason: host reimage [11:20:12] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [11:23:53] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:24:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:25:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T335845)', diff saved to https://phabricator.wikimedia.org/P47983 and previous config saved to /var/cache/conftool/dbconfig/20230509-112535-ladsgroup.json [11:27:04] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:29:55] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [11:30:38] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff) [11:30:58] (PS3) Majavah: P:toolforge: webservice: set public_domain config [puppet] - https://gerrit.wikimedia.org/r/916875 (https://phabricator.wikimedia.org/T257386) [11:31:20] (PS4) Arturo Borrero Gonzalez: haproxy: check_haproxy: introduce new check mode --check=someup [puppet] - https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992) [11:31:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ldap-rw1001.wikimedia.org with OS bullseye [11:31:56] SRE, 
Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host ldap-rw1001.wikimedia.org with OS bullseye completed: - ldap-rw1001 (**PASS**) - R... [11:33:25] (CR) Arturo Borrero Gonzalez: [C: +2] kubernetes: Allow configuring the toolforge.org public domain [software/tools-webservice] - https://gerrit.wikimedia.org/r/916874 (https://phabricator.wikimedia.org/T257386) (owner: Majavah) [11:33:41] (CR) Arturo Borrero Gonzalez: [C: +2] P:toolforge: webservice: set public_domain config [puppet] - https://gerrit.wikimedia.org/r/916875 (https://phabricator.wikimedia.org/T257386) (owner: Majavah) [11:33:52] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [11:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P47984 and previous config saved to /var/cache/conftool/dbconfig/20230509-113357-ladsgroup.json [11:34:08] (Merged) jenkins-bot: kubernetes: Allow configuring the toolforge.org public domain [software/tools-webservice] - https://gerrit.wikimedia.org/r/916874 (https://phabricator.wikimedia.org/T257386) (owner: Majavah) [11:34:15] (PS74) Jbond: puppetserver: add puppetserver module [puppet] - https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [11:35:13] (CR) Majavah: Add logs action (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/917825 (https://phabricator.wikimedia.org/T336057) (owner: Majavah) [11:36:31] !log Updated MinT to 2023-05-09-110213-production (T331505, T335725, T331505) [11:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:36] T335725: MinT: Santhali (sat) machine translation doesn't seem to output Santali - 
https://phabricator.wikimedia.org/T335725 [11:36:36] T331505: Self hosted machine translation service - https://phabricator.wikimedia.org/T331505 [11:38:38] (CR) Arturo Borrero Gonzalez: [C: +2] Add logs action (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/917825 (https://phabricator.wikimedia.org/T336057) (owner: Majavah) [11:38:52] (PS1) Hnowlan: svg: set LC_ALL instead of LANG [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/917861 (https://phabricator.wikimedia.org/T335361) [11:39:23] (Merged) jenkins-bot: Add logs action [software/tools-webservice] - https://gerrit.wikimedia.org/r/917825 (https://phabricator.wikimedia.org/T336057) (owner: Majavah) [11:40:26] (CR) Kamila Součková: [C: +1] "LGTM" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/917861 (https://phabricator.wikimedia.org/T335361) (owner: Hnowlan) [11:40:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P47985 and previous config saved to /var/cache/conftool/dbconfig/20230509-114041-ladsgroup.json [11:42:45] (PS1) Jbond: utils: terminate flags to prevent ambiguity [puppet] - https://gerrit.wikimedia.org/r/917862 [11:43:03] * kart_ updating cxserver now.. 
[11:43:12] (CR) KartikMistry: [C: +2] Update cxserver to 2023-05-08-134152-production [deployment-charts] - https://gerrit.wikimedia.org/r/917338 (https://phabricator.wikimedia.org/T336115) (owner: KartikMistry) [11:43:53] (Merged) jenkins-bot: Update cxserver to 2023-05-08-134152-production [deployment-charts] - https://gerrit.wikimedia.org/r/917338 (https://phabricator.wikimedia.org/T336115) (owner: KartikMistry) [11:45:32] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:45:51] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:49:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P47986 and previous config saved to /var/cache/conftool/dbconfig/20230509-114903-ladsgroup.json [11:50:50] (PS1) DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092) [11:51:05] (CR) Ayounsi: [C: +1] validators: fix ipaddress DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917850 (owner: Volans) [11:51:10] (CR) Ayounsi: [C: +1] validators: avoid external dep. 
for the DNS name [software/netbox-extras] - https://gerrit.wikimedia.org/r/917846 (owner: Volans) [11:53:08] (PS2) DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/917863 (https://phabricator.wikimedia.org/T336092) [11:53:18] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:53:53] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:54:47] (CR) Arturo Borrero Gonzalez: haproxy: check_haproxy: introduce new check mode --check=someup (2 comments) [puppet] - https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez) [11:55:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P47987 and previous config saved to /var/cache/conftool/dbconfig/20230509-115547-ladsgroup.json [11:57:53] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:58:28] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:58:35] !log eoghan@cumin1001 START - Cookbook sre.hosts.decommission for hosts aphlict1001.eqiad.wmnet [12:02:04] !log Updated cxserver to 2023-05-08-134152-production (T336115, T335987, T331835) [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:11] T335987: Post-creation work for gpewiki - https://phabricator.wikimedia.org/T335987 [12:02:11] T336115: Post-creation work for btmwiktionary - https://phabricator.wikimedia.org/T336115 [12:02:11] T331835: Create new MT client in cxserver for self hosted MT service - https://phabricator.wikimedia.org/T331835 [12:02:26] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [12:02:38] (PS1) Arturo Borrero Gonzalez: cloudcontrol2001-dev: drop references to the old FQDN [puppet] - https://gerrit.wikimedia.org/r/917866 
(https://phabricator.wikimedia.org/T336236) [12:03:09] (PS2) Arturo Borrero Gonzalez: cloudcontrol2001-dev: drop references to the old FQDN [puppet] - https://gerrit.wikimedia.org/r/917866 (https://phabricator.wikimedia.org/T336236) [12:04:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T335845)', diff saved to https://phabricator.wikimedia.org/P47988 and previous config saved to /var/cache/conftool/dbconfig/20230509-120410-ladsgroup.json [12:04:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [12:04:27] (CR) Arturo Borrero Gonzalez: [C: +2] cloudcontrol2001-dev: drop references to the old FQDN [puppet] - https://gerrit.wikimedia.org/r/917866 (https://phabricator.wikimedia.org/T336236) (owner: Arturo Borrero Gonzalez) [12:04:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [12:04:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T335845)', diff saved to https://phabricator.wikimedia.org/P47989 and previous config saved to /var/cache/conftool/dbconfig/20230509-120433-ladsgroup.json [12:06:02] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aphlict1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [12:10:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T335845)', diff saved to https://phabricator.wikimedia.org/P47990 and previous config saved to /var/cache/conftool/dbconfig/20230509-121053-ladsgroup.json [12:11:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [12:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 
(T335845)', diff saved to https://phabricator.wikimedia.org/P47991 and previous config saved to /var/cache/conftool/dbconfig/20230509-121102-ladsgroup.json [12:11:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [12:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T335845)', diff saved to https://phabricator.wikimedia.org/P47992 and previous config saved to /var/cache/conftool/dbconfig/20230509-121119-ladsgroup.json [12:13:32] ops-codfw, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab2002.wikimedia.org (A1) - https://phabricator.wikimedia.org/T336258 (Jelto) [12:14:49] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aphlict1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [12:14:49] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:14:50] (CR) Jelto: [V: +1 C: +2] gitlab: enable and run partial backups daily [puppet] - https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: Jelto) [12:14:50] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aphlict1001.eqiad.wmnet [12:19:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T335845)', diff saved to https://phabricator.wikimedia.org/P47994 and previous config saved to /var/cache/conftool/dbconfig/20230509-121941-ladsgroup.json [12:19:55] (CR) Jelto: miscweb: add annualreport release to miscweb (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: Jelto) [12:20:46] (PS5) Arturo Borrero Gonzalez: cloudcontrol2001-dev: introduce cloudlb support [puppet] - 
https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) [12:21:41] (CR) Majavah: "You should add the cloudlb nodes to profile::openstack::codfw1dev::haproxy_nodes." [puppet] - https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) (owner: Arturo Borrero Gonzalez) [12:26:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:26:07] (PS6) Arturo Borrero Gonzalez: cloudcontrol2001-dev: introduce cloudlb support [puppet] - https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) [12:26:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P47995 and previous config saved to /var/cache/conftool/dbconfig/20230509-122608-ladsgroup.json [12:26:29] (PS1) Matthias Mullie: Add $wgInterwikiLogoOverride [mediawiki-config] - https://gerrit.wikimedia.org/r/917871 (https://phabricator.wikimedia.org/T315269) [12:27:13] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host eventlog1003.eqiad.wmnet [12:29:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host ldap-rw2001.wikimedia.org with OS bullseye [12:29:55] SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host ldap-rw2001.wikimedia.org with OS bullseye [12:30:21] (PS1) Majavah: d/changelog: Prepare for release 0.96 [software/tools-webservice] - https://gerrit.wikimedia.org/r/917872 [12:31:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes 
(tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host eventlog1003.eqiad.wmnet [12:31:38] (03CR) 10Majavah: cloudcontrol2001-dev: introduce cloudlb support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) (owner: 10Arturo Borrero Gonzalez) [12:32:00] (03CR) 10JMeybohm: [C: 04-1] miscweb: add annualreport release to miscweb (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:32:02] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for release 0.96 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917872 (owner: 10Majavah) [12:33:26] (03PS5) 10Jelto: miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) [12:33:37] (03Merged) 10jenkins-bot: d/changelog: Prepare for release 0.96 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917872 (owner: 10Majavah) [12:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P47996 and previous config saved to /var/cache/conftool/dbconfig/20230509-123447-ladsgroup.json [12:35:36] (03PS1) 10EoghanGaffney: [aphlict] Remove aphlict1001 CNAME [dns] - 10https://gerrit.wikimedia.org/r/917873 (https://phabricator.wikimedia.org/T333452) [12:37:04] (03CR) 10Jelto: miscweb: add annualreport release to miscweb (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [12:41:15] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P47997 and previous config saved to /var/cache/conftool/dbconfig/20230509-124114-ladsgroup.json [12:41:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-rw2001.wikimedia.org with reason: host reimage [12:45:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-rw2001.wikimedia.org with reason: host reimage [12:49:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P47999 and previous config saved to /var/cache/conftool/dbconfig/20230509-124953-ladsgroup.json [12:50:10] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [12:52:06] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) I will send an email to @KFrancis now. I however think the following needs to happen here with my Phabricator account to bring it into compliance with WMDE Employee conventions: 1) Update my... 
[12:54:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10BTullis) [12:54:32] (03PS75) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [12:55:21] (03CR) 10Muehlenhoff: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:56:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T335845)', diff saved to https://phabricator.wikimedia.org/P48000 and previous config saved to /var/cache/conftool/dbconfig/20230509-125620-ladsgroup.json [12:56:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:56:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:56:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T335845)', diff saved to https://phabricator.wikimedia.org/P48001 and previous config saved to /var/cache/conftool/dbconfig/20230509-125644-ladsgroup.json [12:56:48] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/failover] Rename host flags [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [12:57:07] (03CR) 10CI reject: [V: 04-1] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:58:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ldap-rw2001.wikimedia.org with OS bullseye [12:58:09] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 
(10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host ldap-rw2001.wikimedia.org with OS bullseye completed: - ldap-rw2001 (**PASS**) - R... [12:58:40] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1088.eqiad.wmnet with reason: Upgrading RAID controller firmware [12:58:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1088.eqiad.wmnet with reason: Upgrading RAID controller firmware [12:59:17] (03Merged) 10jenkins-bot: [gitlab/failover] Rename host flags [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [12:59:37] (03PS1) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1300) [13:00:07] (03CR) 10CI reject: [V: 04-1] Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:00:21] o/ I'm going to use this opportunity to push RealMe i18n to the cluster [13:00:24] nothing to do indeed. [13:00:25] (03PS2) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [13:00:34] oh, too late. go ahead taavi ! 
[13:00:53] (03CR) 10CI reject: [V: 04-1] Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:01:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910767 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [13:01:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync after adding ldap-rw servers - jmm@cumin2002" [13:01:51] (03Merged) 10jenkins-bot: Add RealMe to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910767 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [13:02:36] !log taavi@deploy1002 Started scap: Backport for [[gerrit:910767|Add RealMe to extension-list (T324535)]] [13:02:40] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [13:04:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T335845)', diff saved to https://phabricator.wikimedia.org/P48002 and previous config saved to /var/cache/conftool/dbconfig/20230509-130404-ladsgroup.json [13:04:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync after adding ldap-rw servers - jmm@cumin2002" [13:04:42] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:05:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T335845)', diff saved to https://phabricator.wikimedia.org/P48003 and previous config saved to /var/cache/conftool/dbconfig/20230509-130459-ladsgroup.json [13:05:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [13:05:18] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [13:05:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T335845)', diff saved to https://phabricator.wikimedia.org/P48004 and previous config saved to /var/cache/conftool/dbconfig/20230509-130524-ladsgroup.json [13:05:34] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:38] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10Papaul) @jcrespo that is normal is it just telling you that there were some changes made on those DIMM's [13:10:25] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T336213 (10Papaul) 05Open→03Resolved a:03Papaul same server same error [13:12:09] (03CR) 10Filippo Giunchedi: "LGTM overall, only minor comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [13:12:21] (03PS3) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [13:12:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T335845)', diff saved to https://phabricator.wikimedia.org/P48005 and previous config saved to /var/cache/conftool/dbconfig/20230509-131231-ladsgroup.json [13:12:48] (03CR) 10CI reject: [V: 04-1] Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:15:24] (03PS4) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [13:16:20] 10SRE, 10ops-codfw, 10DBA: 
Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Papaul) @Ladsgroup @Marostegui i am ready now if you are to take it down thanks [13:17:05] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet - https://phabricator.wikimedia.org/T336261 (10BTullis) [13:17:18] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet - https://phabricator.wikimedia.org/T336261 (10BTullis) [13:18:01] 10SRE, 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui) Give me 1h as I am in a meeting and I will get it down for you :) [13:18:31] 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2002.wikimedia.org (A1) - https://phabricator.wikimedia.org/T336258 (10Papaul) a:03Jhancock.wm @Jhancock.wm can you please take care of this? thanks [13:19:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P48006 and previous config saved to /var/cache/conftool/dbconfig/20230509-131910-ladsgroup.json [13:19:45] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Papaul) a:03ssingh [13:21:31] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Aklapper) @lojo: For #2 see the box on https://www.mediawiki.org/wiki/Phabricator/Help#Using_your_Wikimedia_developer_account . I can do #1 afterwards. 
[13:22:38] (03PS1) 10Ssingh: varnish: bump size of varnish shared memory log to 160M (esams) [puppet] - 10https://gerrit.wikimedia.org/r/917878 (https://phabricator.wikimedia.org/T253093) [13:23:34] !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for an-worker1088.eqiad.wmnet [13:23:34] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-worker1088.eqiad.wmnet [13:23:45] (03PS76) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [13:23:47] !log taavi@deploy1002 taavi: Backport for [[gerrit:910767|Add RealMe to extension-list (T324535)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:23:51] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [13:23:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1001.eqiad.wmnet [13:24:10] (03CR) 10CI reject: [V: 04-1] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:24:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41091/console" [puppet] - 10https://gerrit.wikimedia.org/r/917878 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [13:24:20] looks good, syncing [13:24:31] jouncebot: nowandnext [13:24:31] For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1300) [13:24:31] For the next 0 hour(s) and 35 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1300) [13:24:31] In 0 hour(s) and 35 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1400) 
[13:25:25] (03PS1) 10Majavah: kubernetes: fix request_stop() [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917879 [13:25:29] (03PS1) 10Majavah: tox: Add mypy [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917880 [13:25:53] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet - https://phabricator.wikimedia.org/T336261 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [13:26:09] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet - https://phabricator.wikimedia.org/T336261 (10Jclark-ctr) @BTullis is server shutdown for me to replace? [13:26:44] sukhe: hi, I'm deploying a patch atm. I'm hoping it'll finish in time, it's adding some new localization files which usually takes a while [13:27:06] !log updated bookworm d-i image to 2022-05-09 daily build T330495 [13:27:10] taavi: all good! thanks [13:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:11] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [13:27:29] (03PS77) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [13:27:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P48007 and previous config saved to /var/cache/conftool/dbconfig/20230509-132737-ladsgroup.json [13:27:55] (03CR) 10CI reject: [V: 04-1] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:28:08] (03PS1) 10Ssingh: hiera: add dns2004 to ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/917881 (https://phabricator.wikimedia.org/T330670) [13:28:15] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on an-worker1088.eqiad.wmnet with reason: 
Replacing RAID controller battery [13:28:18] (03PS2) 10Majavah: kubernetes: fix request_stop() [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917879 [13:28:22] (03PS2) 10Majavah: tox: Add mypy [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917880 [13:28:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1001.eqiad.wmnet [13:28:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-worker1088.eqiad.wmnet with reason: Replacing RAID controller battery [13:28:38] 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet - https://phabricator.wikimedia.org/T336261 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=907e830f-d99b-4d2e-8752-2c13c8385200) set by btullis@cumin1001 for 4:00:00 on 1 host(s) and th... [13:28:50] (03CR) 10David Caro: [C: 03+1] kubernetes: fix request_stop() [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917879 (owner: 10Majavah) [13:29:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) Excellent. I've linked them. Thank you! 
[13:29:01] (03CR) 10CI reject: [V: 04-1] kubernetes: fix request_stop() [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917879 (owner: 10Majavah) [13:29:14] (03CR) 10CI reject: [V: 04-1] tox: Add mypy [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917880 (owner: 10Majavah) [13:29:25] (03PS3) 10Majavah: kubernetes: fix request_stop() [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917879 [13:29:29] (03PS3) 10Majavah: tox: Add mypy [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917880 [13:29:42] (03CR) 10Majavah: [C: 03+2] kubernetes: fix request_stop() [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917879 (owner: 10Majavah) [13:29:58] 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2002.wikimedia.org (A1) - https://phabricator.wikimedia.org/T336258 (10Jhancock.wm) 05Open→03Resolved @Papaul 2 x 1.92 TB drives inserted into gitlab2002 [13:30:02] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet - https://phabricator.wikimedia.org/T336261 (10BTullis) @Jclark-ctr Many thanks. I have shut down the machine now. Please feel free to boot it once you've finished, as it should rejoin the cluster... 
[13:30:25] (03Merged) 10jenkins-bot: kubernetes: fix request_stop() [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917879 (owner: 10Majavah) [13:30:59] (03PS1) 10Ssingh: lvs2008: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/917882 (https://phabricator.wikimedia.org/T335777) [13:31:32] (03CR) 10Vgutierrez: [C: 03+1] varnish: bump size of varnish shared memory log to 160M (esams) [puppet] - 10https://gerrit.wikimedia.org/r/917878 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [13:31:42] (03PS1) 10Majavah: d/changelog: Prepare for release 0.97 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917883 [13:31:55] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for release 0.97 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917883 (owner: 10Majavah) [13:33:02] (03PS4) 10Majavah: toolforge: wmcs-package-build: support .git suffix in URLs [puppet] - 10https://gerrit.wikimedia.org/r/916787 [13:33:04] (03PS4) 10Majavah: toolforge: wmcs-package-build: support backports and -tools packages [puppet] - 10https://gerrit.wikimedia.org/r/917361 [13:33:33] (03Merged) 10jenkins-bot: d/changelog: Prepare for release 0.97 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917883 (owner: 10Majavah) [13:34:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P48008 and previous config saved to /var/cache/conftool/dbconfig/20230509-133416-ladsgroup.json [13:34:28] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs2008 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/917885 (https://phabricator.wikimedia.org/T335777) [13:34:49] (03PS15) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [13:35:28] (03CR) 10CI reject: [V: 04-1] Set DoProbe cookie to initiate a probe 
[puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:36:02] PROBLEM - puppet last run on prometheus2005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:36:48] (03PS16) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [13:36:55] (03PS78) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [13:37:36] PROBLEM - puppet last run on prometheus4002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:38:06] PROBLEM - puppet last run on prometheus3002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:38:13] (03CR) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:38:24] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:910767|Add RealMe to extension-list (T324535)]] (duration: 35m 47s) [13:38:27] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [13:38:27] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish: bump size of varnish shared memory log to 160M (esams) [puppet] - 10https://gerrit.wikimedia.org/r/917878 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [13:38:36] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (10Andrew) [13:38:50] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:40:05] (03PS1) 
10Andrew Bogott: git-sync-upstream: use --rebase-merges instead of --preserve-merges [puppet] - 10https://gerrit.wikimedia.org/r/917888 (https://phabricator.wikimedia.org/T336263) [13:40:25] (03PS2) 10Majavah: Add $wmgUseRealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910768 (https://phabricator.wikimedia.org/T324535) [13:40:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910768 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [13:40:55] the prometheus/puppet alerts is me btw [13:41:16] 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2002.wikimedia.org (A1) - https://phabricator.wikimedia.org/T336258 (10Jelto) Thanks for the quick response! I can confirm, disks are available. Thank you! [13:41:32] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet - https://phabricator.wikimedia.org/T336261 (10Jclark-ctr) 05Open→03Resolved @btullis raid battery has been replaced and is booting up now [13:41:33] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [13:41:35] (03Merged) 10jenkins-bot: Add $wmgUseRealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910768 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [13:41:36] RECOVERY - puppet last run on prometheus2005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:42:04] !log taavi@deploy1002 Started scap: Backport for [[gerrit:910768|Add $wmgUseRealMe (T324535)]] [13:42:12] (03CR) 10Volans: "some questions inline looks ok otherwise" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:42:28] 10SRE, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10Jhancock.wm) @Papaul The port being used is xe-1/0/25 [13:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P48009 and previous config saved to /var/cache/conftool/dbconfig/20230509-134244-ladsgroup.json [13:42:54] !log sudo cumin -b1 -s1200 'A:cp and A:esams' 'varnish-frontend-restart: T253093 [13:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:58] T253093: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 [13:43:10] RECOVERY - puppet last run on prometheus4002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:43:38] !log taavi@deploy1002 taavi: Backport for [[gerrit:910768|Add $wmgUseRealMe (T324535)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:43:41] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [13:43:42] RECOVERY - puppet last run on prometheus3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:44:35] !log rearmed keyholder on netmon* post reboot [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:39] (KeyholderUnarmed) resolved: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:49:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2180 T336031', diff saved to https://phabricator.wikimedia.org/P48010 and previous config saved to /var/cache/conftool/dbconfig/20230509-134921-root.json [13:49:25] T336031: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 [13:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T335845)', diff saved to https://phabricator.wikimedia.org/P48011 and previous config saved to /var/cache/conftool/dbconfig/20230509-134929-ladsgroup.json [13:49:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:49:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T335845)', diff saved to https://phabricator.wikimedia.org/P48012 and previous config saved to /var/cache/conftool/dbconfig/20230509-134952-ladsgroup.json [13:49:54] (03PS1) 10Marostegui: db2180: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/917890 (https://phabricator.wikimedia.org/T336031) [13:49:56] !log taavi@deploy1002 Finished scap: 
Backport for [[gerrit:910768|Add $wmgUseRealMe (T324535)]] (duration: 07m 51s) [13:50:00] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [13:50:10] * taavi done [13:50:53] (03CR) 10Marostegui: [C: 03+2] db2180: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/917890 (https://phabricator.wikimedia.org/T336031) (owner: 10Marostegui) [13:51:44] (03CR) 10Andrew Bogott: [C: 03+2] git-sync-upstream: use --rebase-merges instead of --preserve-merges [puppet] - 10https://gerrit.wikimedia.org/r/917888 (https://phabricator.wikimedia.org/T336263) (owner: 10Andrew Bogott) [13:51:47] taavi: thanks! [13:51:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10BTullis) [13:51:52] no more deployments for this period? [13:51:54] jouncebot: now [13:51:54] For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1300) [13:51:54] For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1300) [13:52:08] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui) a:03Papaul @Papaul db2180 is all yours. 
[13:56:42] (03PS79) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [13:57:17] (03CR) 10CI reject: [V: 04-1] puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:57:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T335845)', diff saved to https://phabricator.wikimedia.org/P48013 and previous config saved to /var/cache/conftool/dbconfig/20230509-135750-ladsgroup.json [13:57:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [13:58:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [13:58:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T335845)', diff saved to https://phabricator.wikimedia.org/P48014 and previous config saved to /var/cache/conftool/dbconfig/20230509-135815-ladsgroup.json [13:59:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T335845)', diff saved to https://phabricator.wikimedia.org/P48015 and previous config saved to /var/cache/conftool/dbconfig/20230509-135915-ladsgroup.json [13:59:58] (03PS17) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [14:00:02] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:00:05] sukhe: OwO what's this, a deployment window?? LVS maintenance. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1400). 
nyaa~ [14:00:33] (03CR) 10Jbond: [C: 03+2] utils: terminate flags to prevent ambiguity [puppet] - 10https://gerrit.wikimedia.org/r/917862 (owner: 10Jbond) [14:00:38] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [14:01:56] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [14:03:28] (03CR) 10David Caro: [C: 03+1] "LGTM 🎉" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917880 (owner: 10Majavah) [14:05:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T335845)', diff saved to https://phabricator.wikimedia.org/P48016 and previous config saved to /var/cache/conftool/dbconfig/20230509-140535-ladsgroup.json [14:05:48] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [14:07:08] (03CR) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [14:08:23] bgp alerts in codfw and lvs2008 alerts expected [14:08:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:08:46] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 [14:08:50] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [14:09:15] (03CR) 10Majavah: [C: 03+2] tox: Add mypy [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/917880 (owner: 10Majavah) [14:10:40] (03Merged) 10jenkins-bot: tox: Add mypy [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/917880 (owner: 10Majavah) [14:11:36] PROBLEM - pybal on lvs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:12:12] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:12:20] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:23] all expected [14:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:14:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P48017 and previous config saved to /var/cache/conftool/dbconfig/20230509-141421-ladsgroup.json [14:14:22] PROBLEM - PyBal connections to etcd on lvs2008 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [14:14:43] (03PS18) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [14:15:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2005.codfw.wmnet with OS bookworm [14:15:21] !log set routing-options static route 208.80.153.240/28 next-hop 10.192.49.7 [move static route for high-traffic2 to lvs2010]: T335777 [14:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:25] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [14:16:24] (03CR) 
10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [14:18:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [14:20:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P48018 and previous config saved to /var/cache/conftool/dbconfig/20230509-142044-ladsgroup.json [14:22:50] (03CR) 10Ebernhardson: search: Add alert based on age of titlesuggest indices (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [14:23:15] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2180'] [14:24:00] 10SRE, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10aborrero) Hostname is the same, but domain is changing, from `cloudcontrol2001-dev.wikimedia.org` to `cloudcontrol2001-dev.codfw.wmn... 
[14:24:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2180'] [14:24:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] haproxy: check_haproxy: introduce new check mode --check=someup [puppet] - 10https://gerrit.wikimedia.org/r/917329 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [14:25:50] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:26:16] (03PS19) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [14:26:18] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (10Andrew) 05Open→03Resolved [14:26:43] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [14:27:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:08] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [14:28:32] (Access port speed <= 100Mbps) firing: (2) Alert for device asw-c-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [14:29:00] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host backup1011 [14:29:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1011 [14:29:15] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host backup1010 [14:29:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1010 [14:29:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P48019 and previous config saved to /var/cache/conftool/dbconfig/20230509-142927-ladsgroup.json [14:31:26] 10SRE, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10aborrero) >>! In T336236#8837014, @Jhancock.wm wrote: > @Papaul The port being used is xe-1/0/25 In case is relevant, before the de... [14:32:03] !log decommission lvs2008 [14:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:10] (03CR) 10Bking: [C: 03+2] flink-session-cluster: enable rocksdb metrics and increase jvm heap [deployment-charts] - 10https://gerrit.wikimedia.org/r/917820 (https://phabricator.wikimedia.org/T336134) (owner: 10DCausse) [14:32:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs2008.codfw.wmnet [14:33:05] (03Merged) 10jenkins-bot: flink-session-cluster: enable rocksdb metrics and increase jvm heap [deployment-charts] - 10https://gerrit.wikimedia.org/r/917820 (https://phabricator.wikimedia.org/T336134) (owner: 10DCausse) [14:33:47] (03PS16) 10Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) [14:35:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P48020 and previous config saved to /var/cache/conftool/dbconfig/20230509-143550-ladsgroup.json [14:36:25] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:37:20] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Jhancock.wm) @Marostegui I can do it today in the next two hours or tomorrow after 21:00 UTC [14:37:38] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:37:53] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:41:49] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:41:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [14:42:26] (03CR) 10Filippo Giunchedi: [C: 03+2] coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [14:42:29] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [14:42:55] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) The process to migrate involves a rsync running a chroot which is thus unable to do user/group id mapping between the h... 
[14:43:34] arturo: merging your change too [14:43:49] godog: yes, thanks [14:43:56] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (10Andrew) 05Resolved→03Open That patch gets us the much-more-helpful ` fatal: error: cannot combine '--rebase-merges' with '--strategy-option' ` [14:44:18] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:44:32] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10jcrespo) @Jhancock.wm host is now down, downtimed for 24 hours. [14:44:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T335845)', diff saved to https://phabricator.wikimedia.org/P48021 and previous config saved to /var/cache/conftool/dbconfig/20230509-144433-ladsgroup.json [14:44:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [14:44:41] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:44:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [14:44:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T335845)', diff saved to https://phabricator.wikimedia.org/P48022 and previous config saved to /var/cache/conftool/dbconfig/20230509-144457-ladsgroup.json [14:44:59] (03PS1) 10Andrew Bogott: git-sync-upstream.py: remove --rebase-merges git flag [puppet] - 10https://gerrit.wikimedia.org/r/917896 (https://phabricator.wikimedia.org/T336263) [14:45:32] (03PS12) 10Filippo Giunchedi: coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [14:45:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:45:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs2008.codfw.wmnet [14:45:50] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2008.codfw.wmnet` - lvs2008.codfw.wmnet (**WARN**) - Downtimed ho... 
[14:46:10] (03CR) 10Majavah: [C: 03+1] git-sync-upstream.py: remove --rebase-merges git flag [puppet] - 10https://gerrit.wikimedia.org/r/917896 (https://phabricator.wikimedia.org/T336263) (owner: 10Andrew Bogott) [14:46:22] (03CR) 10Dzahn: [C: 03+1] miscweb annualreport: update redirect for 2022 report [puppet] - 10https://gerrit.wikimedia.org/r/917814 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [14:46:29] (03CR) 10Ssingh: [C: 03+2] lvs2008: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/917882 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:46:35] (03CR) 10Herron: [C: 03+1] Combine linkrecommendation SLO metrics into one cross-datacenter value [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/916680 (https://phabricator.wikimedia.org/T278083) (owner: 10RLazarus) [14:46:55] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2180'] [14:47:08] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) [14:47:11] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs2008 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/917885 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:47:15] (03CR) 10Andrew Bogott: [C: 03+2] git-sync-upstream.py: remove --rebase-merges git flag [puppet] - 10https://gerrit.wikimedia.org/r/917896 (https://phabricator.wikimedia.org/T336263) (owner: 10Andrew Bogott) [14:47:42] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [14:48:47] (03PS1) 10Jbond: puppetmaster: refactor to work with no root user [puppet] - 10https://gerrit.wikimedia.org/r/917897 (https://phabricator.wikimedia.org/T152059) [14:48:49] 
(RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:49:22] (03CR) 10CI reject: [V: 04-1] puppetmaster: refactor to work with no root user [puppet] - 10https://gerrit.wikimedia.org/r/917897 (https://phabricator.wikimedia.org/T152059) (owner: 10Jbond) [14:49:38] (03CR) 10Btullis: Create scap deployment source for product analytics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/912834 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [14:49:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:50:12] !log homer "cr*-codfw*" commit "Gerrit: 917885 remove decommissioned host lvs2008" [14:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T335845)', diff saved to https://phabricator.wikimedia.org/P48023 and previous config saved to /var/cache/conftool/dbconfig/20230509-145057-ladsgroup.json [14:51:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [14:51:17] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Jhancock.wm) @jcrespo thanks! I've moved A6 to b6. I guess we put it back in and see if it throws another error. 
[14:51:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [14:51:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:51:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:51:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T335845)', diff saved to https://phabricator.wikimedia.org/P48024 and previous config saved to /var/cache/conftool/dbconfig/20230509-145128-ladsgroup.json [14:51:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T335845)', diff saved to https://phabricator.wikimedia.org/P48025 and previous config saved to /var/cache/conftool/dbconfig/20230509-145133-ladsgroup.json [14:51:36] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh) [14:52:08] (03CR) 10Hnowlan: [C: 03+2] svg: set LC_ALL instead of LANG [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/917861 (https://phabricator.wikimedia.org/T335361) (owner: 10Hnowlan) [14:52:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:46] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [14:53:44] (03PS20) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [14:54:32] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 45m 45s) [14:54:36] T326767: Q4:rack/setup/install lvs2011, 
lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [14:55:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:55:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:55:46] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [14:56:23] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:26] (03Merged) 10jenkins-bot: svg: set LC_ALL instead of LANG [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/917861 (https://phabricator.wikimedia.org/T335361) (owner: 10Hnowlan) [14:56:40] (03PS6) 10Jsn.sherman: WIP: Log additional click events on Special:Diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) [14:57:25] (03PS7) 10Jsn.sherman: Log additional click events on Special:Diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) [14:57:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T335845)', diff saved to https://phabricator.wikimedia.org/P48026 and previous config saved to /var/cache/conftool/dbconfig/20230509-145752-ladsgroup.json [14:58:03] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10MoritzMuehlenhoff) [14:58:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[14:58:23] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:58:32] (Access port speed <= 100Mbps) firing: (2) Device asw-c-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [15:03:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2180'] [15:05:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Jhancock.wm) @Papaul onboard port 1 cable ID: 12110 onboard port 2 cable ID: 12109 NIC port 1 cable ID: 12108 NIC port 2 cable ID: 12174 [15:06:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P48027 and previous config saved to /var/cache/conftool/dbconfig/20230509-150639-ladsgroup.json [15:06:52] 10SRE, 10Infrastructure-Foundations, 10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10Papaul) [15:06:56] 10SRE, 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Papaul) 05Open→03Resolved @Marostegui Firmware upgrade for BIOS and IDRAC complete [15:08:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:09:56] (03PS1) 10Majavah: hieradata: update restricted bastion url [puppet] - 10https://gerrit.wikimedia.org/r/917902 [15:11:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host testvm2005.codfw.wmnet with OS bookworm [15:12:00] 10SRE, 
10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10Papaul) a:05Papaul→03aborrero @aborrero the move from old switch to new switch is complete. [15:12:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P48028 and previous config saved to /var/cache/conftool/dbconfig/20230509-151258-ladsgroup.json [15:13:07] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) Some are human users that have uid reserved via `modules/admin/data/data.yaml`. The `deploy-*` users are created by Pup... [15:13:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:14:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:15:12] (03PS7) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: introduce cloudlb support [puppet] - 10https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) [15:17:18] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:18:43] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol: allow services to be contacted by all cloudlb HAproxy [puppet] - 10https://gerrit.wikimedia.org/r/917904 (https://phabricator.wikimedia.org/T332153) [15:19:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Add DNS entries for lvs2012 - pt1979@cumin2002" [15:20:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [15:20:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for lvs2012 - pt1979@cumin2002" [15:20:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:31] (03CR) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: introduce cloudlb support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) (owner: 10Arturo Borrero Gonzalez) [15:21:02] jouncebot: next [15:21:02] In 0 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1600) [15:21:10] jouncebot: nowandnext [15:21:10] For the next 0 hour(s) and 38 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1400) [15:21:10] In 0 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1600) [15:21:27] 10SRE, 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui) 05Resolved→03Open @Papaul can you double check the host? It is still not accessible. 
[15:21:30] 10SRE, 10Infrastructure-Foundations, 10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10Marostegui) [15:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P48029 and previous config saved to /var/cache/conftool/dbconfig/20230509-152145-ladsgroup.json [15:22:03] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [15:23:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs2012.mgmt.codfw.wmnet with reboot policy FORCED [15:23:15] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: introduce cloudlb support [puppet] - 10https://gerrit.wikimedia.org/r/899614 (https://phabricator.wikimedia.org/T332153) (owner: 10Arturo Borrero Gonzalez) [15:24:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:25:41] (03PS8) 10Jsn.sherman: beta: log additional click events on Special:Diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) [15:26:04] (03CR) 10Scardenasmolinar: [C: 03+1] beta: log additional click events on Special:Diff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) (owner: 10Jsn.sherman) [15:26:32] (03PS1) 10Ottomata: flink-operator - disable HA replicas for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/917907 (https://phabricator.wikimedia.org/T336185) [15:26:40] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - 
https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:27:07] (03CR) 10Majavah: "this is used just for firewall rules, you could have both setups in there for now" [puppet] - 10https://gerrit.wikimedia.org/r/917904 (https://phabricator.wikimedia.org/T332153) (owner: 10Arturo Borrero Gonzalez) [15:27:21] (03PS1) 10Hashar: ci: in /srv only migrate /srv/jenkins [puppet] - 10https://gerrit.wikimedia.org/r/917908 (https://phabricator.wikimedia.org/T324659) [15:28:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P48030 and previous config saved to /var/cache/conftool/dbconfig/20230509-152804-ladsgroup.json [15:28:41] (03PS2) 10Arturo Borrero Gonzalez: cloudcontrol: allow services to be contacted by all cloudlb HAproxy [puppet] - 10https://gerrit.wikimedia.org/r/917904 (https://phabricator.wikimedia.org/T332153) [15:28:45] (03CR) 10Hashar: "When checking uid/gid being used on contint* hosts, I have found that we do not have to rsync the whole of `/srv` but just `/srv/jenkins` " [puppet] - 10https://gerrit.wikimedia.org/r/917908 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:30:16] (03PS1) 10Arturo Borrero Gonzalez: wikimedia.cloud: add entry for cloudcontrol2001-dev [dns] - 10https://gerrit.wikimedia.org/r/917910 (https://phabricator.wikimedia.org/T336236) [15:30:30] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) + @jnuche who co manages our Jenkins nowadays. This task is to migrate the Jenkins/Zuul/integration website services f... 
[15:30:45] (03PS10) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [15:31:49] (03CR) 10Ottomata: [C: 03+2] flink-operator - disable HA replicas for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/917907 (https://phabricator.wikimedia.org/T336185) (owner: 10Ottomata) [15:32:08] (03CR) 10Arturo Borrero Gonzalez: cloudcontrol: allow services to be contacted by all cloudlb HAproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917904 (https://phabricator.wikimedia.org/T332153) (owner: 10Arturo Borrero Gonzalez) [15:33:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [15:34:00] (03Merged) 10jenkins-bot: flink-operator - disable HA replicas for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/917907 (https://phabricator.wikimedia.org/T336185) (owner: 10Ottomata) [15:35:13] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/917910 (https://phabricator.wikimedia.org/T336236) (owner: 10Arturo Borrero Gonzalez) [15:35:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimedia.cloud: add entry for cloudcontrol2001-dev [dns] - 10https://gerrit.wikimedia.org/r/917910 (https://phabricator.wikimedia.org/T336236) (owner: 10Arturo Borrero Gonzalez) [15:36:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T335845)', diff saved to https://phabricator.wikimedia.org/P48031 and previous config saved to /var/cache/conftool/dbconfig/20230509-153651-ladsgroup.json [15:36:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [15:37:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [15:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T335845)', diff saved to https://phabricator.wikimedia.org/P48032 and previous config saved to /var/cache/conftool/dbconfig/20230509-153715-ladsgroup.json [15:38:32] (Access port speed <= 100Mbps) firing: (2) Alert for device asw-c-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [15:38:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [15:40:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:42:20] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudcontrol2001-dev - 
pt1979@cumin2002" [15:43:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T335845)', diff saved to https://phabricator.wikimedia.org/P48033 and previous config saved to /var/cache/conftool/dbconfig/20230509-154313-ladsgroup.json [15:43:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [15:43:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudcontrol2001-dev - pt1979@cumin2002" [15:43:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [15:43:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T335845)', diff saved to https://phabricator.wikimedia.org/P48034 and previous config saved to /var/cache/conftool/dbconfig/20230509-154338-ladsgroup.json [15:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T335845)', diff saved to https://phabricator.wikimedia.org/P48035 and previous config saved to /var/cache/conftool/dbconfig/20230509-154346-ladsgroup.json [15:44:14] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:13] (03CR) 10RLazarus: [V: 03+2 C: 03+2] Combine linkrecommendation SLO metrics into one cross-datacenter value [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/916680 (https://phabricator.wikimedia.org/T278083) (owner: 10RLazarus) [15:48:00] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts2001.codfw.wmnet with reason: Re-image w/ Bullseye [15:48:23] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on vrts2001.codfw.wmnet with reason: Re-image w/ Bullseye [15:48:50] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:09] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reimage for host vrts2001.codfw.wmnet with OS bullseye [15:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T335845)', diff saved to https://phabricator.wikimedia.org/P48036 and previous config saved to /var/cache/conftool/dbconfig/20230509-155102-ladsgroup.json [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:45] jouncebot: nowandnext [15:54:45] For the next 0 hour(s) and 5 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1400) [15:54:45] In 0 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1600) [15:55:00] (03PS1) 10Cathal Mooney: Explicitly enable Netconf over SSH in default system services [homer/public] - 10https://gerrit.wikimedia.org/r/917914 (https://phabricator.wikimedia.org/T333316) [15:56:21] (03PS1) 10RLazarus: alerting_host: Disable vopsbot in #wikimedia-sre [puppet] - 10https://gerrit.wikimedia.org/r/917915 [15:56:48] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:57:02] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:57:03] (03PS1) 10Hashar: admin: reserve jenkins and zuul uid/gid 
[puppet] - 10https://gerrit.wikimedia.org/r/917916 (https://phabricator.wikimedia.org/T324659) [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P48037 and previous config saved to /var/cache/conftool/dbconfig/20230509-155852-ladsgroup.json [15:59:09] (03PS2) 10RLazarus: alerting_host: Disable vopsbot in #wikimedia-sre [puppet] - 10https://gerrit.wikimedia.org/r/917915 (https://phabricator.wikimedia.org/T329791) [16:00:03] (03PS1) 10Hnowlan: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/917917 (https://phabricator.wikimedia.org/T335361) [16:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1600). [16:00:05] haak: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:01:36] haak: hi! looking [16:01:39] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts2001.codfw.wmnet with reason: host reimage [16:01:54] hi! [16:02:34] haak: do you need the kubernetes and puppet deployments to be ordered in any particular way? 
[16:03:10] (03CR) 10Jbond: [C: 03+2] "thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/917902 (owner: 10Majavah) [16:03:17] rzl I suppose change will be available after both are deployed, so no particular order needed [16:04:18] okay, got it [16:04:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/917914 (https://phabricator.wikimedia.org/T333316) (owner: 10Cathal Mooney) [16:04:43] let's do puppet first then -- you'll be able to test on mwdebug before I roll out to the rest of the fleet, yeah? [16:04:51] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts2001.codfw.wmnet with reason: host reimage [16:05:00] rzl sure, sounds good! [16:05:17] (03CR) 10Kamila Součková: [C: 03+1] thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/917917 (https://phabricator.wikimedia.org/T335361) (owner: 10Hnowlan) [16:06:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P48038 and previous config saved to /var/cache/conftool/dbconfig/20230509-160608-ladsgroup.json [16:06:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs2012.mgmt.codfw.wmnet with reboot policy FORCED [16:07:23] !log stopping puppet on appservers - T230382 [16:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:27] T230382: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 [16:08:43] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/917917 (https://phabricator.wikimedia.org/T335361) (owner: 10Hnowlan) [16:08:45] (03CR) 10RLazarus: [C: 03+2] Handle Canonical URL for EntitySchemas [puppet] - 10https://gerrit.wikimedia.org/r/912327 (https://phabricator.wikimedia.org/T225778) (owner: 10Michael Große) [16:08:49] 
(RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:08:54] !log jnuche@deploy1002 Installing scap version "4.52.1" for 593 hosts [16:09:00] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:09:06] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:09:28] (03Merged) 10jenkins-bot: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/917917 (https://phabricator.wikimedia.org/T335361) (owner: 10Hnowlan) [16:10:30] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:10:36] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:11:30] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:11:39] haak: deployed to mwdebug1001, go ahead and test please [16:12:02] rzl sure, let me test [16:12:16] (03CR) 10Cathal Mooney: [C: 03+2] Explicitly enable Netconf over SSH in default system services [homer/public] - 10https://gerrit.wikimedia.org/r/917914 (https://phabricator.wikimedia.org/T333316) (owner: 10Cathal Mooney) [16:12:32] (03PS1) 10Hashar: zuul: switch to fixed uid/gid 923 [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) [16:12:48] (03Merged) 10jenkins-bot: Explicitly enable Netconf over SSH in default system services [homer/public] - 10https://gerrit.wikimedia.org/r/917914 (https://phabricator.wikimedia.org/T333316) (owner: 10Cathal Mooney) [16:13:00] (03CR) 10CI reject: [V: 04-1] zuul: switch to fixed uid/gid 923 [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [16:13:20] rzl I have the extension and 
everything but seems like nothing changed [16:13:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P48039 and previous config saved to /var/cache/conftool/dbconfig/20230509-161358-ladsgroup.json [16:14:16] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10aborrero) a:05aborrero→03Papaul I got this @Papaul: `lang=shell-session aborrero@cumin2002:~ 1 $ sudo cookbook sre.hosts.reimage --os bullseye --new... [16:14:22] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:14:47] haak: the puppet diff looks correct, let me double-check apache is actually using the new config [16:15:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-coord1002.eqiad.wmnet [16:15:10] rzl sure, thanks a lot! [16:15:39] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:17:39] haak: in the meantime, to verify the extension's doing the right thing, will you check the HTTP headers on your page view and make sure you have "server: mwdebug1001.eqiad.wmnet" [16:18:16] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org reporting "You must GET the form before submitting it" for all list subscription attempts - https://phabricator.wikimedia.org/T185222 (10Aklapper) [16:19:06] rzl just checked, header seems correct [16:19:10] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:19:36] thanks [16:20:30] apache really shouldn't need a restart for this config change, but I can restart it just to rule out anything funny -- otherwise it smells like a bug in the config to me [16:21:11] what url are you hitting? 
I don't have the full backstory on the change but I can at least take a look [16:21:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P48041 and previous config saved to /var/cache/conftool/dbconfig/20230509-162115-ladsgroup.json [16:21:26] rzl https://www.wikidata.org/entity/E10 [16:21:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-coord1002.eqiad.wmnet [16:21:51] it should redirect to https://www.wikidata.org/wiki/EntitySchema:E10 [16:22:30] I checked the config again and seems correct to me, it'd be great if you can restart Apache just to rule out anything funny as you suggested [16:22:31] (03CR) 10Hashar: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [16:22:36] (03CR) 10CI reject: [V: 04-1] zuul: switch to fixed uid/gid 923 [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [16:22:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2012'] [16:22:45] yeah and instead it goes to https://www.wikidata.org/wiki/Special:EntityData/E10, I agree that seems like it ought to work [16:22:47] restarting [16:22:56] thanks! [16:23:26] !log rzl@mwdebug1001:~$ sudo apache2ctl restart [16:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:29] now it works! [16:23:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2012'] [16:24:02] yeah! 
okay, interesting -- apache didn't pick up the new config, for whatever reason [16:24:23] so we can deploy the puppet change everywhere, but if that remains true, it'll need a rolling apache restart as well [16:24:27] it has its reasons, always haha [16:25:05] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2012'] [16:25:06] seems like it'll need yeah [16:26:05] so, in order to do this gradually, I'm going to re-enable puppet and let the change roll out on its usual staggered schedule, over the next 30 minutes [16:26:28] after that I'll do a rolling apache restart -- that'll have to run very gradually too, for safety reasons [16:26:54] sure, no need to rush [16:26:55] obviously we'll be off the end of the puppet request window by then, but the next thing on the schedule is the MW infrastructure window and we can run into that [16:26:59] (03PS3) 10Hashar: zuul: switch to fixed uid/gid 923 [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) [16:27:21] !log resumed puppet on appservers - T230382 [16:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:26] rzl I'm available as long as you need [16:27:26] T230382: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 [16:27:41] ah man I pasted the wrong task number both times [16:27:44] oh well [16:27:53] I'll fix it in the SAL, it'll just have updated the wrong tasks [16:28:34] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [16:28:57] let me know if I need to do anything [16:29:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T335845)', diff saved to https://phabricator.wikimedia.org/P48042 and previous config saved to /var/cache/conftool/dbconfig/20230509-162904-ladsgroup.json [16:29:06] 10SRE, 
10ops-knams, 10DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (10RobH) [16:29:23] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, 10MW-1.40-notes (1.40.0-wmf.24; 2023-02-20): Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10RLazarus) Disregard those last two SAL entries, wrong task number. :) [16:29:29] haak: nope all good -- I'll ping you when it's live [16:29:52] rzl great, thank you for your help! [16:30:06] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:30:11] thank you! [16:30:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [16:31:38] (03CR) 10David Caro: [C: 03+2] hieradata: use port 443 for enc access on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/874894 (owner: 10Majavah) [16:32:09] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudcontrol2001-dev - pt1979@cumin2002" [16:33:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2012'] [16:33:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudcontrol2001-dev - pt1979@cumin2002" [16:33:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2001-dev.codfw.wmnet with OS bullseye [16:33:56] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host 
cloudbackup2001-dev.codfw.wmnet with OS bullseye [16:35:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [16:35:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [16:36:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T335845)', diff saved to https://phabricator.wikimedia.org/P48043 and previous config saved to /var/cache/conftool/dbconfig/20230509-163621-ladsgroup.json [16:36:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:36:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:36:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T335845)', diff saved to https://phabricator.wikimedia.org/P48044 and previous config saved to /var/cache/conftool/dbconfig/20230509-163646-ladsgroup.json [16:37:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:37:45] (03PS4) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 [16:38:00] (03CR) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [16:38:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:39:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 
200 OK - 49993 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:40:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:18] (03PS1) 10Ssingh: lvs2012: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/917922 (https://phabricator.wikimedia.org/T326767) [16:41:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:42:56] (03PS1) 10Majavah: P:toolforge::proxy: uninstall toolsweblogster [puppet] - 10https://gerrit.wikimedia.org/r/917923 [16:43:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T335845)', diff saved to https://phabricator.wikimedia.org/P48045 and previous config saved to /var/cache/conftool/dbconfig/20230509-164307-ladsgroup.json [16:44:33] (03PS1) 10Ssingh: sites.yaml: add new LVS host lvs2012 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/917924 (https://phabricator.wikimedia.org/T326767) [16:45:19] (03PS2) 10Majavah: P:toolforge::proxy: uninstall toolsweblogster [puppet] - 10https://gerrit.wikimedia.org/r/917923 [16:46:43] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup2001-dev.codfw.wmnet with OS bullseye [16:46:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning 
[16:46:50] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudbackup2001-dev.codfw.wmnet with OS bullseye executed wi... [16:48:20] (03PS3) 10Majavah: P:toolforge::proxy: uninstall toolsweblogster [puppet] - 10https://gerrit.wikimedia.org/r/917923 [16:49:58] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41094/console" [puppet] - 10https://gerrit.wikimedia.org/r/917923 (owner: 10Majavah) [16:51:04] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:51:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:52:42] (03PS1) 10Urbanecm: [Growth] Add mediawiki.mentor_dashboard.personalized_praise stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917925 [16:53:38] (03PS1) 10Ssingh: hiera: remove BGP MED override for lvs2012 [puppet] - 10https://gerrit.wikimedia.org/r/917926 (https://phabricator.wikimedia.org/T326767) [16:54:06] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudcontrol2001-dev - pt1979@cumin2002" [16:54:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [16:54:40] jouncebot: nowandnext [16:54:41] For the next 0 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1600) [16:54:41] In 0 hour(s) and 5 minute(s): MediaWiki infrastucture (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1700) [16:55:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for cloudcontrol2001-dev - pt1979@cumin2002" [16:55:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:55:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [16:56:12] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [16:57:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [16:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P48046 and previous config saved to /var/cache/conftool/dbconfig/20230509-165813-ladsgroup.json [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1700) [17:00:05] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo [17:00:28] !log brett@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on A:cp-text_ulsfo [17:00:33] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo [17:08:24] (03PS13) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [17:08:28] (03PS1) 10Papaul: Update cloudcontrol2001 in netboot.cfg file [puppet] - 
10https://gerrit.wikimedia.org/r/917928 (https://phabricator.wikimedia.org/T336236) [17:08:39] (03CR) 10CI reject: [V: 04-1] Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (owner: 10KartikMistry) [17:09:17] (03CR) 10Bking: [C: 03+2] airflow: decommission an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/917343 (https://phabricator.wikimedia.org/T333697) (owner: 10Bking) [17:09:34] (03CR) 10Papaul: [C: 03+2] Update cloudcontrol2001 in netboot.cfg file [puppet] - 10https://gerrit.wikimedia.org/r/917928 (https://phabricator.wikimedia.org/T336236) (owner: 10Papaul) [17:11:42] !log rolling restart apache on codfw appservers T225778 [17:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:46] T225778: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 [17:12:29] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:13:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P48047 and previous config saved to /var/cache/conftool/dbconfig/20230509-171320-ladsgroup.json [17:13:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:13:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2012.codfw.wmnet with OS bullseye [17:13:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye completed... 
[17:17:58] !log rolling restart apache on eqiad appservers T225778 [17:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:01] T225778: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 [17:19:28] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10BTullis) [17:20:54] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo [17:24:17] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dzahn) @Dwisehaupt @Jgreen It looks like this should be resolved now.... [17:24:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:25:06] (03PS1) 10Xcollazo: Don't expose sensitive config to Airflow UI users. [puppet] - 10https://gerrit.wikimedia.org/r/917929 (https://phabricator.wikimedia.org/T315450) [17:25:34] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dwisehaupt) @Dzahn Thanks. I'll try and test it later today or tomorrow. [17:26:06] (03PS1) 10AOkoth: site: change vrts2001 role [puppet] - 10https://gerrit.wikimedia.org/r/917930 (https://phabricator.wikimedia.org/T323515) [17:28:08] (03CR) 10Dzahn: [C: 03+1] "This makes sense since the prod role doesn't work on the first run yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/917930 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [17:28:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T335845)', diff saved to https://phabricator.wikimedia.org/P48048 and previous config saved to /var/cache/conftool/dbconfig/20230509-172826-ladsgroup.json [17:28:57] !log aokoth@cumin1001 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host vrts2001.codfw.wmnet with OS bullseye [17:29:18] haak: you're live everywhere except on kubernetes, starting that deployment now [17:29:41] (03CR) 10AOkoth: [C: 03+2] site: change vrts2001 role [puppet] - 10https://gerrit.wikimedia.org/r/917930 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [17:29:43] rzl great, thanks a lot! [17:29:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:30:08] (03CR) 10RLazarus: [C: 03+2] Handle Canonical URL for EntitySchemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/912326 (https://phabricator.wikimedia.org/T225778) (owner: 10Michael Große) [17:31:08] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts2001.codfw.wmnet with reason: Re-image w/ Bullseye [17:31:10] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts2001.codfw.wmnet with reason: Re-image w/ Bullseye [17:31:14] (03Merged) 10jenkins-bot: Handle Canonical URL for EntitySchemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/912326 (https://phabricator.wikimedia.org/T225778) (owner: 10Michael Große) [17:31:22] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reimage for host vrts2001.codfw.wmnet with OS bullseye [17:40:21] (03CR) 10Dzahn: [C: 03+2] "thanks! makes sense. 
and there is no fully automatic syncing involved. merging." [puppet] - 10https://gerrit.wikimedia.org/r/917908 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [17:42:08] rzl will you ping me when k8s is done as well? [17:42:09] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:42:34] haak: will do [17:42:41] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:42:46] rzl thanks a lot! [17:42:56] doing the k8s mwdebug first, so you can test with the extension again in just a sec [17:43:04] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:43:33] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:43:50] haak: okay, go ahead and try with the mwdebug extension set to "k8s-experimental" [17:44:43] rzl I have server:mediawiki-pinkunicorn-56bcf7dc88-6ftq5 in the response header [17:44:49] and it works [17:44:59] perfect, thanks! next update will be when it's all done [17:45:12] great! 
[17:45:39] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:46:12] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:46:13] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:46:31] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts2001.codfw.wmnet with reason: host reimage [17:46:36] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:46:45] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:47:11] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:47:12] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:47:23] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo [17:47:48] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:48:06] !log rzl@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [17:48:06] !log rzl@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [17:48:15] !log rzl@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [17:48:16] !log rzl@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [17:48:24] !log rzl@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [17:48:24] !log rzl@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [17:48:36] !log rzl@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [17:48:37] !log rzl@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [17:48:44] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:49:07] (03CR) 10Dzahn: "this makes sense, it's the fix for long-term." 
[puppet] - 10https://gerrit.wikimedia.org/r/917916 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [17:49:14] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:49:15] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:49:27] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts2001.codfw.wmnet with reason: host reimage [17:49:45] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:50:35] haak: all set, it's everywhere [17:50:59] rzl thank you so much for all the work! [17:51:26] rzl have a lovely evening! [17:51:33] no worries, sorry it ended up taking so long! let me know if you need anything else, otherwise have a good day [17:53:48] (03CR) 10Majavah: puppetserver: add puppetserver module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [17:55:05] (03CR) 10Xcollazo: "PR that introduced this feature includes `sql_alchemy_conn`: https://github.com/apache/airflow/pull/25346/files#diff-2ab05df659c739ff335a4" [puppet] - 10https://gerrit.wikimedia.org/r/917929 (https://phabricator.wikimedia.org/T315450) (owner: 10Xcollazo) [17:55:38] (03CR) 10Xcollazo: [V: 03+1] "Puppet is happy with changes: https://puppet-compiler.wmflabs.org/output/917929/41095/" [puppet] - 10https://gerrit.wikimedia.org/r/917929 (https://phabricator.wikimedia.org/T315450) (owner: 10Xcollazo) [17:58:43] (03PS2) 10Majavah: wmnet: Remove nfs-tools-project.svc.eqiad [dns] - 10https://gerrit.wikimedia.org/r/907136 (https://phabricator.wikimedia.org/T333477) [17:59:14] 10SRE, 10Content-Transform-Team-WIP, 10RESTBase, 10Traffic, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10Kappakayala) @KOfori, Could you please have someone from your team to help with consultation. Based on my chat with Frantz,... 
[18:00:05] hashar and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1800). [18:00:58] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/914268 (owner: 10Majavah) [18:01:41] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host vrts2001.codfw.wmnet with OS bullseye [18:02:05] 10SRE, 10Content-Transform-Team-WIP, 10RESTBase, 10Traffic, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10KOfori) @Kappakayala indeed. Had a quick chat earlier with @FJoseph-WMF and briefly with the team. We'll set something up to... [18:04:10] (03CR) 10Ottomata: [C: 03+2] Don't expose sensitive config to Airflow UI users. [puppet] - 10https://gerrit.wikimedia.org/r/917929 (https://phabricator.wikimedia.org/T315450) (owner: 10Xcollazo) [18:04:51] (03CR) 10Ottomata: [C: 03+2] "Merged! I think (after puppet runs everywhere in like 30 mins), airflow instances will need restarted?" 
[puppet] - 10https://gerrit.wikimedia.org/r/917929 (https://phabricator.wikimedia.org/T315450) (owner: 10Xcollazo) [18:06:25] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo [18:07:51] PROBLEM - Check systemd state on dbstore1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@staging.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:17:24] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Brycehughes) @akosiaris Fair enough. I also highly doubt it's anything MITM related. There is something weird going on, but it's spurious, so... [18:21:15] RECOVERY - OSPF status on lsw1-e1-eqiad.mgmt is OK: OSPFv2: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:22:20] (03PS5) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [18:22:27] (03CR) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [18:23:07] 10SRE, 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Papaul) @Marostegui i think firmware cookbook upgrade power the server down but i power it back up. 
[18:23:32] (Access port speed <= 100Mbps) firing: (2) Device asw-c-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:28:42] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin [18:29:14] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) 05Open→03Resolved Applied to all devices now. [18:29:18] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [18:35:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [18:44:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [18:44:41] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed w... 
[18:45:17] jouncebot: nowandnext [18:45:17] For the next 1 hour(s) and 14 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T1800) [18:45:17] In 1 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T2000) [18:45:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [18:45:32] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [18:50:08] (03PS1) 10Ottomata: page_content_change - use mwapi-async envoy listener for MW api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) [18:50:48] (03CR) 10CI reject: [V: 04-1] page_content_change - use mwapi-async envoy listener for MW api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) (owner: 10Ottomata) [18:51:25] (03PS1) 10Jdrewniak: Add padding to limited-width toggle to account for close icon [skins/Vector] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/917733 (https://phabricator.wikimedia.org/T336274) [18:51:48] (03PS1) 10Jdrewniak: Add padding to limited-width toggle to account for close icon [skins/Vector] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917734 (https://phabricator.wikimedia.org/T336274) [18:52:21] (03PS2) 10Ottomata: page_content_change - use mwapi-async envoy listener for MW api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) [18:52:56] (03CR) 10CI reject: [V: 04-1] page_content_change - use mwapi-async envoy listener for MW 
api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) (owner: 10Ottomata) [18:53:08] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dzahn) a:03Dwisehaupt Cool! Feel free to reach out to me if you hav... [18:55:44] (03PS3) 10Ottomata: page_content_change - use mwapi-async envoy listener for MW api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) [18:57:30] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin [18:59:09] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Dzahn) 05Open→03Stalled Setting to stalled as I don't think there is anyone working on this n... [19:01:07] (03PS4) 10Ottomata: page_content_change - use mwapi-async envoy listener for MW api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) [19:05:01] (03Abandoned) 10Ottomata: Add a kafka consumer group to flink-app instance. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/887792 (https://phabricator.wikimedia.org/T329061) (owner: 10Gmodena) [19:06:05] (03CR) 10Ottomata: [C: 03+2] page_content_change - use mwapi-async envoy listener for MW api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) (owner: 10Ottomata) [19:06:49] (03Merged) 10jenkins-bot: page_content_change - use mwapi-async envoy listener for MW api requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/917934 (https://phabricator.wikimedia.org/T333575) (owner: 10Ottomata) [19:07:42] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin [19:08:31] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:08:34] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:10:09] (03PS1) 10Bking: rdf-streaming-updater: Increase task manager memory alloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/917935 (https://phabricator.wikimedia.org/T336134) [19:14:01] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:39] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:09] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Aklapper) [19:19:16] (03PS1) 10Ryan Kemper: [WIP] wdqs: try calculation=mean [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 [19:24:11] (03PS3) 10Andrew Bogott: O:wmcs::nfs: delete old primary role files [puppet] - 10https://gerrit.wikimedia.org/r/914269 
(owner: 10Majavah) [19:29:18] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10bking) [19:31:32] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10bking) [19:32:57] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10bking) [19:33:50] (03PS2) 10Ryan Kemper: wdqs: remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 [19:34:22] (03PS5) 10CDanis: add tunnelencabulator [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) [19:34:34] (03CR) 10CDanis: add tunnelencabulator (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [19:34:36] (03PS3) 10Ryan Kemper: wdqs: remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 [19:34:54] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin [19:35:27] (03CR) 10CDanis: "Let me know if you'd rather update debian/changelog and cut a new release, or if I should 😊" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [19:36:20] (03PS4) 10Ryan Kemper: wdqs: remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 [19:39:03] (03PS5) 10Ryan Kemper: wdqs: remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 [19:44:27] (03CR) 10Andrew Bogott: [C: 03+2] O:wmcs::nfs: delete old primary role files [puppet] - 10https://gerrit.wikimedia.org/r/914269 (owner: 10Majavah) [19:46:31] (03PS6) 10Ryan Kemper: wdqs: 
remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 [19:47:19] (03CR) 10Andrew Bogott: [C: 03+2] labstore: remove unused files [puppet] - 10https://gerrit.wikimedia.org/r/914272 (owner: 10Majavah) [19:47:25] (03CR) 10Ebernhardson: [C: 03+1] wdqs: remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 (owner: 10Ryan Kemper) [19:54:18] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs [19:55:37] (03CR) 10Xcollazo: [V: 03+1] "Thanks for the review @otto!" [puppet] - 10https://gerrit.wikimedia.org/r/917929 (https://phabricator.wikimedia.org/T315450) (owner: 10Xcollazo) [19:57:16] (03PS7) 10Ryan Kemper: wdqs: remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 [19:58:15] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: remove unneeded avg function [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/917938 (owner: 10Ryan Kemper) [19:58:41] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230509T2000). Please do the needful. [20:00:06] jan_drewniak and urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:00:19] i can deploy today! [20:00:25] hey jan_drewniak, how are you? 
[20:00:36] urbanecm: o/ [20:00:39] (03CR) 10Urbanecm: [C: 03+2] Add padding to limited-width toggle to account for close icon [skins/Vector] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/917733 (https://phabricator.wikimedia.org/T336274) (owner: 10Jdrewniak) [20:00:45] (03CR) 10Urbanecm: [C: 03+2] Add padding to limited-width toggle to account for close icon [skins/Vector] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917734 (https://phabricator.wikimedia.org/T336274) (owner: 10Jdrewniak) [20:00:57] +2'ed backports. I'll sync my config out now, and ping you once backports can be tested! [20:01:06] urbanecm: thanks! [20:01:31] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [20:01:38] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed w... 
[20:01:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917925 (owner: 10Urbanecm) [20:02:41] (03Merged) 10jenkins-bot: [Growth] Add mediawiki.mentor_dashboard.personalized_praise stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917925 (owner: 10Urbanecm) [20:03:11] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:917925|[Growth] Add mediawiki.mentor_dashboard.personalized_praise stream]] [20:10:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:917925|[Growth] Add mediawiki.mentor_dashboard.personalized_praise stream]] (duration: 07m 26s) [20:10:48] okay, config patch deployed [20:12:23] urbanecm: any chance you could merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/908337 it's just cleanup of a no longer used config [20:12:31] sure thing arlolra [20:12:38] do you want a chance to test it at a debug server? [20:12:44] nope [20:12:56] okay, let's do it then [20:13:02] thank you [20:13:18] (03PS2) 10Urbanecm: Remove unused parsoidSettings, nativeGalleryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908337 (owner: 10Arlolra) [20:13:22] (03CR) 10Urbanecm: [C: 03+2] Remove unused parsoidSettings, nativeGalleryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908337 (owner: 10Arlolra) [20:14:01] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:07] (03Merged) 10jenkins-bot: Remove unused parsoidSettings, nativeGalleryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908337 (owner: 10Arlolra) [20:14:38] deploying [20:14:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:908337|Remove unused parsoidSettings, nativeGalleryEnabled]] [20:15:38] (03CR) 10Eevans: [C: 03+1] Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [20:18:47] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:03] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs [20:19:05] (03Merged) 10jenkins-bot: Add padding to limited-width toggle to account for close icon [skins/Vector] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/917733 (https://phabricator.wikimedia.org/T336274) (owner: 10Jdrewniak) [20:19:11] (03Merged) 10jenkins-bot: Add padding to limited-width toggle to account for close icon [skins/Vector] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917734 (https://phabricator.wikimedia.org/T336274) (owner: 10Jdrewniak) [20:19:18] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs [20:22:09] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:908337|Remove unused parsoidSettings, nativeGalleryEnabled]] (duration: 07m 11s) [20:22:21] arlolra: should be deployed! 
[20:22:28] jan_drewniak: your patch is next :) [20:22:28] many thanks [20:22:31] np [20:22:52] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:917733|Add padding to limited-width toggle to account for close icon (T336274)]], [[gerrit:917734|Add padding to limited-width toggle to account for close icon (T336274)]] [20:22:55] T336274: Limited-width toggle close button overlaps with text - https://phabricator.wikimedia.org/T336274 [20:24:13] Sounds good [20:24:25] !log urbanecm@deploy1002 urbanecm and jdrewniak: Backport for [[gerrit:917733|Add padding to limited-width toggle to account for close icon (T336274)]], [[gerrit:917734|Add padding to limited-width toggle to account for close icon (T336274)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:24:34] jan_drewniak: your patch is at mwdebug1002 now. can you test? [20:26:18] urbanecm: ok, looks good to sync [20:26:23] great, syncing [20:28:20] (03CR) 10Muehlenhoff: Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [20:28:25] (03PS1) 10Bking: sre.elasticsearch.ban: Improve error message [cookbooks] - 10https://gerrit.wikimedia.org/r/917944 (https://phabricator.wikimedia.org/T331303) [20:31:51] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:917733|Add padding to limited-width toggle to account for close icon (T336274)]], [[gerrit:917734|Add padding to limited-width toggle to account for close icon (T336274)]] (duration: 08m 59s) [20:31:55] T336274: Limited-width toggle close button overlaps with text - https://phabricator.wikimedia.org/T336274 [20:32:17] jan_drewniak: should be deployed [20:32:19] anything else, anyone? [20:33:08] urbanecm: great! 
Thanks [20:33:13] no problem [20:38:16] (03PS1) 10Ottomata: page_content_change - bump image version to v1.15.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/917948 (https://phabricator.wikimedia.org/T333575) [20:40:40] clear [20:41:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [20:41:58] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [20:42:08] (03CR) 10Ottomata: [C: 03+2] page_content_change - bump image version to v1.15.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/917948 (https://phabricator.wikimedia.org/T333575) (owner: 10Ottomata) [20:42:36] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs [20:42:38] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [20:42:49] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed w... 
[20:43:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [20:43:21] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [20:45:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2096.codfw.wmnet with reason: Maintenance [20:45:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2096.codfw.wmnet with reason: Maintenance [20:46:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2096 (T335845)', diff saved to https://phabricator.wikimedia.org/P48049 and previous config saved to /var/cache/conftool/dbconfig/20230509-204604-ladsgroup.json [20:50:21] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:50:25] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:52:16] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams [20:52:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T335845)', diff saved to https://phabricator.wikimedia.org/P48050 and previous config saved to /var/cache/conftool/dbconfig/20230509-205249-ladsgroup.json [21:07:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P48051 and previous config saved to /var/cache/conftool/dbconfig/20230509-210755-ladsgroup.json [21:11:08] (03PS6) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [21:11:51] (03CR) 10CI reject: [V: 04-1] New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [21:13:32] (03PS7) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [21:14:09] (03CR) 10CI reject: [V: 04-1] New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [21:16:32] (03PS8) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [21:16:35] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:17:04] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams [21:17:40] (03CR) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [21:18:03] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:19:10] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams [21:23:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P48052 and previous config saved to /var/cache/conftool/dbconfig/20230509-212302-ladsgroup.json [21:24:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:24:53] PROBLEM - PHP7 rendering on mw1460 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:25:52] (03PS9) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [21:26:21] RECOVERY - PHP7 rendering on mw1460 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 2.812 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:27:15] PROBLEM - Disk space on ms-be2042 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not 
accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2042&var-datasource=codfw+prometheus/ops [21:29:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:12] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10jhathaway) @MoritzMuehlenhoff I think this updated plan makes sense, my only concern is with our use of `mirrormode on`. I don't have a good understanding of how mirror mode in... [21:29:25] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:52] (03PS10) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) [21:31:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:06] (03CR) 10Ebernhardson: [C: 03+1] sre.elasticsearch.ban: Improve error message [cookbooks] - 10https://gerrit.wikimedia.org/r/917944 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [21:36:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:37:56] (03CR) 10Bking: [C: 03+2] sre.elasticsearch.ban: Improve error message [cookbooks] - 10https://gerrit.wikimedia.org/r/917944 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [21:38:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T335845)', diff saved to https://phabricator.wikimedia.org/P48053 and previous config saved to /var/cache/conftool/dbconfig/20230509-213808-ladsgroup.json [21:38:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [21:38:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [21:38:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T335845)', diff saved to https://phabricator.wikimedia.org/P48054 and previous config saved to /var/cache/conftool/dbconfig/20230509-213834-ladsgroup.json [21:41:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:42:05] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams [21:43:19] PROBLEM - PHP7 jobrunner on mw1460 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:43:55] ACKNOWLEDGEMENT - SSH on wcqs1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King Investigating now https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:43:55] ACKNOWLEDGEMENT - Host wcqs1002 is DOWN: PING CRITICAL - Packet loss = 100% Brian_King Investigating now [21:44:17] RECOVERY - Host wcqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [21:45:20] !log pt1979@cumin2002 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [21:45:27] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye executed w... [21:46:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:46:13] (SystemdUnitFailed) firing: nginx.service Failed on wcqs1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:39] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Ladsgroup) According to https://meta.wikimedia.org/wiki/Mailing_lists/Standardization the address of that mailing list will be wikiquote-en-admins@lists.wikimedia.org I need a second ad... 
[21:46:44] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Ladsgroup) a:03Ladsgroup [21:48:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T335845)', diff saved to https://phabricator.wikimedia.org/P48055 and previous config saved to /var/cache/conftool/dbconfig/20230509-214827-ladsgroup.json [21:49:15] PROBLEM - PHP7 rendering on mw1460 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:50:26] (03CR) 10Hashar: "Note that this change is merely to reserve uid/gid :]" [puppet] - 10https://gerrit.wikimedia.org/r/917916 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [21:51:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:51:13] (SystemdUnitFailed) resolved: nginx.service Failed on wcqs1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:52:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wcqs1002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:54:17] RECOVERY - PHP7 jobrunner on mw1460 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.512 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:56:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed 
probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:57:33] PROBLEM - PHP7 jobrunner on mw1460 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:01:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P48056 and previous config saved to /var/cache/conftool/dbconfig/20230509-220333-ladsgroup.json [22:05:33] PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:06:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:06:21] !log bking@wcqs1002 depool wcqs1002 while it catches up on lag [22:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:42] (03CR) 10Andrew Bogott: [C: 03+2] P::ldap::client::labs: drop support for production [puppet] - 10https://gerrit.wikimedia.org/r/914270 (owner: 10Majavah) [22:07:01] RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.885 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:07:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:09:57] RECOVERY - PHP7 jobrunner on mw1460 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.887 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:11:22] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:13:09] PROBLEM - PHP7 jobrunner on mw1460 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:13:55] PROBLEM - PHP7 rendering on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:14:07] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:29] RECOVERY - PHP7 rendering on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:18:33] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw [22:18:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P48057 and previous 
config saved to /var/cache/conftool/dbconfig/20230509-221840-ladsgroup.json [22:18:51] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:36] (03PS1) 10Papaul: Add cloudcontrol2001-dev with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/917959 (https://phabricator.wikimedia.org/T336236) [22:22:12] (03CR) 10Papaul: [C: 03+2] Add cloudcontrol2001-dev with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/917959 (https://phabricator.wikimedia.org/T336236) (owner: 10Papaul) [22:23:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [22:23:27] 10SRE, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet w... [22:23:32] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [22:28:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage [22:30:35] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, 10wikimediafoundation.org: Update redirect for transparency.wikimedia.org - https://phabricator.wikimedia.org/T336301 (10Varnent) >>! In T336301#8838531, @Herald wrote: > As the #WMF-Legal project tag was added to this task, some general information to avoid wrong ex... 
[22:32:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage [22:33:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T335845)', diff saved to https://phabricator.wikimedia.org/P48058 and previous config saved to /var/cache/conftool/dbconfig/20230509-223346-ladsgroup.json [22:38:47] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw [22:42:41] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw [22:46:37] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:51:49] (03CR) 10Eevans: [C: 03+1] Make a generic Cassandra reboot cookbook, spin off from former aqs cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [22:53:34] RECOVERY - PHP7 jobrunner on mw1460 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:54:14] PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:54:20] PROBLEM - PHP7 rendering on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:55:22] RECOVERY - PHP7 rendering on mw1460 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:56:33] (03PS1) 10Krinkle: Fix oversample naming to match schema. 
[extensions/NavigationTiming] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917736 (https://phabricator.wikimedia.org/T332012) [22:56:56] RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.932 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:57:06] RECOVERY - PHP7 rendering on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.489 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:00:36] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw [23:02:34] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad [23:08:48] (03PS2) 10Dwisehaupt: Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) [23:09:25] (03CR) 10CI reject: [V: 04-1] Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [23:22:49] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad [23:25:23] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad [23:33:48] (03CR) 10Dzahn: [C: 03+1] "I appreciate the detailed reply. Alright, yes, regardless of the details this is definitely a step in the right direction / doesn't hurt. 
" [puppet] - 10https://gerrit.wikimedia.org/r/917916 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [23:35:42] (03CR) 10Dzahn: [C: 03+2] admin: reserve jenkins and zuul uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/917916 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [23:41:01] (03PS7) 10Tim Starling: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [23:42:08] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, 10wikimediafoundation.org: Update redirect for transparency.wikimedia.org - https://phabricator.wikimedia.org/T336301 (10Dzahn) Hi @Varnent this is also my team, like the ticket about redirects for annual.wikimedia.org. That is the #serviceops-collab tag nowadays. T... [23:43:39] 10SRE, 10WMF-Legal, 10serviceops-collab, 10wikimediafoundation.org: Update redirect for transparency.wikimedia.org - https://phabricator.wikimedia.org/T336301 (10Dzahn) [23:43:46] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad [23:46:38] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) a:03darthmon_wmde [23:46:50] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) 05Open→03In progress [23:47:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) p:05Triage→03Medium [23:47:18] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - 
https://phabricator.wikimedia.org/T334154 (10Dzahn) 05Open→03In progress [23:47:26] (03CR) 10Tim Starling: [C: 03+1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [23:47:29] (03CR) 10Aaron Schulz: [C: 03+1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [23:49:52] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Dzahn) 05Open→03In progress p:05Triage→03Medium