[00:00:11] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:29] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:08] (CR) Cwhite: [C: +1] "Overall LGTM." [puppet] - https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[00:35:13] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1080.eqiad.wmnet with OS bullseye
[00:35:23] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye
[01:31:31] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1080.eqiad.wmnet with OS bullseye
[01:31:41] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye executed with errors: - cp1080 (**FAIL**) - Removed from Pu...
[01:38:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1081.eqiad.wmnet with OS bullseye
[01:38:33] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1081.eqiad.wmnet with OS bullseye
[02:00:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1081.eqiad.wmnet with reason: host reimage
[02:03:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1081.eqiad.wmnet with reason: host reimage
[02:10:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:00] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1080.eqiad.wmnet with OS bullseye
[02:23:10] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye
[02:27:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1081.eqiad.wmnet with OS bullseye
[02:27:33] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1081.eqiad.wmnet with OS bullseye
completed: - cp1081 (**PASS**) - Downtimed on Icinga/Alertm...
[02:28:45] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[02:28:54] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1081.eqiad.wmnet,service=cdn
[02:28:55] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1081.eqiad.wmnet,service=ats-be
[02:41:08] (PS1) RLazarus: Comments only: Move the test case format documentation to wikitech [software/httpbb] - https://gerrit.wikimedia.org/r/886202
[02:41:10] (PS1) RLazarus: Cleanup: Drop pre-python3.7 support [software/httpbb] - https://gerrit.wikimedia.org/r/886203
[02:44:22] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1080.eqiad.wmnet with reason: host reimage
[02:47:35] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1080.eqiad.wmnet with reason: host reimage
[03:09:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:09:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1080.eqiad.wmnet with OS bullseye
[03:09:52] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye completed: - cp1080 (**PASS**) - Removed from Puppet and Pu...
[03:10:18] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:12:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:13:34] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:20:28] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1080.eqiad.wmnet
[03:20:49] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[03:21:39] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1083.eqiad.wmnet with OS bullseye
[03:21:48] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1083.eqiad.wmnet with OS bullseye
[03:21:55] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1082.eqiad.wmnet with OS bullseye
[03:22:04] SRE, Traffic,
Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1082.eqiad.wmnet with OS bullseye
[03:43:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1083.eqiad.wmnet with reason: host reimage
[03:43:32] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1082.eqiad.wmnet with reason: host reimage
[03:46:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1083.eqiad.wmnet with reason: host reimage
[03:48:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1082.eqiad.wmnet with reason: host reimage
[04:11:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1082.eqiad.wmnet with OS bullseye
[04:11:15] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1082.eqiad.wmnet with OS bullseye completed: - cp1082 (**PASS**) - Downtimed on Icinga/Alertm...
[04:11:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1083.eqiad.wmnet with OS bullseye
[04:11:43] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1083.eqiad.wmnet with OS bullseye completed: - cp1083 (**PASS**) - Downtimed on Icinga/Alertm...
[04:16:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (install3002, ...), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:24:20] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet
[04:24:25] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1083.eqiad.wmnet
[04:25:05] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[04:25:23] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1085.eqiad.wmnet with OS bullseye
[04:25:28] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1084.eqiad.wmnet with OS bullseye
[04:25:35] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1085.eqiad.wmnet with OS bullseye
[04:25:38] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1084.eqiad.wmnet with OS bullseye
[04:47:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1085.eqiad.wmnet with reason: host reimage
[04:47:08] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1084.eqiad.wmnet with reason: host reimage
[04:47:10] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp1084.eqiad.wmnet with reason: host reimage
[04:50:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1085.eqiad.wmnet with reason: host reimage
[04:53:35] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp1084 is CRITICAL: connect to address 10.64.32.68 and port 80: Connection refused
https://wikitech.wikimedia.org/wiki/Varnish
[04:53:35] PROBLEM - Ensure traffic_manager is running for instance backend on cp1084 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.68: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:55:13] PROBLEM - Check Varnish UDS /run/varnish-frontend-0.socket on cp1084 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.68: Connection reset by peer https://wikitech.wikimedia.org/wiki/Varnish
[04:55:13] PROBLEM - Webrequests Varnishkafka log producer on cp1084 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.68: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[04:55:13] PROBLEM - Ensure traffic_server is running for instance backend on cp1084 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.68: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:56:59] PROBLEM - Check Varnish UDS /run/varnish-frontend-1.socket on cp1084 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.68: Connection reset by peer https://wikitech.wikimedia.org/wiki/Varnish
[04:56:59] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp1084 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.68: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:56:59] PROBLEM - check_trafficserver_backend_config_status on cp1084 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.68: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:57:43] RECOVERY - Ensure traffic_server is running for instance backend on cp1084 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:58:17] RECOVERY - check_trafficserver_backend_config_status on cp1084 is OK: OK: configuration is current
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:58:17] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp1084 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:58:23] RECOVERY - Ensure traffic_manager is running for instance backend on cp1084 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:03:39] PROBLEM - Host cp1084 is DOWN: PING CRITICAL - Packet loss = 100%
[05:06:23] RECOVERY - Host cp1084 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[05:07:29] PROBLEM - Webrequests Varnishkafka log producer on cp1084 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[05:09:03] RECOVERY - Check Varnish UDS /run/varnish-frontend-0.socket on cp1084 is OK: OK: varnish UDS working as expected https://wikitech.wikimedia.org/wiki/Varnish
[05:09:59] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp1084 is OK: HTTP OK: HTTP/1.1 200 OK - 466 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[05:10:01] RECOVERY - Webrequests Varnishkafka log producer on cp1084 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[05:10:01] RECOVERY - Check Varnish UDS /run/varnish-frontend-1.socket on cp1084 is OK: OK: varnish UDS working as expected https://wikitech.wikimedia.org/wiki/Varnish
[05:13:07] PROBLEM - Check systemd state on cp1084 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter@frontend.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1085.eqiad.wmnet with OS bullseye
[05:13:21] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1085.eqiad.wmnet with OS bullseye completed: - cp1085 (**PASS**) - Downtimed on Icinga/Alertm...
[05:15:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1084.eqiad.wmnet with OS bullseye
[05:15:24] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1084.eqiad.wmnet with OS bullseye completed: - cp1084 (**WARN**) - Downtimed on Icinga/Alertm...
[05:16:21] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1084.eqiad.wmnet
[05:16:28] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1085.eqiad.wmnet
[05:16:58] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[05:17:21] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 117 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:45:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[05:47:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:48:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230203T0700)
[07:07:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 177, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:10:26] (PS1) Marostegui: db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/886221 (https://phabricator.wikimedia.org/T328404)
[07:11:10] (CR) Marostegui: [C: +2] db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/886221 (https://phabricator.wikimedia.org/T328404) (owner: Marostegui)
[07:31:48] SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) a:Jclark-ctr→MoritzMuehlenhoff
[07:45:28] SRE, ops-eqsin, DC-Ops, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (ayounsi) Install servers are being migrated to Bullseye in {T327867} so even though the observed issue is probably not related, it would be better to...
[07:58:56] (PS1) Marostegui: mariadb: Disable notifications dbproxy2* [puppet] - https://gerrit.wikimedia.org/r/886312
[07:59:20] (CR) Marostegui: [C: +2] mariadb: Disable notifications dbproxy2* [puppet] - https://gerrit.wikimedia.org/r/886312 (owner: Marostegui)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230203T0800)
[08:01:12] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[08:03:18] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (MGerlach)
[08:04:59] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (MGerlach) Hey SRE/Analytics/Legal -- we have a new contractor onboard: @AKhatun_WMF . She needs access to HDFS and the stat machines for a new research project. Don't h...
[08:09:23] SRE, LDAP-Access-Requests: Grant Access to ldap/nda for AKhatun - https://phabricator.wikimedia.org/T328734 (MGerlach)
[08:16:35] (PS1) Muehlenhoff: Point proxy in eqsin to install5002 [dns] - https://gerrit.wikimedia.org/r/886313 (https://phabricator.wikimedia.org/T327867)
[08:17:38] (PS1) Muehlenhoff: Assign installserver role to install5002 [puppet] - https://gerrit.wikimedia.org/r/886314 (https://phabricator.wikimedia.org/T327867)
[08:24:06] (CR) Muehlenhoff: [C: +2] Assign installserver role to install5002 [puppet] - https://gerrit.wikimedia.org/r/886314 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[08:30:40] SRE, Infrastructure-Foundations, netops, serviceops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (ayounsi) >> However using Calico's numAllowedLocalASNumbers config knob will be needed, as all the nodes from a given cluster use the same AS#. > > You could als...
[08:34:12] SRE, Infrastructure-Foundations, netops: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (MoritzMuehlenhoff)
[08:45:44] (CR) Filippo Giunchedi: [C: +1] "<3 <3 <3" [puppet] - https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) (owner: Clément Goubert)
[08:45:45] (JobUnavailable) firing: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:47:09] (PS1) Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[08:49:19] (CR) Muehlenhoff: elasticsearch: service depends on tmpfile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: Filippo Giunchedi)
[08:54:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[08:57:15] (PS2) Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[08:57:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[08:58:53] (CR) Jelto: [C: +2] gitlab runners: add dependabot-gitlab & elasticsearch to allowed_images [puppet] -
https://gerrit.wikimedia.org/r/886128 (https://phabricator.wikimedia.org/T326507) (owner: Brennen Bearnes)
[09:02:11] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:02:26] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:04:05] PROBLEM - HTTP on install5002 is CRITICAL: connect to address 103.102.166.12 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers
[09:05:20] ^ that's a new host, still during initial puppet run
[09:05:43] PROBLEM - Squid on install5002 is CRITICAL: connect to address 103.102.166.12 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy
[09:07:33] !log installing modsecurity-crs security updates
[09:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:33] RECOVERY - HTTP on install5002 is OK: HTTP OK: HTTP/1.1 200 OK - 845 bytes in 0.485 second response time https://wikitech.wikimedia.org/wiki/Install_servers
[09:13:44] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:13:55] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:14:13] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1001.eqiad.wmnet
[09:14:33] RECOVERY - Squid on install5002 is OK: TCP OK - 0.329 second response time on 103.102.166.12 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[09:15:45] (JobUnavailable) resolved: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:19:34] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1001.eqiad.wmnet
[09:23:26] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:23:30] (PS2) Muehlenhoff: Point proxy in eqsin to install5002 [dns] - https://gerrit.wikimedia.org/r/886313 (https://phabricator.wikimedia.org/T327867)
[09:23:39] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:24:07] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:24:18] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:26:08] (PS2) Phuedx: Revert "Request high-entropy Sec-CH-UA* client hints" [puppet] - https://gerrit.wikimedia.org/r/886119 (https://phabricator.wikimedia.org/T257893)
[09:29:45] (CR) Filippo Giunchedi: elasticsearch: service depends on tmpfile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: Filippo Giunchedi)
[09:31:19] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:31:31] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:31:35] (CR) Muehlenhoff: [C: +2] Point proxy in eqsin to install5002 [dns] - https://gerrit.wikimedia.org/r/886313 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[09:34:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[09:36:07] RECOVERY - Check systemd state on cp1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:42] (PS1) Muehlenhoff: Point DHCP server for eqsin to install5002 [puppet] - https://gerrit.wikimedia.org/r/886320 (https://phabricator.wikimedia.org/T327867)
[09:36:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of
WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:36:53] (PS1) Alexandros Kosiaris: DNM: Showcase row-level mesh in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523)
[09:37:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:41:27] (CR) Muehlenhoff: [C: +2] Point DHCP server in eqsin to install5002 [homer/public] - https://gerrit.wikimedia.org/r/886053 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[09:44:45] (PS3) Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[09:45:53] (CR) Btullis: [C: +1] "Looks good. Thanks."
[puppet] - https://gerrit.wikimedia.org/r/886013 (owner: Stevemunene)
[09:46:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:48:14] (CR) Vgutierrez: [C: +2] varnish: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[09:49:20] (PS4) Elukey: WIP - Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[09:50:59] (CR) Stevemunene: [C: +2] Bump up mediawiki_history_snapshot to 2023-01 [puppet] - https://gerrit.wikimedia.org/r/886013 (owner: Stevemunene)
[09:51:58] !log installing ruby-rack security updates
[09:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:31] (CR) Muehlenhoff: [C: +2] Point DHCP server for eqsin to install5002 [puppet] - https://gerrit.wikimedia.org/r/886320 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[09:52:40] (PS2) Jbond: puppet-merge: try to decode with erros=ignore on failure [puppet] - https://gerrit.wikimedia.org/r/886006
[09:55:31] (CR) Jbond: [C: +2] puppet-merge: try to decode with erros=ignore on failure [puppet] - https://gerrit.wikimedia.org/r/886006 (owner: Jbond)
[09:58:45] (CR) Jcrespo: [C: +1] puppet-merge: try to decode with erros=ignore on failure [puppet] - https://gerrit.wikimedia.org/r/886006 (owner: Jbond)
[10:00:00] (CR) Slyngshede: [V: +2] Switch to CAS OIDC for login button.
[software/bitu] - https://gerrit.wikimedia.org/r/886010 (owner: Slyngshede)
[10:00:02] (CR) Slyngshede: [V: +2 C: +2] Switch to CAS OIDC for login button. [software/bitu] - https://gerrit.wikimedia.org/r/886010 (owner: Slyngshede)
[10:00:38] (PS1) Vgutierrez: Revert "varnish: support differential privacy" [puppet] - https://gerrit.wikimedia.org/r/886095 (https://phabricator.wikimedia.org/T315676)
[10:01:00] (CR) CI reject: [V: -1] Revert "varnish: support differential privacy" [puppet] - https://gerrit.wikimedia.org/r/886095 (https://phabricator.wikimedia.org/T315676) (owner: Vgutierrez)
[10:01:45] (PS2) Vgutierrez: Revert "varnish: support differential privacy" [puppet] - https://gerrit.wikimedia.org/r/886095 (https://phabricator.wikimedia.org/T315676)
[10:02:38] (CR) Vgutierrez: [C: +2] Revert "varnish: support differential privacy" [puppet] - https://gerrit.wikimedia.org/r/886095 (https://phabricator.wikimedia.org/T315676) (owner: Vgutierrez)
[10:03:02] SRE, Data-Engineering, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (BTullis) Open→Resolved As far as I am aware, we don't actually need to change any files on the puppetmaster(s) when the Maxmind licence is renewed. The...
[10:03:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2008.codfw.wmnet
[10:06:58] !log stevemunene@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[10:07:02] (PS1) Muehlenhoff: Remove installserver role from install5001 [puppet] - https://gerrit.wikimedia.org/r/886326
[10:07:14] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[10:07:22] (CR) CI reject: [V: -1] Remove installserver role from install5001 [puppet] - https://gerrit.wikimedia.org/r/886326 (owner: Muehlenhoff)
[10:07:54] (PS5) Elukey: WIP - Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767)
[10:09:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2008.codfw.wmnet
[10:10:26] (PS2) Muehlenhoff: Remove installserver role from install5001 [puppet] - https://gerrit.wikimedia.org/r/886326 (https://phabricator.wikimedia.org/T327867)
[10:11:15] !log stevemunene@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[10:12:17] (PS1) Ayounsi: Allow AS loops in eqiad staging k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/886328 (https://phabricator.wikimedia.org/T328523)
[10:12:19] (PS1) Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523)
[10:12:37] (PS1) Slyngshede: C:IDM Enable the group creating pipeline.
[puppet] - 10https://gerrit.wikimedia.org/r/886331 [10:13:06] (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install5001 [puppet] - 10https://gerrit.wikimedia.org/r/886326 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [10:13:14] (03CR) 10CI reject: [V: 04-1] Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [10:14:47] (03PS19) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [10:15:59] (03PS6) 10Elukey: WIP - Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [10:19:08] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [10:20:25] (03PS7) 10Elukey: WIP - Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [10:20:55] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [10:22:39] PROBLEM - TFTP service on install5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [10:22:51] PROBLEM - HTTP on install5001 is CRITICAL: connect to address 103.102.166.13 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers [10:22:55] PROBLEM - Squid on install5001 is CRITICAL: connect to address 103.102.166.13 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [10:24:08] ^ monitoring glitch, now unused [10:24:10] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[10:25:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM. Adding Janis too for a comment on the approach." [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [10:28:48] (03CR) 10Jbond: [C: 03+1] "leaving a not here so i don't forget, i looked at using this for the use case in th redfish api. specifically to upload a large file howe" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/885808 (owner: 10Volans) [10:29:19] (03CR) 10Elukey: "Tested the cookbook with a local checkout on cumin2002 + custom config. I am still not able to test Netbox's functionality since I keep ge" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:31:01] (03PS5) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) [10:31:23] (03CR) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [10:31:34] (03PS1) 10Jelto: idp: add gitlab-replica-old to gitlab-replica service_id [puppet] - 10https://gerrit.wikimedia.org/r/886333 (https://phabricator.wikimedia.org/T328635) [10:34:09] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10SLyngshede-WMF) In tests others have seen the following message: NetboxHostNotFoundError [10:34:17] (03PS2) 10Jelto: idp: add gitlab-replica-old to gitlab-replica service_id [puppet] - 10https://gerrit.wikimedia.org/r/886333 (https://phabricator.wikimedia.org/T328635) [10:37:57] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - 
https://phabricator.wikimedia.org/T306661 (10elukey) Tried to test the new ganeti cookbook with a local checkout of the cookbook repo + custom config on cu... [10:42:24] 10SRE, 10Infrastructure-Foundations, 10netops: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (10cmooney) p:05Triage→03Low [10:44:03] !log updating perf on buster hosts [10:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/886333 (https://phabricator.wikimedia.org/T328635) (owner: 10Jelto) [10:52:03] (03PS1) 10Jelto: gitlab: get gitlab url from config while restoring [puppet] - 10https://gerrit.wikimedia.org/r/886336 (https://phabricator.wikimedia.org/T328635) [10:53:36] (03CR) 10Muehlenhoff: elasticsearch: service depends on tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [10:54:22] (03CR) 10Jbond: [C: 03+1] "lgtm, just some minor nits (not tested)" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:54:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] Adding Kavitha Appakayala to icinga [puppet] - 10https://gerrit.wikimedia.org/r/885985 (https://phabricator.wikimedia.org/T327403) (owner: 10Alexandros Kosiaris) [10:56:11] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MoritzMuehlenhoff) [10:58:59] (03CR) 10Jbond: [C: 03+1] "LGTM, can discuss comment on irc" [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [10:59:02] (03PS1) 10Vgutierrez: varnish: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/886337 (https://phabricator.wikimedia.org/T315676) [11:01:18] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39380/console" [puppet] - 10https://gerrit.wikimedia.org/r/886337 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [11:04:10] (03CR) 10Vgutierrez: [V: 03+1] "@bblack please let me know if the vcl_config approach looks good to you, pcc output seems sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/886337 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [11:06:42] (03CR) 10Muehlenhoff: "Looks good, few nits inline." [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [11:07:18] 10SRE, 10Traffic, 10Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) @jcross @Htriedman we had some issues after merging the Differential Privacy CR this morning and I reverted it shortly after. https://gerrit.wikimedia.org/r/c/operations/puppe... [11:09:56] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_comment_id in group0 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885898 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [11:11:04] (03PS2) 10Ayounsi: Allow AS loops in eqiad staging k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/886328 (https://phabricator.wikimedia.org/T328523) [11:11:06] (03PS2) 10Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [11:11:53] (03CR) 10CI reject: [V: 04-1] Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [11:16:58] (03PS3) 10Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [11:17:31] (03CR) 10Ayounsi: Add BGP community to all k8s advertisments (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [11:17:46] (03CR) 10CI reject: [V: 04-1] Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [11:19:12] (03PS20) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [11:22:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:23:03] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [11:27:19] (03PS8) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [11:28:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2009.codfw.wmnet [11:29:37] (03CR) 10Elukey: "Tested with a local checkout of the cookbook repo + the code review for sre.ganeti.reimage + changes on top of it (basically all the code " [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:30:56] (03CR) 10Klausman: [C: 03+1] Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:33:01] (03CR) 10Jbond: [C: 04-1] "-1: mostly looks good just a few comments and one issue" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:35:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2009.codfw.wmnet [11:41:41] 10SRE, 10Infrastructure-Foundations, 10netops: Improve Homer output when Juniper device rejects config - 
https://phabricator.wikimedia.org/T328747 (10cmooney) [11:41:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10ayounsi) [11:50:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2010.codfw.wmnet [11:51:16] (03CR) 10Ayounsi: "Thanks! Great approach!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: 10Alexandros Kosiaris) [11:55:47] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable leveling up features on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886342 (https://phabricator.wikimedia.org/T328757) [11:55:49] (03PS1) 10Kosta Harlan: GrowthExperiments: Disable leveling up features in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757) [11:58:04] !log installing node-qs security updates [11:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2010.codfw.wmnet [12:00:18] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:56] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@598ff3c] (releasing): (no justification provided) [12:01:09] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@598ff3c] (releasing): (no justification provided) (duration: 00m 13s) [12:03:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10cmooney) > Unfortunately, as mentioned in https://blog.ipspace.net/2021/09/graceful-restart.html "BGP Graceful Restart (RFC 4724) looks like it’s been designed by cowboys" as there is... 
[12:05:30] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:28] !log installing node-moment security updates [12:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:28] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (10Clement_Goubert) We are now correctly sending, ingesting and storing slowlogs in ECS format. Next step, dashboards. [12:18:48] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) [12:20:29] (03PS28) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [12:20:33] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) Thank you Clément. @RZamora-WMF is my backup for this task. She will review all steps when I made the... 
[12:21:22] (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [12:21:42] (03PS8) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [12:21:46] (03PS1) 10Jaime Nuche: jenkins: modify sudo rule to allow passing proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) [12:35:26] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/886348/39381/" [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [12:39:21] 10SRE, 10MW-on-K8s, 10serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10Clement_Goubert) I don't think this should be considered a blocker for {T327920} However, we should address it for mw-on-k8s and releases. 
[12:54:26] (03PS3) 10Jcrespo: Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 [12:55:02] (03CR) 10CI reject: [V: 04-1] Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 (owner: 10Jcrespo) [12:55:30] (03PS2) 10Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) [12:59:03] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Clement_Goubert) [13:00:21] (03PS4) 10Jcrespo: Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 [13:00:29] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Clement_Goubert) p:05Triage→03Medium [13:01:32] 10SRE, 10serviceops, 10Patch-For-Review: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Clement_Goubert) I don't think this should be considered a blocker for {T327920}. [13:02:55] (03PS15) 10Slyngshede: C:IDM Add timers and background workers. 
[puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:05:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1087.eqiad.wmnet with OS bullseye [13:05:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1087.eqiad.wmnet with OS bullseye [13:06:29] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:09:14] 10SRE, 10DBA, 10Data-Persistence, 10cloud-services-team, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10jcrespo) CC dbas & cloud- This worries me- while labswiki won't have a lot of queries- there is no way to migrate the user to the other datacente... [13:10:01] 10SRE, 10DBA, 10Data-Persistence, 10cloud-services-team, and 4 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10taavi) [13:11:48] 10SRE, 10Prod-Kubernetes, 10PyBal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) 05Open→03Declined I am gonna tentatively set this as `declined`. The Service IPs announcement path led to nowh... [13:11:54] (03CR) 10Ssingh: [C: 03+2] Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:11:58] (03CR) 10Jelto: "@Moritz can you take a look here. I'm not entirely sure of using wildcards in sudo rules. Do you think that's fine in this context?" 
[puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:15:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10ayounsi) 05Open→03Resolved a:03ayounsi After a discussion with @akosiaris the initial BFD need was for an Anycast experiment and as explained in T238909#8585199 this is not in sc... [13:15:19] 10SRE, 10Prod-Kubernetes, 10PyBal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10ayounsi) [13:16:19] (03PS16) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:18:38] (03CR) 10Slyngshede: C:IDM Add timers and background workers. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:19:17] 10SRE, 10DBA, 10Data-Persistence, 10cloud-services-team, and 4 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Marostegui) Unfortunately we can't do anything with the DB. So we might have cross DC queries unless #cloud-services-team can set up a labweb hos... [13:24:10] 10SRE, 10Datacenter-Switchover: Add step to rsync home dirs on mwmaint hosts before DC switchover - https://phabricator.wikimedia.org/T287303 (10akosiaris) 05Open→03Declined I am closing as `Declined`. We 've taken no action in 1.5 year. Also, @volans is correct that there is quite a bit of complexity in t... 
[13:25:27] (03CR) 10Jaime Nuche: jenkins: modify sudo rule to allow passing proxy configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:25:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage [13:27:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage [13:29:06] (03PS1) 10Hnowlan: WIP: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) [13:29:10] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [13:29:44] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:09] (03PS21) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [13:30:11] (03PS1) 10Jbond: tox.ini: Add dependencies to allow tox to install [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 [13:32:12] (03CR) 10CI reject: [V: 04-1] WIP: helmfile: add device-analytics configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [13:33:55] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [13:34:05] (03CR) 10CI reject: [V: 04-1] tox.ini: Add dependencies to allow tox to install [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 (owner: 10Jbond) [13:40:22] (03PS9) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] 
- 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [13:41:37] (03CR) 10Elukey: "Thanks for the review John!" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [13:44:48] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the Search Platform Team - https://phabricator.wikimedia.org/T328702 (10akosiaris) > Feel free to assign over to me or @RKemper for provisioning if/when the request is approved. No need for explicit approval. The reason we have these tasks... [13:45:13] (03PS10) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [13:46:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] "As the original perp of this script, it failed to serve the purpose I created it for, so +1" [puppet] - 10https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [13:47:36] (03PS11) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [13:48:03] (03CR) 10BBlack: [C: 03+1] varnish: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/886337 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [13:48:14] (03PS1) 10Jbond: puppet_compiler: prepare conftool mediawiki file for dc changeover [puppet] - 10https://gerrit.wikimedia.org/r/886360 (https://phabricator.wikimedia.org/T290665) [13:48:57] (03CR) 10Elukey: "Tested again on cumin1001 from my local deployment in dry-run, all good!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [13:51:10] (03PS1) 10EoghanGaffney: Separate log messages from otrs.Daemon.pl to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) [13:51:30] (03CR) 10CI reject: [V: 04-1] Separate log messages from otrs.Daemon.pl to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [13:51:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1087.eqiad.wmnet with OS bullseye [13:52:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1087.eqiad.wmnet with OS bullseye completed: - cp1087 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [13:55:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [13:55:19] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1087.eqiad.wmnet,service=cdn [13:55:20] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1087.eqiad.wmnet,service=ats-be [13:58:32] (03PS2) 10EoghanGaffney: Separate log messages from otrs.Daemon.pl to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) [13:58:53] (03CR) 10jenkins-bot: Separate log messages from otrs.Daemon.pl to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [14:01:34] (03PS2) 10Jaime Nuche: jenkins: modify sudo rule to allow passing proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) [14:01:36] (03PS9) 10Jaime Nuche: jenkins: enable Scap3 
deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [14:01:38] (03PS3) 10EoghanGaffney: Separate log messages from otrs.Daemon.pl to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) [14:02:26] (03PS1) 10Jbond: configmaster: add conftool-state mediawiki.yaml file to config-master [puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) [14:04:32] (03CR) 10Jelto: [C: 03+1] "looks better now without additional wildcards 👍" [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:08:39] (03PS29) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [14:09:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:10:14] (03CR) 10Jelto: [C: 03+2] jenkins: modify sudo rule to allow passing proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:13:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2011.codfw.wmnet [14:15:24] (03CR) 10Hashar: "Nice trick, very well done ;) 🎉" [puppet] - 10https://gerrit.wikimedia.org/r/886348 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:18:39] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@598ff3c] (releasing): (no justification provided) [14:19:03] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@598ff3c] (releasing): (no justification provided) (duration: 00m 23s) [14:20:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2011.codfw.wmnet [14:24:38] (03CR) 10Elukey: Add 
sre.k8s.upgrade-cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:25:38] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Marostegui) Leaving #data-persistence tag instead of #DBA. We can support #cloud-services-team as much as needed, but we can't really do a... [14:27:24] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:17] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@598ff3c] (releasing): (no justification provided) [14:31:27] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@598ff3c] (releasing): (no justification provided) (duration: 00m 09s) [14:47:27] (03CR) 10Jelto: [C: 03+1] "lgtm. I left one comment in-line suggesting systemds StandardOutput and StandardError (which should be available on vrts hosts). I guess t" [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [14:51:06] (03CR) 10Clément Goubert: [C: 03+1] "Thanks! LGTM since it at least removes a manual step after DC switchover." 
[puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) (owner: 10Jbond) [14:55:03] 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10jbond) p:05Triage→03Medium [14:55:57] 10SRE-tools, 10Discovery-Search, 10Elasticsearch, 10Infrastructure-Foundations, 10Spicerack: elasticsearch spicerack module failes with most recent elastic-curator - https://phabricator.wikimedia.org/T328775 (10bking) [14:56:21] (03PS2) 10Jbond: tox.ini: Add dependencies to allow tox to install [software/spicerack] - 10https://gerrit.wikimedia.org/r/886359 (https://phabricator.wikimedia.org/T328775) [14:58:36] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10jcrespo) >>! In T328768#8585356, @Marostegui wrote: > Leaving #data-persistence tag instead of #DBA. We can support #cloud-services-team a... [14:59:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) Hi @Jclark-ctr would you mind if we try to do some work on this one day next week? We can just start by trying to move two cards a... 
[15:00:37] (03PS22) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [15:02:44] (03PS7) 10Clément Goubert: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [15:04:14] (03PS2) 10Jbond: configmaster: add conftool-state mediawiki.yaml file to config-master [puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) [15:04:41] (03CR) 10CI reject: [V: 04-1] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [15:05:09] (03CR) 10Clément Goubert: "Adding members of data-persistence to reviewers for sanity-check." [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [15:07:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39383/console" [puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) (owner: 10Jbond) [15:07:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10Jclark-ctr) Any day next week except Monday [15:08:17] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Clement_Goubert) I've rebased and implemented one of @Volans recommandation on the CR... 
[15:08:55] (03PS8) 10Clément Goubert: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [15:11:44] (03PS23) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [15:13:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] configmaster: add conftool-state mediawiki.yaml file to config-master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) (owner: 10Jbond) [15:14:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "burn it with fire!" [puppet] - 10https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [15:22:59] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@ec3e0de]: Hotfix disabling skein log collection [15:23:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] configmaster: add conftool-state mediawiki.yaml file to config-master [puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) (owner: 10Jbond) [15:23:14] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@ec3e0de]: Hotfix disabling skein log collection (duration: 00m 15s) [15:25:36] (03CR) 10Filippo Giunchedi: "Neat!" 
[puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) (owner: 10Jbond) [15:26:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] configmaster: add conftool-state mediawiki.yaml file to config-master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886362 (https://phabricator.wikimedia.org/T290665) (owner: 10Jbond) [15:26:31] (03PS1) 10Dreamy Jazz: Always collapse by default the CheckUserHelper on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886370 (https://phabricator.wikimedia.org/T328726) [15:27:46] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:25] (03PS10) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [15:32:27] (03PS1) 10Jaime Nuche: jenkins: hardcode proxy values in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/886372 (https://phabricator.wikimedia.org/T323909) [15:32:29] (03PS1) 10Jaime Nuche: jenkins: remove hardcoded values from sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/886373 (https://phabricator.wikimedia.org/T323909) [15:32:56] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:55] that's me ^ investigating why sync is failing [15:36:07] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) More updates. @Eevans pointed out that we do now have some clients setting the expiry headers, so it was worth checking the state of the expiry queue. I did so with [[ https:... 
[15:42:22] (03PS2) 10Jbond: puppet_compiler: drop static file, we get this from config-master [puppet] - 10https://gerrit.wikimedia.org/r/886360 (https://phabricator.wikimedia.org/T290665) [15:42:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compiler: drop static file, we get this from config-master [puppet] - 10https://gerrit.wikimedia.org/r/886360 (https://phabricator.wikimedia.org/T290665) (owner: 10Jbond) [15:44:59] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39384/console" [puppet] - 10https://gerrit.wikimedia.org/r/886119 (https://phabricator.wikimedia.org/T257893) (owner: 10Phuedx) [15:47:27] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] Revert "Request high-entropy Sec-CH-UA* client hints" [puppet] - 10https://gerrit.wikimedia.org/r/886119 (https://phabricator.wikimedia.org/T257893) (owner: 10Phuedx) [15:47:31] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/886372 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:49:10] (03CR) 10Jelto: [C: 03+2] jenkins: hardcode proxy values in sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/886372 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [15:49:39] (03PS9) 10Bking: wdqs/data-reload.py: validate dump date [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [15:51:08] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@598ff3c] (releasing): test [15:51:35] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@598ff3c] (releasing): test (duration: 00m 26s) [15:54:26] (03CR) 10Herron: opensearch: reverse-proxy access to opensearch API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [15:54:33] (03PS1) 10Ilias Sarantopoulos: httpbb: liftiwing add new API tests [puppet] - 10https://gerrit.wikimedia.org/r/886375 
(https://phabricator.wikimedia.org/T327787) [15:56:28] (03PS2) 10Ilias Sarantopoulos: httpbb: liftiwing add new API tests [puppet] - 10https://gerrit.wikimedia.org/r/886375 (https://phabricator.wikimedia.org/T327787) [15:58:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) [16:01:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) >>! In T318696#8585442, @Jclark-ctr wrote: > Any day next week except Monday Great! Let's book it in for Wednesday next week 9:0... [16:01:38] (03PS2) 10Clément Goubert: configmaster: Remove disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) [16:01:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (10herron) Looping in @KFrancis for NDA confirmation as well [16:01:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) [16:10:18] (03CR) 10Gehel: [C: 04-1] "Minor comments about style and documentation. Otherwise, LGTM." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:14:00] (03CR) 10Jbond: wdqs/data-reload.py: validate dump date (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:15:22] (03CR) 10JHathaway: [C: 03+1] "I think this is the correct approach" [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney) [16:18:09] (03PS10) 10Bking: wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [16:18:29] (03CR) 10Bking: wdqs/data-reload.py: validate dump date (WIP) (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:19:36] (03CR) 10Gehel: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:19:56] 10SRE-Access-Requests: Request for SSH Access - https://phabricator.wikimedia.org/T328787 (10KOfori) [16:23:31] (03PS11) 10Bking: wdqs/data-reload.py: validate dump date [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [16:25:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2012.codfw.wmnet [16:25:21] (03PS2) 10Ollie Shotton: Enable WIP Wikibase REST API routes on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) [16:25:57] (03CR) 10CI reject: [V: 04-1] Enable WIP Wikibase REST API routes on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [16:28:02] (03PS3) 10Ollie Shotton: Enable WIP Wikibase REST API routes on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 
(https://phabricator.wikimedia.org/T326313) [16:29:14] (03CR) 10Ollie Shotton: Enable WIP Wikibase REST API routes on beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [16:32:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2012.codfw.wmnet [16:34:47] (03PS12) 10Bking: wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [16:35:01] (03CR) 10Elukey: [C: 03+2] httpbb: liftiwing add new API tests [puppet] - 10https://gerrit.wikimedia.org/r/886375 (https://phabricator.wikimedia.org/T327787) (owner: 10Ilias Sarantopoulos) [16:36:06] (03CR) 10Bking: wdqs/data-reload.py: validate dump date (WIP) (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:36:28] (03CR) 10CI reject: [V: 04-1] wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:41:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:42:03] (03CR) 10Elukey: [C: 03+1] configmaster: Remove disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [16:44:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:44:27] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudsw1-b1-codfw mgmt IP. 
- cmooney@cumin1001" [16:44:56] (03CR) 10Jbond: [C: 03+1] wdqs/data-reload.py: validate dump date (WIP) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:45:27] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudsw1-b1-codfw mgmt IP. - cmooney@cumin1001" [16:45:27] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:47:21] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1089.eqiad.wmnet with OS bullseye [16:47:28] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1089.eqiad.wmnet with OS bullseye [16:47:35] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1086.eqiad.wmnet with OS bullseye [16:47:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1086.eqiad.wmnet with OS bullseye [16:48:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable WIP Wikibase REST API routes on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [17:09:16] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1086.eqiad.wmnet with reason: host reimage [17:09:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1089.eqiad.wmnet with reason: host reimage [17:12:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1086.eqiad.wmnet with reason: host reimage [17:14:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on cp1089.eqiad.wmnet with reason: host reimage [17:34:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1086.eqiad.wmnet with OS bullseye [17:34:29] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1086.eqiad.wmnet with OS bullseye completed: - cp1086 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [17:34:39] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1086.eqiad.wmnet [17:35:05] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:35:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1088.eqiad.wmnet with OS bullseye [17:35:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1088.eqiad.wmnet with OS bullseye [17:36:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1089.eqiad.wmnet with OS bullseye [17:36:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1089.eqiad.wmnet with OS bullseye completed: - cp1089 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [17:39:38] 10SRE, 10Traffic, 10HTTPS, 10Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (10BCornwall) [17:39:45] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) 05Open→03Stalled a:03BCornwall Wow, seven years! Hello to those still around. :) Who is in charge of shopify/store.wikimedia.org nowadays? 
It would be nice if we cou... [17:39:49] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) p:05Medium→03Low [17:39:49] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet [17:40:18] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:42:49] (03CR) 10EoghanGaffney: Separate log messages from otrs.Daemon.pl to its own log file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [17:43:43] (03PS2) 10BCornwall: Remove ferm rules for Pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/537446 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [17:53:34] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Dzahn) >>! In T128559#8586016, @BCornwall wrote: > Wow, seven years! Hello to those still around. :) > > Who is in charge of shopify/store.wikimedia.org nowadays? It would be nice if... [17:57:03] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1088.eqiad.wmnet with reason: host reimage [17:57:06] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp1088.eqiad.wmnet with reason: host reimage [17:57:52] (03CR) 10Dzahn: [C: 03+1] "tested:) lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/886336 (https://phabricator.wikimedia.org/T328635) (owner: 10Jelto) [18:00:21] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Dzahn) @BCornwall found this "contact: Merchandise@wikimedia.org (response platform), Khansen-ctr@wikimedia.org (Store associate), or Shust@wikimedia.org (Store Manager) " on https:... 
[18:01:56] 10SRE, 10SRE-Access-Requests: Request for SSH Access for kofori - https://phabricator.wikimedia.org/T328787 (10Dzahn) [18:05:00] (03CR) 10Dzahn: "Maybe let's ask Filippo Giunchedi for his opinion on this (he is in observability and on the change you linked to)" [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [18:06:59] (03CR) 10BCornwall: [C: 03+1] Remove ferm rules for Pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/537446 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [18:08:36] (03CR) 10Dzahn: "I think it's fine to go ahead with this, but mostly just because Jesse said +1" [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney) [18:13:52] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39385/console" [puppet] - 10https://gerrit.wikimedia.org/r/537446 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [18:19:20] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1088.eqiad.wmnet with OS bullseye [18:19:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1088.eqiad.wmnet with OS bullseye completed: - cp1088 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[18:20:17] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39386/console" [puppet] - 10https://gerrit.wikimedia.org/r/537446 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [18:23:19] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1088.eqiad.wmnet [18:23:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:23:53] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1090.eqiad.wmnet with OS bullseye [18:23:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1090.eqiad.wmnet with OS bullseye [18:29:12] 10SRE, 10SRE-Access-Requests: Request for SSH Access for kofori - https://phabricator.wikimedia.org/T328787 (10Dzahn) Hi @KOfori I'm assuming this is for global root access on all machines, is that right / the expectation? Or did you have a specific subset of servers in mind? We currently don't have groups for... [18:29:44] 10SRE, 10PyBal, 10Traffic-Icebox: lvs servers report 'Memory allocation problem' on bootup - https://phabricator.wikimedia.org/T82849 (10BCornwall) 05Open→03Resolved a:03BCornwall Thanks to @ema for the patch! We're definitely upgraded by now, so setting this as resolved. ` root@lvs1019:/home/brett# i... [18:34:40] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AKhatun - https://phabricator.wikimedia.org/T328734 (10Dzahn) A minor detail that should not concern you, as the access will be the same, but for clinic duty handling this: This should be "wmf" group, not "nda", I think. based on: "WMF staff and con... 
[18:38:37] 10SRE, 10LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (10Dzahn) a:03Ekalkst Hi @Ekalkst assigning this back to you for now since we need some more input from you. Cheers! [18:39:41] (03CR) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [18:45:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1090.eqiad.wmnet with reason: host reimage [18:49:01] !log Enabling 4x10G channelization for pic 0 QSFP 4 on cr1-codfw [18:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:13] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1090.eqiad.wmnet with reason: host reimage [18:54:53] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:55:59] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) Update: We expect to have more info mid-next week. Thanks, everyone for your patience! [18:56:09] ^ did somebody make netbox changes but not run the cookbook ^ ? [18:58:59] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test what is not synced - dzahn@cumin2002" [18:59:26] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10Browser-Support-Internet-Explorer, 10Upstream: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595 (10BCornwall) 05Open→03Reso... 
[19:00:09] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10Browser-Support-Internet-Explorer, 10Upstream: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595 (10BCornwall) 05Resolved→03... [19:00:11] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "test what is not synced - dzahn@cumin2002" [19:04:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (10KFrancis) @herron I am confirming AKhatun has an NDA on file. Please proceed with the access request. Thanks! [19:08:23] (03PS5) 10Jcrespo: Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 [19:10:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1090.eqiad.wmnet with OS bullseye [19:10:57] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1090.eqiad.wmnet with OS bullseye completed: - cp1090 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[19:16:41] (03PS1) 10WMDE-Fisch: Fix and add mising parser test for maplink with suppressed text="" [extensions/Kartographer] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886105 (https://phabricator.wikimedia.org/T328739) [19:30:37] (03CR) 10CI reject: [V: 04-1] Fix and add mising parser test for maplink with suppressed text="" [extensions/Kartographer] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886105 (https://phabricator.wikimedia.org/T328739) (owner: 10WMDE-Fisch) [19:42:07] (03CR) 10Dzahn: [C: 03+1] admin: add user santhosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) (owner: 10Herron) [19:44:33] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1090.eqiad.wmnet [19:44:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:45:21] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:45:22] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove ferm rules for Pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/537446 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [19:47:38] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Deprecate pybal SSH health checks - https://phabricator.wikimedia.org/T111899 (10BCornwall) Thanks so much @Muehlenhoff for the patch! Now that it's been merged we can finally put this to rest. :) [19:47:53] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "The CI failure is because of a different PHPUnit version and should be ignored." 
[extensions/Kartographer] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886105 (https://phabricator.wikimedia.org/T328739) (owner: 10WMDE-Fisch) [19:48:46] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Deprecate pybal SSH health checks - https://phabricator.wikimedia.org/T111899 (10BCornwall) 05Open→03Resolved a:03BCornwall [19:53:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) I tried to enable the CR uplinks from the new cloudsw but there is a bit of a snag. The CR do... [20:13:34] (03PS1) 10CDanis: Increase NEL policy ttl to one week (from one day) [puppet] - 10https://gerrit.wikimedia.org/r/886401 [20:18:54] (03CR) 10Ssingh: [C: 03+1] Increase NEL policy ttl to one week (from one day) [puppet] - 10https://gerrit.wikimedia.org/r/886401 (owner: 10CDanis) [20:30:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:40:22] (03PS13) 10Bking: wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [20:42:24] (03CR) 10CI reject: [V: 04-1] wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [20:44:19] (03PS14) 10Bking: wdqs/data-reload.py: validate dump date Kafka retention is 30 days, and it takes around 17 days to complete a data reload. Validate that the dump isn't too old before and after the reload. Also, remove unused --reuse-munge flag. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [20:44:38] (03PS15) 10Bking: wdqs/data-reload.py: validate dump date [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [20:46:24] (03CR) 10CI reject: [V: 04-1] wdqs/data-reload.py: validate dump date [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [20:49:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:51:49] (03PS16) 10Bking: wdqs/data-reload.py: validate dump date [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [20:52:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:54:12] (03CR) 10CDanis: [C: 03+2] Increase NEL policy ttl to one week (from one day) [puppet] - 10https://gerrit.wikimedia.org/r/886401 (owner: 10CDanis) [20:55:53] (03CR) 10Bking: [C: 03+2] wdqs/data-reload.py: validate dump date [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [21:00:17] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:02:16] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudsw1-b1-codfw mgmt IP. - cmooney@cumin1001" [21:04:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudsw1-b1-codfw mgmt IP. 
- cmooney@cumin1001" [21:04:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:04:30] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:05:33] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:05:38] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:15:36] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic-Icebox, 10Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188 (10BCornwall) 05Open→03Stalled Hi, @hashar! It's been quite a while but is there still any intention to add the CI integration? [21:18:49] 10SRE, 10PyBal, 10Traffic-Icebox: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650 (10BCornwall) This ticket seems contingent on our usage of the sh scheduler. How relevant is it when we eventually switch to the mh scheduler? (T263797) [21:21:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (10AKhatun_WMF) [21:32:38] 10SRE, 10Traffic-Icebox: ats-be on the text cluster is experiencing broken connections - https://phabricator.wikimedia.org/T236988 (10BCornwall) @Vgutierrez: https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&from=now-30d&to=now&viewPanel=59 shows a lower frequency of broken connections t... [21:36:58] 10SRE, 10Domains, 10Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10CRoslof) a:05BCornwall→03CRoslof Thanks for flagging these domain names. I'll make sure they're in our queue to review for trademark enforcement. We should be able to use the [[ https://en.wikiped... [21:55:51] (03CR) 10Dzahn: "Could anyone review this one-liner?" 
[puppet] - 10https://gerrit.wikimedia.org/r/867714 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [21:58:27] (03CR) 10RLazarus: [C: 03+2] Comments only: Move the test case format documentation to wikitech [software/httpbb] - 10https://gerrit.wikimedia.org/r/886202 (owner: 10RLazarus) [21:59:55] (03Merged) 10jenkins-bot: Comments only: Move the test case format documentation to wikitech [software/httpbb] - 10https://gerrit.wikimedia.org/r/886202 (owner: 10RLazarus) [22:14:56] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10colewhite) [23:08:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:09:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:16:13] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) @ssingh, we can agree that this was a NIC issue, yeah? If so, this can be marked as resolved since upgrading the NIC firmware allowed us t... [23:17:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:18:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring